tBuLi / symfit

Symbolic Fitting; fitting as it should be.
http://symfit.readthedocs.org
MIT License
233 stars 17 forks source link

Parameterizing options #254

Closed egpbos closed 5 years ago

egpbos commented 5 years ago

I have an issue with the parameterization of a plane (same model as in #249). The model is built as follows:

# parameters that characterize the plane (a point x0,y0,z0 and a normal vector a,b,c)
a, b, c, x0, y0, z0 = sf.parameters('a, b, c, x0, y0, z0')
# and the equation that describes the plane
x, y, z = sf.variables('x, y, z')
plane_model = {x: (x0 * a + y0 * b + z0 * c - y * b - z * c) / a}

However, when I now create a noisy plane like this:

# generate points in a plane with some noise in the y direction
xyz = np.array((np.random.random(1000), np.random.normal(0, 0.01, 1000), np.random.random(1000)))

the best fit for a is going to be zero, but I'm dividing by a in the model. Symfit cannot seem to converge with this model.

An quick and dirty solution for this is to reparameterize in terms of y:

plane_model = {y: (x0 * a + y0 * b + z0 * c - x * a - z * c) / b}

In fact, this is the only possible parameterization in this case, because c (the z component of the normal) should also fit to 0.

The problem is then that this doesn't generalize. This way of parameterizing planes forces me to actually build three models and then maybe try one and if it doesn't converge try the others until one of them works. In more complicated models with more parameters and variables, this would create a lot of extra work.

I'm hoping that there is some different way to write down this model. In fact, mathematically, it's quite easy, because the actual formula I'm deriving the above model(s) from is

I tried adding an lhs and rhs variable and then constraining them to be equal, but obviously this doesn't work, since I don't have data for those variables.

Any ideas?

egpbos commented 5 years ago

Using lhs and rhs as parameters doesn't work either. For completeness, what I wanted to do with those two is use them in the model like this:

plane_model = {lhs: x * a + y * b + z * c,
               rhs: x0 * a + y0 * b + z0 * c}

In the variable case it complains about missing data for fitting (missing positional arguments lhs and rhs) and in the parameter case it says "ModelError: Parameter's can not feature in the role of Variable"

egpbos commented 5 years ago

The most naive option I think would be

plane_model = {x * a + y * b + z * c: x0 * a + y0 * b + z0 * c}

I didn't even try this initially, because I assumed it would fail. Indeed it does with exception AttributeError: 'Add' object has no attribute 'name'.

Would it actually be possible to implement this so that it does work? I guess you could in principle give Add objects names, e.g. simply use the names of the additives and put a plus between them.

tBuLi commented 5 years ago

Hi @egpbos, nice to see you decided to use symfit ;).

Can you tell me if I understood your problem correctly, so I can be sure I answer the question correctly.

You have data (x, y, z), which you suspect belong to a plane. The equation of a plane is a x + b y + c z = d, and so you want to find the parameters a, b, c, d? (As I see it, you treat d as d = a x_0 + b y_0 + c z_0 but that seems unnecessarily complicated.)

There are a couple of models you could try, but indeed like you say, it's best to avoid division by zero. My suggestion would therefore be to solve a x + b y + c z - d = 0 instead:

a, b, c, d = sf.parameters('a, b, c, d')
x, y, z, f = sf.variables('x, y, z, f')
plane_model = {f: a * x + b * y + c * z - d}

fit = Fit(plane_model, x=xdata, y=ydata, z=zdata, f=np.zeros_like(zdata))

I might have missed something, but perhaps this will get you in the right direction.

About your toy model, I'm not sure it actually describes a plane, right? Because the noise is independent, but the underlying plane is not. So I think you should generate a plane obeying some known a, b, c, d and then add noise afterwards to get a good test.

As a final remark, the lhs in a model dict always has to be just one variable, hence the name error.

egpbos commented 5 years ago

Ah, adding the f with zeros is an interesting suggestion!

Unfortunately, the fit I get from this is worse than that which I get with the other model. The plane is now slightly tilted, see screenshot:

Schermafbeelding 2019-06-19 om 15 59 02

With the previous method, no such tilt is present, the plain neatly intersects the points:

Schermafbeelding 2019-06-19 om 16 00 48

Any idea how your method could be further improved?

By the way, I used x_0 etc instead of d, because I thought this would make my plane plotting function easier to use. (a,b,c) is the normal vector and (x_0,y_0,z_0) is a point in the plane. In a graphical environment you can make educated guesses for these parameters. Of course, the d parameter can be easily defined in terms of these two vectors, so I could still use the two vectors in a user interface, while using d for calculations. So thanks for that as well :)

Actually, in the above two fits I was stubborn and still used (x_0,y_0,z_0). When I use d instead, the fit turns out perfectly again:

Schermafbeelding 2019-06-19 om 16 50 12

So great, problem solved, thanks!

I do have some slight remaining concerns about what treating f as a noisy variable (from the modeling perspective) means for the quality of the fit as a whole. Correct me if I'm wrong, but if we're adding here an additional, artificial degree of freedom, wouldn't that make the fit harder? How do the fitting procedures cope with this? Will it be content with less precision in the actual true degrees of freedom?

egpbos commented 5 years ago

Because the noise is independent, but the underlying plane is not.

I didn't quite get what you mean by this, by the way. What I'm trying to do here is creating a mock scan of some true perfect plane (with a, b, c = 0, 1, 0 at y = 0) with some scanning noise in the y direction and randomly scattered samples in the other two directions. Are you suggesting I first sample x, y and z regularly in the actual plane (so y = 0 everywhere) and then add noise to all coordinates? For the y coordinate this would give the same result. For the x and z coordinates I guess it would give a slightly different result in two ways:

  1. the underlying grid signal would probably be recoverable statistically, unless maybe the noise sigma is too high and sample size too low;
  2. some samples will be outside the current [0,1] bounds.

Neither would be really relevant to the plane fitting problem, I think, or am I missing something?

In any case, thanks for thinking along!

egpbos commented 5 years ago

By the way (sorry for dragging out this issue :), using the above method, I'm consistently getting a divide by zero warning the first time I do a fit with the model:

/Users/pbos/sw/miniconda3/lib/python3.6/site-packages/symfit/core/fit_results.py:277: RuntimeWarning: divide by zero encountered in double_scalars
  return 1 - SS_res/SS_tot

Anything I can do about this? And I mean anything in the model or the SymFit settings, I don't want to just turn off warnings :)

tBuLi commented 5 years ago

After that explanation I think what you are doing is correct. I was referring to the fact that since the equation of a plane is a x + b y + c z = d, x, y and z are not independent. Once you know two of them, you know the third. I was afraid you were just randomly sampling x, y, and z without using this constraint. But I think you are, since your random samplings for x and z are apparently not noise, but random coordinates which are 100% certain. Only y contains noise.

The degree of freedom question is a very good one, it's worth thinking about but I don't have an answer on the spot. Intuitively I think this does not affect the result since it's just a reordering of the equation, but you are absolutely right that it is something to think about.

And about the warning, I'm not sure there is much you can do about that. Perhaps it is worth creating a seperate issue to fix this, because it's a recurring complaint. Probably R^2 in your fit report is currently -inf or nan or something like that?

And always happy to participate in interesting discussions, so no worries for draging this along ;). I will close it now though, since the original problem has been solved. Feel free to continue the discussion though.

egpbos commented 5 years ago

The degree of freedom question is a very good one, it's worth thinking about but I don't have an answer on the spot. Intuitively I think this does not affect the result since it's just a reordering of the equation, but you are absolutely right that it is something to think about.

Maybe you're right. I'm just not sure how the fitting procedures exactly treat variables vs parameters. In principle, since the amount of parameters doesn't change, and these are the actual things that the fitter should vary (in order to find the best fitting values to the static input data that presumably is the only thing that's used for the variables), it indeed shouldn't matter.

Probably R^2 in your fit report is currently -inf or nan or something like that?

Indeed, it's -inf.

egpbos commented 5 years ago

So, if I read this on the coefficient of determination, I think the problem here is that the "variable" f is not actually conceptually a variable at all, it's a constant, so the "variance of the data" SS_tot is meaningless, because the data is really only a work-around to mimic a constant value and indeed there is no variance in it.

I think the cleanest way out of this would be to create a special Constant class for this usecase. This would also remove the need to fake data, as we did above.

No idea how feasible this is, though.

Note by the way that I am currently also working on RooFit (a package in particle physics toolkit ROOT). This package also distinguishes between Constants, Observables and Parameters (although there parameters can also be set to be constant with an attribute when one wants to fit only a certain set of parameters in a certain prebuilt model... so it's complicated. But still, the concept of always constant numbers seems to make sense in that fitting package, so maybe here as well). Observables map directly to SymFit's Variables, with (observed) data.