Cannot provide data for interdependent variables

pckroon commented 5 years ago

I'm working on creating some sadistic testcases for symfit, and I came across the following. Obviously this is a toy example, but that's beside the point.

import numpy as np
from symfit import variables, Parameter, Model, Fit

Y1, Y2, x = variables('Y1, Y2, x')
k = Parameter('k')
model = Model({
    Y1: x,
    Y2: Y1 + k
})
fit = Fit(model, Y1=np.array([1]), Y2=np.array([1]), x=np.array([1]))

Traceback (most recent call last):
  File "model_eval_bug.py", line 10, in <module>
    fit = Fit(model, Y1=np.array([1]), Y2=np.array([1]), x=np.array([1]))
  File "/home/peterkroon/python/gits/mine/symfit/symfit/core/support.py", line 424, in wrapped_func
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/peterkroon/python/gits/mine/symfit/symfit/core/fit.py", line 1259, in __init__
    super(Fit, self).__init__(model, *ordered_data, **named_data)
  File "/home/peterkroon/python/gits/mine/symfit/symfit/core/support.py", line 424, in wrapped_func
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/peterkroon/python/gits/mine/symfit/symfit/core/fit.py", line 950, in __init__
    raise err
  File "/home/peterkroon/python/gits/mine/symfit/symfit/core/fit.py", line 944, in __init__
    bound_arguments = signature.bind(*ordered_data, **named_data)
  File "/usr/lib/python3.6/inspect.py", line 2997, in bind
    return args[0]._bind(args[1:], kwargs)
  File "/usr/lib/python3.6/inspect.py", line 2988, in _bind
    arg=next(iter(kwargs))))
TypeError: got an unexpected keyword argument 'Y1'

In this case Y1 is an interdependent variable, and I cannot provide data for it. What does it mean to provide data for interdependent variables? Is there a fundamental reason it's not possible? The only one I can come up with is that the errors become weird, and that because of that the least squares can no longer be evaluated.

tBuLi commented 5 years ago

This is not possible by definition because, as you say, what does it mean to provide data to an intermediate variable? They are defined by the fact that they are obtained by calculation, not from data directly.

If it is not done already, this should perhaps be stated clearly in the documentation, and a test based on your example should be added to make sure this does not change in the future. Possibly this exception could be replaced by a custom one which helps the user understand the problem better.
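A rough sketch of what such a custom exception could look like follows. Everything in it is hypothetical: `InterdependentDataError`, `check_data_names`, and the representation of a model as a mapping from each variable name to the names its expression uses are illustrative assumptions, not symfit's API.

```python
class InterdependentDataError(TypeError):
    """Hypothetical error for data passed to a calculated variable."""


def check_data_names(model, data_names):
    """Raise if data is supplied for a variable other components depend on.

    `model` maps each left-hand-side name to the set of names used in its
    expression. This is an assumed stand-in for a real model object.
    """
    used = set().union(*model.values())
    # Left-hand-side names that also appear in another expression.
    interdependent = used & set(model)
    offending = sorted(interdependent & set(data_names))
    if offending:
        raise InterdependentDataError(
            "Cannot provide data for interdependent variable(s) "
            + ", ".join(offending)
            + ": they are calculated from the model, not measured."
        )


# Toy model from the first post: Y1 = x, Y2 = Y1 + k.
toy_model = {"Y1": {"x"}, "Y2": {"Y1", "k"}}
check_data_names(toy_model, ["Y2", "x"])  # accepted silently
```

Subclassing `TypeError` here is only one option; it would keep existing code that catches the current `TypeError` working.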

pckroon commented 5 years ago

I think there are experiments fitting the model: {y1: f(x; a), y2: g(y1, z; b)} where both y1 and y2 can be measured. I don't have explicit examples of course, since that would be convenient.

tBuLi commented 5 years ago

I think I have even done fits like that myself, but then the role of y1 in those equations is not exactly the same. Because I take it that here you want both to fit y1: f(x; a) using data and, simultaneously, to use the calculated y1 in y2: g(y1, z; b)?

It is always possible to introduce an extra variable to get around this: {y1_calc: f(x; a), y1: y1_calc, y2: g(y1_calc, z; b)}. This model will have both y1 and y2 as dependent variables again.
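The rewrite can be sketched on the structure of the model alone. The snippet below is not symfit code: it assumes a model is represented as a mapping from each variable name to the set of names its expression uses, and `measure_interdependent` is a hypothetical helper.

```python
def measure_interdependent(model, name, suffix="_calc"):
    """Return a new model in which `name` can receive data.

    The calculated quantity is renamed to `name + suffix`, every expression
    that used `name` now uses the renamed one, and `name` becomes a plain
    copy of it.
    """
    calc = name + suffix
    rewritten = {}
    for lhs, deps in model.items():
        new_deps = {calc if d == name else d for d in deps}
        rewritten[calc if lhs == name else lhs] = new_deps
    rewritten[name] = {calc}
    return rewritten


# {y1: f(x; a), y2: g(y1, z; b)}, keeping only which names each side uses.
original = {"y1": {"x", "a"}, "y2": {"y1", "z", "b"}}
workaround = measure_interdependent(original, "y1")

# Names that no other expression uses are the ones that can take data.
used = set().union(*workaround.values())
end_nodes = sorted(lhs for lhs in workaround if lhs not in used)
print(end_nodes)  # ['y1', 'y2']
```

After the rewrite only `y1_calc` is used by other expressions, so both `y1` and `y2` are free to receive data.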

pckroon commented 5 years ago

Yes. I'm looking for parameters a and b. Maybe the example is more interesting when b = a: {y1: f(x; a), y2: g(y1, z; a)}. Having to create a dummy variable to appease the software grates on me though, especially since I managed to measure both y1 and y2 in the lab.

tBuLi commented 5 years ago

I personally don't see that as a problem, because explicit is better than implicit. The problem with allowing data for interdependent variables would be that in your last example,

model_dict = {y1: f(x; a), y2: g(y1, z; a)}
fit = Fit(model_dict, y1=y1data, y2=y2data)

would do something quite different from

model_dict = {y1: f(x; a), y2: g(y1, z; a)}
fit = Fit(model_dict, y2=y2data)

but both would run without error. I think that could cause some very frustrating bugs. I'd much rather see that the first just throws an error, so people can debug more effectively.
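To make the difference concrete, here is a small sketch that does not use symfit. It assumes unweighted least squares, a toy choice of f(x; a) = a*x and g(y1, z; a) = y1 + a*z, and made-up data; the grid search is only there to keep the example free of dependencies.

```python
def f(x, a):
    """Toy stand-in for y1 = f(x; a)."""
    return a * x


def g(y1, z, a):
    """Toy stand-in for y2 = g(y1, z; a)."""
    return y1 + a * z


def sum_of_squares(a, x, z, y2_data, y1_data=None):
    """Unweighted squared residuals; y1 residuals are added only if given."""
    y1_calc = [f(xi, a) for xi in x]
    total = sum((d - g(c, zi, a)) ** 2 for d, c, zi in zip(y2_data, y1_calc, z))
    if y1_data is not None:
        total += sum((d - c) ** 2 for d, c in zip(y1_data, y1_calc))
    return total


def best_on_grid(objective, lo=0.0, hi=5.0, steps=5001):
    """Brute-force minimizer, chosen only to avoid extra dependencies."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=objective)


x, z = [1.0, 2.0], [1.0, 1.0]
y1_data = [3.0, 6.0]  # made-up measurements of y1
y2_data = [4.0, 6.0]  # made-up measurements of y2

a_y2_only = best_on_grid(lambda a: sum_of_squares(a, x, z, y2_data))
a_both = best_on_grid(lambda a: sum_of_squares(a, x, z, y2_data, y1_data))
print(a_y2_only, a_both)
```

With these numbers the y2-only objective is minimized at a = 2, while adding the y1 residuals pulls the estimate up to roughly 2.28. Both calls succeed, which is exactly the silent change in behaviour described above.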

Put differently, the interdependent variables are not end nodes of the graph, so you can't touch them.
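The graph picture can be sketched with the standard library's `graphlib` (Python 3.9+). This is an illustration only, not how symfit stores a model: the dictionary below is an assumed representation in which each name maps to the names its expression needs.

```python
from graphlib import TopologicalSorter

# Each name maps to the names its expression needs (its predecessors).
graph = {"y1": {"x", "a"}, "y2": {"y1", "z", "a"}}

# A valid evaluation order: every name appears after everything it needs.
order = list(TopologicalSorter(graph).static_order())

# End nodes have no successors: nothing else is computed from them.
needed = set().union(*graph.values())
end_nodes = [name for name in order if name not in needed]
# Names on a left-hand side that something else still needs.
interior = [name for name in order if name in graph and name in needed]

print(end_nodes)  # ['y2']
print(interior)   # ['y1']
```

Here `y1` sits in the interior of the graph because `y2` is computed from it, which is why it cannot take data directly.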

pckroon commented 5 years ago

Whatever decision we go for, this needs to be in the documentation with big flashy letters (like the rest). I don't even really mind your examples doing different things; the input is different as well, after all. Garbage in == garbage out. More interesting, if we take the following examples, what would the objectives be?

model = Model({y1: f(x1; a), y2: g(x2, y1; b)})
fit1 = Fit(model, x1=x1_data, x2=x2_data, y1=y1_data, y2=y2_data)  # 1
fit2 = Fit(model, x1=x1_data, x2=x2_data, y1=y1_data)  # 2
fit3 = Fit(model, x1=x1_data, x2=x2_data, y2=y2_data)  # 3
fit4 = Fit(model, x1=x1_data, x2=x2_data)  # 4

fit1 would be a least-squares fit, and fit4 a minimization of y2 (?). I feel like fit3 should also be a least-squares fit. I'm not sure what fit2 should be; probably an error, since it's neither a minimization nor an actual fit.

tBuLi / symfit

Cannot provide data for interdependent variables #270