kiudee / cs-ranking

Context-sensitive ranking and choice in Python with PyTorch
https://cs-ranking.readthedocs.io
Apache License 2.0

Support for variably sized inputs #168

Open kiudee opened 4 years ago

kiudee commented 4 years ago

cs-ranking currently requires the user to provide a fixed input size. There are different tricks for handling variably sized inputs (e.g. padding to the maximum length), but all of them come with their own trade-offs.

Approaches like nested tensors could be useful in solving this issue more elegantly.
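
For reference, a minimal sketch of what this could look like with PyTorch's (still prototype) torch.nested API; the data here is made up for illustration:

import torch

# Two instances with different numbers of objects (3 vs. 5),
# each object described by 4 features.
instances = [torch.randn(3, 4), torch.randn(5, 4)]

# A nested tensor holds the ragged batch without padding.
batch = torch.nested.nested_tensor(instances)

# Many ops are still limited on nested tensors; one escape hatch
# is converting to a zero-padded dense tensor when needed.
padded = torch.nested.to_padded_tensor(batch, 0.0)  # shape: (2, 5, 4)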

ddelange commented 3 years ago

Hi @kiudee,

The docstring of fit() for the X input: https://github.com/kiudee/cs-ranking/blob/f266087244ad446448a76b02075514b1136bf270/csrank/core/fate_network.py#L470-L471 and the dict check in _fit(): https://github.com/kiudee/cs-ranking/blob/f266087244ad446448a76b02075514b1136bf270/csrank/core/fate_network.py#L313

suggest that there is already some kind of support for instances containing different numbers of objects.

But then, on the first line of fit(), the X input is assumed to be a fixed-size numpy array:

https://github.com/kiudee/cs-ranking/blob/f266087244ad446448a76b02075514b1136bf270/csrank/core/fate_network.py#L508

Could you elaborate on the current state of things?

Or, more concretely:

padding to maximum length

Here, do you mean simply padding both X and Y with np.zeros? What would be the trade-off you mentioned there? Are there any other alternatives?

kiudee commented 3 years ago

Hey @ddelange, we are currently in the process of migrating the complete code base to PyTorch, which is why progress on the issue front is slow at the moment.

Regarding fit(), it is true that it no longer supports dict input. _fit() expects a dict of the form:

{
    3: np.array(...),  # shape: (n_instances_with_3_objects, 3, n_features) 
    4: np.array(...),  # shape: (n_instances_with_4_objects, 4, n_features)
    ...
}

That way each np.array(...) can contain multiple instances of the same size. The _fit() method then trains on each size separately (with shared weights) and updates the weights proportionally to the number of instances present for the given set size. So a workaround could be to use _fit() directly, but the dict support has not been tested in a while.
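
For anyone wanting to try that workaround, here is a minimal sketch (untested against the current code, with made-up ragged data) of grouping instances into the dict format _fit() expects:

from collections import defaultdict

import numpy as np

# Hypothetical ragged data: one (n_objects, n_features) array and one
# label array per instance, where n_objects varies between instances.
X_ragged = [np.random.rand(3, 2), np.random.rand(4, 2), np.random.rand(3, 2)]
Y_ragged = [np.array([0, 2, 1]), np.array([1, 3, 0, 2]), np.array([2, 0, 1])]

X_dict, Y_dict = defaultdict(list), defaultdict(list)
for x, y in zip(X_ragged, Y_ragged):
    X_dict[x.shape[0]].append(x)
    Y_dict[y.shape[0]].append(y)

# Stack each bucket, e.g. X_dict becomes {3: (2, 3, 2) array, 4: (1, 4, 2) array}.
X_dict = {n: np.stack(arrays) for n, arrays in X_dict.items()}
Y_dict = {n: np.stack(arrays) for n, arrays in Y_dict.items()}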

Since the line https://github.com/kiudee/cs-ranking/blob/f266087244ad446448a76b02075514b1136bf270/csrank/core/fate_network.py#L508 is only ever used for the fixed-size representation, it could be as simple as inserting an if there:

# Only unpack shapes for the dense, fixed-size representation;
# dict inputs already encode the object counts in their keys.
if not isinstance(X, dict):
    _n_instances, self.n_objects_fit_, self.n_object_features_fit_ = X.shape

@timokau what do you think?


padding to maximum length

Here, do you mean simply padding both X and Y with np.zeros? What would be the trade-off you mentioned there? Are there any other alternatives?

Yes, basically you would determine the maximum number of objects you want to input, let's call it n_objects_max, and then construct an array of shape (n_instances, n_objects_max, n_features), which you initialize with zeros. For each instance you then fill in the corresponding number of objects. You can do the same for Y. One trade-off is of course running time and memory, especially if the number of objects is highly variable. There it is useful to look at how many instances with many objects there really are and possibly discard those sizes which occur rarely. Another problem could be that the "zero objects" affect the model fit in some way, especially if you standardize the inputs.
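
As an illustration (a sketch with made-up ragged inputs, not cs-ranking API), the padding could look like this:

import numpy as np

X_ragged = [np.random.rand(3, 2), np.random.rand(5, 2)]  # 3 and 5 objects
Y_ragged = [np.array([2, 0, 1]), np.array([4, 1, 0, 3, 2])]

n_objects_max = max(x.shape[0] for x in X_ragged)
n_features = X_ragged[0].shape[1]

X = np.zeros((len(X_ragged), n_objects_max, n_features))
Y = np.zeros((len(Y_ragged), n_objects_max), dtype=int)
for i, (x, y) in enumerate(zip(X_ragged, Y_ragged)):
    X[i, : x.shape[0]] = x  # rows beyond x.shape[0] stay zero ("zero objects")
    Y[i, : y.shape[0]] = y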

timokau commented 3 years ago

@timokau what do you think?

I'm not sure it would be quite that simple. For example, the _construct_models function in fate_network.py uses self.n_object_features_fit_ and is also called for variably sized inputs. That is not a problem, since the number of features remains constant anyway, but it would still need to be initialized. There are probably more cases like this in the code base. Supporting the "train separately and merge weights" approach again would need a bit of work and testing.