pesser / edflow

Reduce boilerplate code for your ML projects. TensorFlow and PyTorch.
https://edflow.readthedocs.io/
MIT License
25 stars 13 forks source link

Numpy check #221

Closed hperrot closed 4 years ago

hperrot commented 4 years ago

This pull request makes it fail loudly if labels are set as lists and not numpy arrays or memmaps. This addresses the warning "Labels must be dict s of numpy arrays and not list s! Otherwise many operations do not work and result in incomprehensible errors." from the dataset_mixin docstring.

The way this is done is by casting labels in the labels dict to a LabelsDict in the DatasetMixin@labels.setter. The LabelsDict is a subclass of dict and asserts in the setitem function that all new leaf nodes are of type np.ndarray or np.memmap.

Asserting this in the DatasetMixin@labels.setter would not be enough since the labels dict is often updated and not set as a whole. When updating the labels dict, the DatasetMixin@labels.setter is not executed anymore and could not assert this then.

One drawback of this method is that this assertion is also not executed if the labels dict itself is nested and contains other mutable objects like dicts or lists and these are updated somehow.

hperrot commented 4 years ago

There is an issue with coveralls, where it is incompatible with coverage 5.0+ https://github.com/coveralls-clients/coveralls-python/issues/203 This is why the test don't complete successfully.

pesser commented 4 years ago

Cool, I really like the idea of giving dynamic feedback if something is used in unexpected ways. But I have reservations about the idea of using (and introducing) a special class for labels. @jhaux also tried something along this line and ultimately we decided against merging it. A new class always adds a lot of (perceived) complexity and can be off-putting. If we need a special type, we need it but let's think about alternatives and what @jhaux says, since he knows the requirements for the labels best.

I wonder if we could just add a functional test somewhere to make sure labels look as expected. For example, run check_dataset(dataset) after instantiating the dataset from the config. This would avoid introducing another class and could be more flexible to handle nested things which is not possible with the proposed solution.