libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0
2.84k stars 178 forks source link

Field name silently cropped at 16 chars #174

Open alcinos opened 2 years ago

alcinos commented 2 years ago

Hi,

Consider the following example:

from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField, FloatField, StringField, IntField

class Dummy:
    def __len__(self):
        return 10
    def __getitem__(self, id):
        return (42,)
ds = Dummy()
writer = DatasetWriter("/tmp/test.beton", {
    'this_is_a_very_long_field': IntField(),
    }, num_workers=2)
writer.from_indexed_dataset(ds)

loader = Loader('/tmp/test.beton', batch_size=2, num_workers=2, order=OrderOption.SEQUENTIAL)
print(list(loader.pipelines.keys())[0]) # prints 'this_is_a_very_l'

It's unclear to me if the 16 chars limit is intended (I haven't found any mention of that in the doc, but I may have missed it). If it's not, well it's a bug, and if it is, perhaps it would be appropriate to add an assert in the writer?

As it stands it's very confusing because no warning is issued, and I spent quite some time debugging an issue where my loading pipeline was not applied. I had something like pipelines = {"this_is_a_very_long_field": my_pipeline} and nothing was happening. I think this is the result of two fixable issues:

  1. As mentioned above, since the field name was actually shortened to "this_is_a_very_l" the pipeline didn't work. Throwing an error when trying to create a field name too long (or supporting it) fixes this issue.
  2. No warning was issued for creating a pipeline for a non-existent field. Adding a check to verify if each field used exists and issuing an error/warning otherwise would fix this issue.

Best

GuillaumeLeclerc commented 2 years ago

Hello,

This is indeed the case. FFCV file formats has a limit on the length of the field. We could definitely warn users when they use a name that is too long. Are you interested in submitting a pull request ?