huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.31k stars 2.7k forks source link

Fix key error in webdataset #7144

Closed ragavsachdeva closed 2 months ago

ragavsachdeva commented 2 months ago

I was running into

example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
KeyError: 'png'

The issue is that a filename may have multiple "." e.g. 22.05.png. Changing split to rsplit fixes it.

Related https://github.com/huggingface/datasets/issues/6880

lhoestq commented 2 months ago

hi ! What version of datasets are you using ? Is this issue also happening with datasets==3.0.0 ? Asking because we made sure to replicate the official webdataset logic, which is to use the latest dot as separator between the sample base name and the key

ragavsachdeva commented 2 months ago

Hi, yes this is still a problem on datasets==3.0.0.

I was using datasets=2.20.0 and in that version you get the key error.

I just upgraded to datasets==3.0.0 and in this version, you do not get a key error because it sets all keys to none by default in _generate_examples function:

if field_name not in example:
    example[field_name] = None

However, the behaviour is still incorrect. This if condition is triggered because the filename is not split properly and it returns the data as None when it shouldn't.

we made sure to replicate the official webdataset logic, which is to use the latest dot as separator

Ah, but that's not what split(, 1) does though. This is exactly why I'm suggesting to use rsplit instead. In general, using rsplit should not be a breaking change I believe.

albertvillanova commented 2 months ago

Hi @ragavsachdeva,

We already had this discussion in the issue you have linked:

However, we decided not to implement this feature because it is NOT aligned with the behavior of the webdataset library:

The prefix of a file is all directory components of the file plus the file name component up to the first “.” in the file name.


In [1]: import webdataset as wds

In [2]: wds.tariterators.base_plus_ext("22.05.png") Out[2]: ('22', '05.png')

ragavsachdeva commented 2 months ago

Ah, my apologies I missed https://github.com/huggingface/datasets/pull/6888 (clearly didn't do my due diligence). It's such a weird convention to have though. My keys are /some/path/22.0/1.1.png and it splits them at /some/path/22 and .0/1.1.png(!) I'm okay with this PR not being merged though. Thanks for your time.

lhoestq commented 2 months ago

Actually datasets is not behaving correctly in this case and should not split as .0/1.1.png - even webdataset handles this correctly via their regex ^((?:.*/|)[^.]+)[.]([^/]*)$ in wds.tariterators.base_plus_ext here:

https://github.com/webdataset/webdataset/blob/87bd5aa41602d57f070f65a670893ee625702f2f/webdataset/tariterators.py#L36

ragavsachdeva commented 2 months ago

Oh.. the intention with that regex is to capture "multi-part" extensions e.g. .tar.gz. Makes sense. So rsplit isn't the solution then and neither is split. This expression makes so much more sense. Nice find! I'm assuming you'll add a patch?

albertvillanova commented 2 months ago

Issue addressed by: