Closed ragavsachdeva closed 2 months ago
Hi! What version of datasets are you using? Is this issue also happening with datasets==3.0.0?
Asking because we made sure to replicate the official webdataset logic, which is to use the latest dot as separator between the sample base name and the key
Hi, yes, this is still a problem on datasets==3.0.0.
I was using datasets==2.20.0, and in that version you get the key error.
I just upgraded to datasets==3.0.0, and in this version you do not get a key error because it sets all keys to None by default in the _generate_examples function:
    if field_name not in example:
        example[field_name] = None
However, the behaviour is still incorrect. This if condition is triggered because the filename is not split properly, and it returns the data as None when it shouldn't.
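To illustrate the silent failure described above, here is a minimal sketch. It is not the actual datasets code: `build_example`, its parameters, and the `__key__` entry are hypothetical names used only for this illustration. It assumes the filename is split at the first dot and that missing fields are then defaulted to None.

```python
def build_example(filename, payload, expected_fields):
    """Hypothetical helper mimicking the behaviour described in this thread."""
    # Split at the FIRST dot, so "22.05.png" becomes key "22", field "05.png".
    key, field_name = filename.split(".", 1)
    example = {"__key__": key, field_name: payload}
    # Default every expected field that was not found to None.
    for name in expected_fields:
        if name not in example:
            example[name] = None
    return example


example = build_example("22.05.png", b"image-bytes", ["png"])
print(example["png"])     # None, even though the file held image data
print(example["05.png"])  # the payload ended up under an unexpected field name
```

No exception is raised, so the mis-split only shows up as unexpectedly empty data.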
> we made sure to replicate the official webdataset logic, which is to use the latest dot as separator

Ah, but that's not what split(".", 1) does. This is exactly why I'm suggesting using rsplit instead. In general, I believe using rsplit should not be a breaking change.
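For reference, a quick sketch of how the two standard string methods differ on the filename from this issue (the variable names are just for illustration):

```python
filename = "22.05.png"

# split with maxsplit=1 cuts at the FIRST dot.
first_dot = filename.split(".", 1)

# rsplit with maxsplit=1 cuts at the LAST dot.
last_dot = filename.rsplit(".", 1)

print(first_dot)  # ['22', '05.png']
print(last_dot)   # ['22.05', 'png']
```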
Hi @ragavsachdeva,
We already had this discussion in the issue you have linked:
However, we decided not to implement this feature because it is NOT aligned with the behavior of the webdataset library:
The prefix of a file is all directory components of the file plus the file name component up to the first “.” in the file name.
In [1]: import webdataset as wds
In [2]: wds.tariterators.base_plus_ext("22.05.png")
Out[2]: ('22', '05.png')
Ah, my apologies, I missed https://github.com/huggingface/datasets/pull/6888 (clearly didn't do my due diligence). It's such a weird convention to have though. My keys are /some/path/22.0/1.1.png and it splits them at /some/path/22 and .0/1.1.png (!) I'm okay with this PR not being merged though. Thanks for your time.
Actually, datasets is not behaving correctly in this case and should not split as .0/1.1.png - even webdataset handles this correctly via their regex ^((?:.*/|)[^.]+)[.]([^/]*)$ in wds.tariterators.base_plus_ext here:
Oh.. the intention with that regex is to capture "multi-part" extensions, e.g. .tar.gz. Makes sense. So rsplit isn't the solution then, and neither is split. This expression makes so much more sense. Nice find! I'm assuming you'll add a patch?
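A standalone sketch of what the regex quoted above does, using only Python's re module. The `base_plus_ext` function here is a local re-implementation for illustration (an assumption about the behaviour, not the library's source): the base is everything up to the first dot in the last path component, and the rest is the extension.

```python
import re

# Regex as quoted in this thread.
BASE_PLUS_EXT = re.compile(r"^((?:.*/|)[^.]+)[.]([^/]*)$")


def base_plus_ext(path):
    """Split a path into (base, extension) at the first dot of the file name."""
    match = BASE_PLUS_EXT.match(path)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


print(base_plus_ext("22.05.png"))                # ('22', '05.png')
print(base_plus_ext("/some/path/22.0/1.1.png"))  # ('/some/path/22.0/1', '1.png')
print(base_plus_ext("archive.tar.gz"))           # ('archive', 'tar.gz')

# A plain split on the whole path cuts inside a directory name instead.
print("/some/path/22.0/1.1.png".split(".", 1))   # ['/some/path/22', '0/1.1.png']
```

Dots in directory components are ignored because the extension group cannot contain a slash, which is what neither split nor rsplit on the full path can guarantee.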
Issue addressed by:
I was running into
The issue is that a filename may have multiple ".", e.g. 22.05.png. Changing split to rsplit fixes it.
Related: https://github.com/huggingface/datasets/issues/6880