Open llStringll opened 1 year ago
PS - the docs definitely need work on 'special cases': not explanations, just usage examples of functions under a mix of special cases, such as a custom dataset builder + an iterable dataset for large data + a dynamically applied .map().
Describe the bug
All the examples throughout the Hugging Face datasets docs use map-style Dataset objects, not IterableDataset objects. The two may have been in sync at some point, but for datasets version >=2.9.0 the code behaves very differently from what the docs describe. I need to .map() a transform over images in an iterable dataset that was built with a custom dataset builder config. This works very well with map-style datasets, but .map() fails on an IterableDataset: the transform function raises a KeyError because the "pixel_values" key is not found in the examples object/dict passed to it. The same transform works fine in map style, even batched. In iterable style, the object/dict passed to the callable given to .map() is completely different from what all the examples show. Please look into this. Thank you
My databuilder class is inherited as such:
and I load it inside my trainer script as such:
ds = load_dataset("/tmp/DonutDS/dataset/", split="train", streaming=True)  # iterable dataset, where .map() fails
or also as:
ds = load_from_disk('/tmp/DonutDS/dataset/')  # map-style dataset
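A minimal sketch of how the keys reaching the transform could be checked, to narrow down where the KeyError comes from. The function name `debug_transform`, the "image" column name, and the placeholder preprocessing are assumptions for illustration, not taken from the actual builder. It runs on a plain dict standing in for one batch, so it does not need the library installed.

```python
from typing import Any, Dict, List


def debug_transform(batch: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Hypothetical batched transform that fails early and lists the columns it got."""
    # "image" is an assumed column name; replace it with the builder's real one.
    if "image" not in batch:
        raise KeyError(f"expected 'image', got columns: {sorted(batch)}")
    # Placeholder for the real image preprocessing step.
    batch["pixel_values"] = [("processed", img) for img in batch["image"]]
    return batch


# Plain-dict stand-in for one batch: a dict mapping column names to lists.
sample_batch = {"image": ["img0", "img1"], "text": ["a", "b"]}
result = debug_transform(dict(sample_batch))
print(sorted(result))

# With the library (not run here), the same function would be passed as:
# ds = load_dataset("/tmp/DonutDS/dataset/", split="train", streaming=True)
# ds = ds.map(debug_transform, batched=True)
# print(next(iter(ds)).keys())
```

If the KeyError message lists columns other than the expected ones, that would suggest the iterable dataset yields different keys than the map-style one before the transform is applied.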
Thank you to the team for having such a great library, and for this bug fix in advance!
Steps to reproduce the bug
The config above should be enough to reproduce the bug.
Expected behavior
.map() should behave consistently between map-style and iterable-style datasets, or at least the docs should cover iterable-style behaviour with examples. As they stand, I honestly cannot see how the docs help with this case.
Environment info
datasets==2.9.0 transformers==4.26.0