huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Behaviour difference between datasets.map and IterableDatasets.map #5870

Open llStringll opened 1 year ago

llStringll commented 1 year ago

Describe the bug

All the examples throughout the huggingface datasets docs refer to Dataset objects, not IterableDataset objects. The two may have been in sync at some point, but the behaviour of datasets >=2.9.0 no longer matches the docs. I need to .map() a transform over images in an iterable dataset built from a custom databuilder config. This works well on the map-style dataset, but .map() fails on the IterableDataset: the examples object/dict passed into the transform function has no "pixel_values" key, so the lookup raises a KeyError, even though the same function works fine with the map-style dataset, including in batched mode. In the iterable case, the object/dict passed into the callable given to .map() is completely different from what all the examples show. Please look into this. Thank you
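
To make the mismatch concrete, here is a minimal stand-in for my transform (the name and the rescaling step are placeholders, not my actual code), written the way the docs' batched .map() examples suggest:

import numpy as np

def rescale_pixels(examples):
    # per the docs, with batched=True `examples` should be a dict of lists, e.g.
    # {"labels": [...], "pixel_values": [...], "image_s3_path": [...]}
    examples["pixel_values"] = [np.asarray(pv, dtype=np.float32) / 255.0 for pv in examples["pixel_values"]]
    return examples

On the map-style dataset this runs as expected; on the iterable dataset the examples["pixel_values"] lookup is exactly where the KeyError is raised.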

My databuilder class is subclassed as follows:

import os

import datasets
import numpy as np
from PIL import Image

# db, fetch_file, _DESCRIPTION and self.config.processor are defined elsewhere in my code

def _info(self):
    print ("Config: ",self.config.__dict__.keys())
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features(
            {
                "labels": datasets.Sequence(datasets.Value("uint16")),
                # "labels_name": datasets.Value("string"),
                # "pixel_values": datasets.Array3D(shape=(3, 1280, 960), dtype="float32"),
                "pixel_values": datasets.Array3D(shape=(1280, 960, 3), dtype="uint8"),
                "image_s3_path": datasets.Value("string"),
            }
        ),
        supervised_keys=None,
        homepage="none",
        citation="",
    )

def _split_generators(self, dl_manager):
    records_train = list(db.mini_set.find({'split':'train'},{'image_s3_path':1, 'ocwen_template_name':1}))[:10000]
    records_val = list(db.mini_set.find({'split':'val'},{'image_s3_path':1, 'ocwen_template_name':1}))[:1000]
    # print (len(records),self.config.num_shards)
    # shard_size_train = len(records_train)//self.config.num_shards
    # sharded_records_train = [records_train[i:i+shard_size_train] for i in range(0,len(records_train),shard_size_train)]
    # shard_size_val = len(records_val)//self.config.num_shards
    # sharded_records_val = [records_val[i:i+shard_size_val] for i in range(0,len(records_val),shard_size_val)]
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN, gen_kwargs={"records":records_train} # passing list of records, for sharding to take over
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION, gen_kwargs={"records":records_val} # passing list of records, for sharding to take over
        ),
    ]

def _generate_examples(self, records):
    # print ("Generating examples for [{}] shards".format(len(shards)))
    # initiate_db_connection()
    # records = list(db.mini_set.find({'split':split},{'image_s3_path':1, 'ocwen_template_name':1}))[:10]
    id_ = 0
    # for records in shards:
    for i,rec in enumerate(records):
        img_local_path = fetch_file(rec['image_s3_path'],self.config.buffer_dir)
        # t = self.config.processor(Image.open(img_local_path), random_padding=True, return_tensors="np").pixel_values.squeeze()
        # print (t.shape, type(t),type(t[0][0][0]))
        # sys.exit()
        pvs = np.array(Image.open(img_local_path).resize((1280,960))) # PIL Image size is (w, h), so resize takes (w, h); np.array of the image is (h, w, c)
        # pvs = self.config.processor(Image.open(img_local_path), random_padding=True, return_tensors="np").pixel_values.astype(np.float16).squeeze()
        # print (type(pvs[0][0][0]))
        lblids = self.config.processor.tokenizer('<s_class>'+rec['ocwen_template_name']+'</s_class>'+'</s>', add_special_tokens=False, padding=False, truncation=False, return_tensors="np")["input_ids"].squeeze(0)  # take padding later, as per batch collating
        # print (len(lblids),type(lblids[0]))
        # print (type(pvs),pvs.shape,type(pvs[0][0][0]), type(lblids))
        yield id_, {"labels":lblids,"pixel_values":pvs,"image_s3_path":rec['image_s3_path']}
        id_+=1
        os.remove(img_local_path)
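
As a quick sanity check of the feature declaration, a dummy example with placeholder values can be run through the declared Features (a rough sketch; Features.encode_example only exercises the feature encoding, it is not a full validation of shapes or dtypes):

import datasets
import numpy as np

features = datasets.Features(
    {
        "labels": datasets.Sequence(datasets.Value("uint16")),
        "pixel_values": datasets.Array3D(shape=(1280, 960, 3), dtype="uint8"),
        "image_s3_path": datasets.Value("string"),
    }
)

dummy = {
    "labels": [1, 2, 3],
    "pixel_values": np.zeros((1280, 960, 3), dtype="uint8"),
    "image_s3_path": "s3://bucket/some_image.png",
}
print(features.encode_example(dummy).keys())  # should print the three declared keys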

I then load it inside my trainer script either as a streaming (iterable) dataset, where .map() fails:

ds = load_dataset("/tmp/DonutDS/dataset/", split="train", streaming=True)  # iterable dataset, where .map() fails

or as a map-style dataset:

ds = load_from_disk('/tmp/DonutDS/dataset/')  # map-style dataset
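
For completeness, this is roughly how I compare the two paths (a trimmed-down sketch of my trainer-side code, reusing the placeholder transform from above; note that the iterable .map() is lazy, so nothing runs until the stream is iterated):

from datasets import load_dataset, load_from_disk

ds_map = load_from_disk("/tmp/DonutDS/dataset/")  # map-style
ds_stream = load_dataset("/tmp/DonutDS/dataset/", split="train", streaming=True)  # iterable

def inspect_batch(examples):
    print("keys received by the transform:", list(examples.keys()))
    return rescale_pixels(examples)

# map-style: the function receives a dict of lists and "pixel_values" is present
ds_map = ds_map.map(inspect_batch, batched=True, batch_size=2)

# iterable: the map is applied on the fly; pulling one element triggers it,
# and this is where the KeyError on "pixel_values" shows up
ds_stream = ds_stream.map(inspect_batch, batched=True, batch_size=2)
first = next(iter(ds_stream))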

Thank you to the team for building such a great library, and thanks in advance for the fix!

Steps to reproduce the bug

The builder configuration above can be used to reproduce the bug.

Expected behavior

.map() should behave consistently between map-style and iterable-style datasets, or at the very least the docs should describe the iterable-style behaviour and give examples for it. As they stand, the docs are of little help for this case.

Environment info

datasets==2.9.0
transformers==4.26.0

llStringll commented 1 year ago

PS - the docs could definitely use a 'special cases' section: not more explanations, just usage examples of the existing functions under a mix of special cases, e.g. a custom databuilder + an iterable dataset for large data + a dynamically applied .map() transform, along the lines of the sketch below.
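
For example, even a snippet this small would already help (a hypothetical doc example with a made-up per-example normalisation step; whether it currently behaves as expected is exactly what this issue is about):

import numpy as np
from datasets import load_dataset

# custom builder script + streaming for a large dataset + a .map() applied on the fly
ds = load_dataset("/tmp/DonutDS/dataset/", split="train", streaming=True)

def normalise(example):
    example["pixel_values"] = np.asarray(example["pixel_values"], dtype=np.float32) / 255.0
    return example

ds = ds.map(normalise)  # lazy: runs while iterating

for example in ds.take(2):
    print(example["labels"], np.asarray(example["pixel_values"]).shape)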