huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

nlp.Features does not distinguish between nullable and non-nullable types in PyArrow schema #492

Closed · jarednielsen closed 4 years ago

jarednielsen commented 4 years ago

Here's the code I'm trying to run:

dset_wikipedia = nlp.load_dataset("wikipedia", "20200501.en", split="train", cache_dir=args.cache_dir)
dset_wikipedia.drop(columns=["title"])
dset_wikipedia.features.pop("title")
dset_books = nlp.load_dataset("bookcorpus", split="train", cache_dir=args.cache_dir)
dset = nlp.concatenate_datasets([dset_wikipedia, dset_books])

This fails because they have different schemas, despite having identical features.

assert dset_wikipedia.features == dset_books.features # True
assert dset_wikipedia._data.schema == dset_books._data.schema # False

The Wikipedia dataset has 'text: string', while the BookCorpus dataset has 'text: string not null'. For now I hack together a matching schema with the following line, but it would be better if this were handled by Features itself.

dset_wikipedia._data = dset_wikipedia.data.cast(dset_books._data.schema)
lhoestq commented 4 years ago

In 0.4.0, the assertion in concatenate_datasets is on the features, not the schema. Could you try updating nlp?

Also, since 0.4.0, you can use dset_wikipedia.cast_(dset_books.features) to avoid the schema cast hack.

lhoestq commented 4 years ago

Or maybe the assertion comes from elsewhere?

jarednielsen commented 4 years ago

I'm using the master branch. The assertion failure comes from the underlying pa.concat_tables(), which is in the pyarrow package. That method does check schemas.

Since features.type does not carry information about nullable vs. non-nullable fields, the cast_() method won't resolve the schema mismatch. A schema contains information that is not stored in features.

lhoestq commented 4 years ago

I'm doing a refactor of type inference in #363. Both text fields should match after that.

lhoestq commented 4 years ago

By default, nullable will be set to True.

lhoestq commented 4 years ago

It should be good now. I was able to run

>>> from nlp import concatenate_datasets, load_dataset
>>>
>>> bookcorpus = load_dataset("bookcorpus", split="train")
>>> wiki = load_dataset("wikipedia", "20200501.en", split="train")
>>> wiki.remove_columns_("title")  # only keep the text
>>>
>>> assert bookcorpus.features.type == wiki.features.type
>>> bert_dataset = concatenate_datasets([bookcorpus, wiki])
jarednielsen commented 4 years ago

Thanks!