huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

nlp.Features does not distinguish between nullable and non-nullable types in PyArrow schema #492

Closed · jarednielsen closed 4 years ago

jarednielsen commented 4 years ago

Here's the code I'm trying to run:

dset_wikipedia = nlp.load_dataset("wikipedia", "20200501.en", split="train", cache_dir=args.cache_dir)
dset_wikipedia.drop(columns=["title"])
dset_wikipedia.features.pop("title")
dset_books = nlp.load_dataset("bookcorpus", split="train", cache_dir=args.cache_dir)
dset = nlp.concatenate_datasets([dset_wikipedia, dset_books])

This fails because they have different schemas, despite having identical features.

assert dset_wikipedia.features == dset_books.features # True
assert dset_wikipedia._data.schema == dset_books._data.schema # False

The Wikipedia dataset has 'text: string', while the BookCorpus dataset has 'text: string not null'. For now I hack together a matching schema with the following line, but it would be better if this were handled by Features itself.

dset_wikipedia._data = dset_wikipedia.data.cast(dset_books._data.schema)
lhoestq commented 4 years ago

In 0.4.0, the assertion in concatenate_datasets is on the features, not the schema. Could you try updating nlp?

Also, since 0.4.0, you can use dset_wikipedia.cast_(dset_books.features) to avoid the schema cast hack.

lhoestq commented 4 years ago

Or maybe the assertion comes from elsewhere?

jarednielsen commented 4 years ago

I'm using the master branch. The assertion failure comes from the underlying pa.concat_tables(), which is in the pyarrow package. That method does check schemas.

Since features.type does not carry information about nullable vs. non-nullable fields, the cast_() method won't resolve the schema mismatch. A schema contains information that is not stored in features.

lhoestq commented 4 years ago

I'm doing a refactor of type inference in #363. Both text fields should match after that.

lhoestq commented 4 years ago

By default, nullable will be set to True.

lhoestq commented 4 years ago

It should be good now. I was able to run

>>> from nlp import concatenate_datasets, load_dataset
>>>
>>> bookcorpus = load_dataset("bookcorpus", split="train")
>>> wiki = load_dataset("wikipedia", "20200501.en", split="train")
>>> wiki.remove_columns_("title")  # only keep the text
>>>
>>> assert bookcorpus.features.type == wiki.features.type
>>> bert_dataset = concatenate_datasets([bookcorpus, wiki])
jarednielsen commented 4 years ago

Thanks!