In 0.4.0, the assertion in concatenate_datasets is on the features, and not the schema. Could you try to update nlp?
Also, since 0.4.0, you can use dset_wikipedia.cast_(dset_books.features) to avoid the schema cast hack. Or maybe the assertion comes from elsewhere?
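For example, a minimal sketch of that approach (assuming the two datasets are already loaded as dset_books and dset_wikipedia):
>>> from nlp import concatenate_datasets
>>> # cast_ rewrites dset_wikipedia's features in place to match dset_books
>>> dset_wikipedia.cast_(dset_books.features)
>>> combined = concatenate_datasets([dset_books, dset_wikipedia])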
I'm using the master branch. The assertion failure comes from the underlying pa.concat_tables(), which is in the pyarrow package. That method does check schemas. Since features.type does not contain information about nullable vs non-nullable features, the cast_() method won't resolve the schema mismatch. There is information in a schema which is not stored in features.
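A standalone illustration of that pyarrow behavior (a hypothetical minimal repro, not code from this issue):
>>> import pyarrow as pa
>>> # two tables with the same column type but different nullability
>>> nullable = pa.table({"text": pa.array(["a"])})  # text: string
>>> not_null = pa.Table.from_arrays(
...     [pa.array(["b"])],
...     schema=pa.schema([pa.field("text", pa.string(), nullable=False)]),
... )  # text: string not null
>>> # pa.concat_tables requires matching schemas, so this raises pa.ArrowInvalid
>>> pa.concat_tables([nullable, not_null])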
I'm doing a refactor of type inference in #363. Both text fields should match after that.
By default, nullable will be set to True.
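That matches pyarrow's own default, e.g.:
>>> import pyarrow as pa
>>> pa.field("text", pa.string()).nullable  # nullable unless stated otherwise
True
>>> print(pa.field("text", pa.string(), nullable=False))
text: string not null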
It should be good now. I was able to run:
>>> from nlp import concatenate_datasets, load_dataset
>>>
>>> bookcorpus = load_dataset("bookcorpus", split="train")
>>> wiki = load_dataset("wikipedia", "20200501.en", split="train")
>>> wiki.remove_columns_("title") # only keep the text
>>>
>>> assert bookcorpus.features.type == wiki.features.type
>>> bert_dataset = concatenate_datasets([bookcorpus, wiki])
Thanks!
Here's the code I'm trying to run:
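(Reconstructed from the rest of the thread; the original snippet was along these lines:)
>>> from nlp import concatenate_datasets, load_dataset
>>> dset_books = load_dataset("bookcorpus", split="train")
>>> dset_wikipedia = load_dataset("wikipedia", "20200501.en", split="train")
>>> dset_wikipedia.remove_columns_("title")  # only keep the text
>>> bert_dataset = concatenate_datasets([dset_books, dset_wikipedia])  # raises on schema mismatch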
This fails because the two datasets have different schemas, despite having identical features: the Wikipedia dataset has 'text: string', while the BookCorpus dataset has 'text: string not null'. Currently I hack together a working schema match with the following line, but it would be better if this were handled in Features itself.
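(The hack line itself is a reconstruction; a plausible version, assuming the dataset's Arrow table is reachable through the private _data attribute:)
>>> dset_wikipedia._data = dset_wikipedia._data.cast(dset_books._data.schema)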