huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.04k stars 2.63k forks source link

TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe' #4210

Closed loretoparisi closed 2 years ago

loretoparisi commented 2 years ago

System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.10.0+cu111 (True)
- Tensorflow version (GPU?): 2.8.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@LysandreJik

Information

Tasks

Reproduction

from datasets import load_dataset,Features,Value,ClassLabel

class_names = ["cmn","deu","rus","fra","eng","jpn","spa","ita","kor","vie","nld","epo","por","tur","heb","hun","ell","ind","ara","arz","fin","bul","yue","swe","ukr","bel","que","ces","swh","nno","wuu","nob","zsm","est","kat","pol","lat","urd","sqi","isl","fry","afr","ron","fao","san","bre","tat","yid","uig","uzb","srp","qya","dan","pes","slk","eus","cycl","acm","tgl","lvs","kaz","hye","hin","lit","ben","cat","bos","hrv","tha","orv","cha","mon","lzh","scn","gle","mkd","slv","frm","glg","vol","ain","jbo","tok","ina","nds","mal","tlh","roh","ltz","oss","ido","gla","mlt","sco","ast","jav","oci","ile","ota","xal","tel","sjn","nov","khm","tpi","ang","aze","tgk","tuk","chv","hsb","dsb","bod","sme","cym","mri","ksh","kmr","ewe","kab","ber","tpw","udm","lld","pms","lad","grn","mlg","xho","pnb","grc","hat","lao","npi","cor","nah","avk","mar","guj","pan","kir","myv","prg","sux","crs","ckt","bak","zlm","hil","cbk","chr","nav","lkt","enm","arq","lin","abk","pcd","rom","gsw","tam","zul","awa","wln","amh","bar","hbo","mhr","bho","mrj","ckb","osx","pfl","mgm","sna","mah","hau","kan","nog","sin","glv","dng","kal","liv","vro","apc","jdt","fur","che","haw","yor","crh","pdc","ppl","kin","shs","mnw","tet","sah","kum","ngt","nya","pus","hif","mya","moh","wol","tir","ton","lzz","oar","lug","brx","non","mww","hak","nlv","ngu","bua","aym","vec","ibo","tkl","bam","kha","ceb","lou","fuc","smo","gag","lfn","arg","umb","tyv","kjh","oji","cyo","urh","kzj","pam","srd","lmo","swg","mdf","gil","snd","tso","sot","zza","tsn","pau","som","egl","ady","asm","ori","dtp","cho","max","kam","niu","sag","ilo","kaa","fuv","nch","hoc","iba","gbm","sun","war","mvv","pap","ary","kxi","csb","pag","cos","rif","kek","krc","aii","ban","ssw","tvl","mfe","tah","bvy","bcl","hnj","nau","nst","afb","quc","min","tmw","mad","bjn","mai","cjy","got","hsn","gan","tzl","dws","ldn","afh","sgs","krl","vep","rue","tly","mic","ext","izh","sma","jam","cmo","mwl","kpv","koi","bis","ike","run","evn","ryu","mnc","aoz","otk","kas","aln","akl","yua","shy","fkv","gos","fij","thv","zgh","gcf","cay","xmf","tig","div","lij","rap","hrx","cpi","tts","gaa","tmr","iii","ltg","bzt","syc","emx","gom","chg","osp","stq","frr","fro","nys","toi","new","phn","jpa","rel","drt","chn","pli","laa","bal","hdn","hax","mik","ajp","xqa","pal","crk","mni","lut","ayl","ood","sdh","ofs","nus","kiu","diq","qxq","alt","bfz","klj","mus","srn","guc","lim","zea","shi","mnr","bom","sat","szl"]
features = Features({ 'label': ClassLabel(names=class_names), 'text': Value('string')})
num_labels = features['label'].num_classes
data_files = { "train": "train.csv", "test": "test.csv" }
sentences = load_dataset("loretoparisi/tatoeba-sentences",
                             data_files=data_files,
                             delimiter='\t', 
                             column_names=['label', 'text'],
                             features = features

ERROR:

ClassLabel(num_classes=403, names=['cmn', 'deu', 'rus', 'fra', 'eng', 'jpn', 'spa', 'ita', 'kor', 'vie', 'nld', 'epo', 'por', 'tur', 'heb', 'hun', 'ell', 'ind', 'ara', 'arz', 'fin', 'bul', 'yue', 'swe', 'ukr', 'bel', 'que', 'ces', 'swh', 'nno', 'wuu', 'nob', 'zsm', 'est', 'kat', 'pol', 'lat', 'urd', 'sqi', 'isl', 'fry', 'afr', 'ron', 'fao', 'san', 'bre', 'tat', 'yid', 'uig', 'uzb', 'srp', 'qya', 'dan', 'pes', 'slk', 'eus', 'cycl', 'acm', 'tgl', 'lvs', 'kaz', 'hye', 'hin', 'lit', 'ben', 'cat', 'bos', 'hrv', 'tha', 'orv', 'cha', 'mon', 'lzh', 'scn', 'gle', 'mkd', 'slv', 'frm', 'glg', 'vol', 'ain', 'jbo', 'tok', 'ina', 'nds', 'mal', 'tlh', 'roh', 'ltz', 'oss', 'ido', 'gla', 'mlt', 'sco', 'ast', 'jav', 'oci', 'ile', 'ota', 'xal', 'tel', 'sjn', 'nov', 'khm', 'tpi', 'ang', 'aze', 'tgk', 'tuk', 'chv', 'hsb', 'dsb', 'bod', 'sme', 'cym', 'mri', 'ksh', 'kmr', 'ewe', 'kab', 'ber', 'tpw', 'udm', 'lld', 'pms', 'lad', 'grn', 'mlg', 'xho', 'pnb', 'grc', 'hat', 'lao', 'npi', 'cor', 'nah', 'avk', 'mar', 'guj', 'pan', 'kir', 'myv', 'prg', 'sux', 'crs', 'ckt', 'bak', 'zlm', 'hil', 'cbk', 'chr', 'nav', 'lkt', 'enm', 'arq', 'lin', 'abk', 'pcd', 'rom', 'gsw', 'tam', 'zul', 'awa', 'wln', 'amh', 'bar', 'hbo', 'mhr', 'bho', 'mrj', 'ckb', 'osx', 'pfl', 'mgm', 'sna', 'mah', 'hau', 'kan', 'nog', 'sin', 'glv', 'dng', 'kal', 'liv', 'vro', 'apc', 'jdt', 'fur', 'che', 'haw', 'yor', 'crh', 'pdc', 'ppl', 'kin', 'shs', 'mnw', 'tet', 'sah', 'kum', 'ngt', 'nya', 'pus', 'hif', 'mya', 'moh', 'wol', 'tir', 'ton', 'lzz', 'oar', 'lug', 'brx', 'non', 'mww', 'hak', 'nlv', 'ngu', 'bua', 'aym', 'vec', 'ibo', 'tkl', 'bam', 'kha', 'ceb', 'lou', 'fuc', 'smo', 'gag', 'lfn', 'arg', 'umb', 'tyv', 'kjh', 'oji', 'cyo', 'urh', 'kzj', 'pam', 'srd', 'lmo', 'swg', 'mdf', 'gil', 'snd', 'tso', 'sot', 'zza', 'tsn', 'pau', 'som', 'egl', 'ady', 'asm', 'ori', 'dtp', 'cho', 'max', 'kam', 'niu', 'sag', 'ilo', 'kaa', 'fuv', 'nch', 'hoc', 'iba', 'gbm', 'sun', 'war', 'mvv', 'pap', 'ary', 'kxi', 'csb', 'pag', 'cos', 'rif', 'kek', 'krc', 'aii', 'ban', 'ssw', 'tvl', 'mfe', 'tah', 'bvy', 'bcl', 'hnj', 'nau', 'nst', 'afb', 'quc', 'min', 'tmw', 'mad', 'bjn', 'mai', 'cjy', 'got', 'hsn', 'gan', 'tzl', 'dws', 'ldn', 'afh', 'sgs', 'krl', 'vep', 'rue', 'tly', 'mic', 'ext', 'izh', 'sma', 'jam', 'cmo', 'mwl', 'kpv', 'koi', 'bis', 'ike', 'run', 'evn', 'ryu', 'mnc', 'aoz', 'otk', 'kas', 'aln', 'akl', 'yua', 'shy', 'fkv', 'gos', 'fij', 'thv', 'zgh', 'gcf', 'cay', 'xmf', 'tig', 'div', 'lij', 'rap', 'hrx', 'cpi', 'tts', 'gaa', 'tmr', 'iii', 'ltg', 'bzt', 'syc', 'emx', 'gom', 'chg', 'osp', 'stq', 'frr', 'fro', 'nys', 'toi', 'new', 'phn', 'jpa', 'rel', 'drt', 'chn', 'pli', 'laa', 'bal', 'hdn', 'hax', 'mik', 'ajp', 'xqa', 'pal', 'crk', 'mni', 'lut', 'ayl', 'ood', 'sdh', 'ofs', 'nus', 'kiu', 'diq', 'qxq', 'alt', 'bfz', 'klj', 'mus', 'srn', 'guc', 'lim', 'zea', 'shi', 'mnr', 'bom', 'sat', 'szl'], id=None)
Value(dtype='string', id=None)
Using custom data configuration loretoparisi--tatoeba-sentences-7b2c5e991f398f39
Downloading and preparing dataset csv/loretoparisi--tatoeba-sentences to /root/.cache/huggingface/datasets/csv/loretoparisi--tatoeba-sentences-7b2c5e991f398f39/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...
Downloading data files: 100%
2/2 [00:18<00:00, 8.06s/it]
Downloading data: 100%
391M/391M [00:13<00:00, 35.3MB/s]
Downloading data: 100%
92.4M/92.4M [00:02<00:00, 36.5MB/s]
Failed to read file '/root/.cache/huggingface/datasets/downloads/933132df9905194ea9faeb30cabca8c49318795612f6495fcb941a290191dd5d' with error <class 'ValueError'>: invalid literal for int() with base 10: 'cmn'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
15 frames
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

ValueError: invalid literal for int() with base 10: 'cmn'

while loading without features it loads without errors

sentences = load_dataset("loretoparisi/tatoeba-sentences",
                             data_files=data_files,
                             delimiter='\t', 
                             column_names=['label', 'text']
                         )

but the label col seems to be wrong (without the ClassLabel object):

sentences['train'].features
{'label': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None)}

The dataset was https://huggingface.co/datasets/loretoparisi/tatoeba-sentences

Dataset format is:

ces Nechci vědět, co je tam uvnitř.
ces Kdo o tom chce slyšet?
deu Tom sagte, er fühle sich nicht wohl.
ber Mel-iyi-d anida-t tura ?
hun Gondom lesz rá rögtön.
ber Mel-iyi-d anida-tt tura ?
deu Ich will dich nicht reden hören.

Expected behavior

correctly load train and test files.
mariosasko commented 2 years ago

Hi! Casting class labels from strings is currently not supported in the CSV loader, but you can get the same result with an additional map as follows:

from datasets import load_dataset,Features,Value,ClassLabel
class_names = ["cmn","deu","rus","fra","eng","jpn","spa","ita","kor","vie","nld","epo","por","tur","heb","hun","ell","ind","ara","arz","fin","bul","yue","swe","ukr","bel","que","ces","swh","nno","wuu","nob","zsm","est","kat","pol","lat","urd","sqi","isl","fry","afr","ron","fao","san","bre","tat","yid","uig","uzb","srp","qya","dan","pes","slk","eus","cycl","acm","tgl","lvs","kaz","hye","hin","lit","ben","cat","bos","hrv","tha","orv","cha","mon","lzh","scn","gle","mkd","slv","frm","glg","vol","ain","jbo","tok","ina","nds","mal","tlh","roh","ltz","oss","ido","gla","mlt","sco","ast","jav","oci","ile","ota","xal","tel","sjn","nov","khm","tpi","ang","aze","tgk","tuk","chv","hsb","dsb","bod","sme","cym","mri","ksh","kmr","ewe","kab","ber","tpw","udm","lld","pms","lad","grn","mlg","xho","pnb","grc","hat","lao","npi","cor","nah","avk","mar","guj","pan","kir","myv","prg","sux","crs","ckt","bak","zlm","hil","cbk","chr","nav","lkt","enm","arq","lin","abk","pcd","rom","gsw","tam","zul","awa","wln","amh","bar","hbo","mhr","bho","mrj","ckb","osx","pfl","mgm","sna","mah","hau","kan","nog","sin","glv","dng","kal","liv","vro","apc","jdt","fur","che","haw","yor","crh","pdc","ppl","kin","shs","mnw","tet","sah","kum","ngt","nya","pus","hif","mya","moh","wol","tir","ton","lzz","oar","lug","brx","non","mww","hak","nlv","ngu","bua","aym","vec","ibo","tkl","bam","kha","ceb","lou","fuc","smo","gag","lfn","arg","umb","tyv","kjh","oji","cyo","urh","kzj","pam","srd","lmo","swg","mdf","gil","snd","tso","sot","zza","tsn","pau","som","egl","ady","asm","ori","dtp","cho","max","kam","niu","sag","ilo","kaa","fuv","nch","hoc","iba","gbm","sun","war","mvv","pap","ary","kxi","csb","pag","cos","rif","kek","krc","aii","ban","ssw","tvl","mfe","tah","bvy","bcl","hnj","nau","nst","afb","quc","min","tmw","mad","bjn","mai","cjy","got","hsn","gan","tzl","dws","ldn","afh","sgs","krl","vep","rue","tly","mic","ext","izh","sma","jam","cmo","mwl","kpv","koi","bis","ike","run","evn","ryu","mnc","aoz","otk","kas","aln","akl","yua","shy","fkv","gos","fij","thv","zgh","gcf","cay","xmf","tig","div","lij","rap","hrx","cpi","tts","gaa","tmr","iii","ltg","bzt","syc","emx","gom","chg","osp","stq","frr","fro","nys","toi","new","phn","jpa","rel","drt","chn","pli","laa","bal","hdn","hax","mik","ajp","xqa","pal","crk","mni","lut","ayl","ood","sdh","ofs","nus","kiu","diq","qxq","alt","bfz","klj","mus","srn","guc","lim","zea","shi","mnr","bom","sat","szl"]
features = Features({ 'label': ClassLabel(names=class_names), 'text': Value('string')})
num_labels = features['label'].num_classes
data_files = { "train": "train.csv", "test": "test.csv" }
sentences = load_dataset(
     "loretoparisi/tatoeba-sentences",
     data_files=data_files,
     delimiter='\t', 
     column_names=['label', 'text'],
)
# You can make this part faster with num_proc=<some int>
sentences = sentences.map(lambda ex: features["label"].str2int(ex["label"]) if ex["label"] is not None else None, features=features)

@lhoestq IIRC, I suggested adding cast_to_storage to ClassLabel + table_cast to the packaged loaders if the ClassLabel/Image/Audio type is present in features to avoid this kind of error, but your concern was speed. IMO shouldn't be a problem if we do table_cast only when these features are present.

albertvillanova commented 2 years ago

I agree packaged loaders should support ClassLabel feature without throwing an error.

loretoparisi commented 2 years ago

@albertvillanova @mariosasko thank you, with that change now I get

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-9-eeb68eeb9bec>](https://localhost:8080/#) in <module>()
     11 )
     12 # You can make this part faster with num_proc=<some int>
---> 13 sentences = sentences.map(lambda ex: features["label"].str2int(ex["label"]) if ex["label"] is not None else None, features=features)
     14 sentences = sentences.shuffle()

8 frames
[/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in validate_function_output(processed_inputs, indices)
   2193             if processed_inputs is not None and not isinstance(processed_inputs, (Mapping, pa.Table)):
   2194                 raise TypeError(
-> 2195                     f"Provided `function` which is applied to all elements of table returns a variable of type {type(processed_inputs)}. Make sure provided `function` returns a variable of type `dict` (or a pyarrow table) to update the dataset or `None` if you are only interested in side effects."
   2196                 )
   2197             elif isinstance(indices, list) and isinstance(processed_inputs, Mapping):

TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'int'>. Make sure provided `function` returns a variable of type `dict` (or a pyarrow table) to update the dataset or `None` if you are only interested in side effects.

the error is raised by this

[/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in validate_function_output(processed_inputs, indices)
loretoparisi commented 2 years ago

@mariosasko changed it like

sentences = sentences.map(lambda ex: {"label" : features["label"].str2int(ex["label"]) if ex["label"] is not None else None}, features=features)

to avoid the above errorr.

paulthemagno commented 2 years ago

Any update on this? Is this correct ?

@mariosasko changed it like

sentences = sentences.map(lambda ex: {"label" : features["label"].str2int(ex["label"]) if ex["label"] is not None else None}, features=features)

to avoid the above errorr.