Closed loretoparisi closed 2 years ago
Hi! Casting class labels from strings is currently not supported in the CSV loader, but you can get the same result with an additional map as follows:
from datasets import load_dataset,Features,Value,ClassLabel
class_names = ["cmn","deu","rus","fra","eng","jpn","spa","ita","kor","vie","nld","epo","por","tur","heb","hun","ell","ind","ara","arz","fin","bul","yue","swe","ukr","bel","que","ces","swh","nno","wuu","nob","zsm","est","kat","pol","lat","urd","sqi","isl","fry","afr","ron","fao","san","bre","tat","yid","uig","uzb","srp","qya","dan","pes","slk","eus","cycl","acm","tgl","lvs","kaz","hye","hin","lit","ben","cat","bos","hrv","tha","orv","cha","mon","lzh","scn","gle","mkd","slv","frm","glg","vol","ain","jbo","tok","ina","nds","mal","tlh","roh","ltz","oss","ido","gla","mlt","sco","ast","jav","oci","ile","ota","xal","tel","sjn","nov","khm","tpi","ang","aze","tgk","tuk","chv","hsb","dsb","bod","sme","cym","mri","ksh","kmr","ewe","kab","ber","tpw","udm","lld","pms","lad","grn","mlg","xho","pnb","grc","hat","lao","npi","cor","nah","avk","mar","guj","pan","kir","myv","prg","sux","crs","ckt","bak","zlm","hil","cbk","chr","nav","lkt","enm","arq","lin","abk","pcd","rom","gsw","tam","zul","awa","wln","amh","bar","hbo","mhr","bho","mrj","ckb","osx","pfl","mgm","sna","mah","hau","kan","nog","sin","glv","dng","kal","liv","vro","apc","jdt","fur","che","haw","yor","crh","pdc","ppl","kin","shs","mnw","tet","sah","kum","ngt","nya","pus","hif","mya","moh","wol","tir","ton","lzz","oar","lug","brx","non","mww","hak","nlv","ngu","bua","aym","vec","ibo","tkl","bam","kha","ceb","lou","fuc","smo","gag","lfn","arg","umb","tyv","kjh","oji","cyo","urh","kzj","pam","srd","lmo","swg","mdf","gil","snd","tso","sot","zza","tsn","pau","som","egl","ady","asm","ori","dtp","cho","max","kam","niu","sag","ilo","kaa","fuv","nch","hoc","iba","gbm","sun","war","mvv","pap","ary","kxi","csb","pag","cos","rif","kek","krc","aii","ban","ssw","tvl","mfe","tah","bvy","bcl","hnj","nau","nst","afb","quc","min","tmw","mad","bjn","mai","cjy","got","hsn","gan","tzl","dws","ldn","afh","sgs","krl","vep","rue","tly","mic","ext","izh","sma","jam","cmo","mwl","kpv","koi","bis","ike","run","evn","ryu","mnc","aoz","otk","kas","aln","akl","yua","shy","fkv","gos","fij","thv","zgh","gcf","cay","xmf","tig","div","lij","rap","hrx","cpi","tts","gaa","tmr","iii","ltg","bzt","syc","emx","gom","chg","osp","stq","frr","fro","nys","toi","new","phn","jpa","rel","drt","chn","pli","laa","bal","hdn","hax","mik","ajp","xqa","pal","crk","mni","lut","ayl","ood","sdh","ofs","nus","kiu","diq","qxq","alt","bfz","klj","mus","srn","guc","lim","zea","shi","mnr","bom","sat","szl"]
features = Features({ 'label': ClassLabel(names=class_names), 'text': Value('string')})
num_labels = features['label'].num_classes
data_files = { "train": "train.csv", "test": "test.csv" }
sentences = load_dataset(
"loretoparisi/tatoeba-sentences",
data_files=data_files,
delimiter='\t',
column_names=['label', 'text'],
)
# You can make this part faster with num_proc=<some int>
sentences = sentences.map(lambda ex: features["label"].str2int(ex["label"]) if ex["label"] is not None else None, features=features)
@lhoestq IIRC, I suggested adding cast_to_storage
to ClassLabel
+ table_cast
to the packaged loaders if the ClassLabel
/Image
/Audio
type is present in features
to avoid this kind of error, but your concern was speed. IMO shouldn't be a problem if we do table_cast
only when these features are present.
I agree packaged loaders should support ClassLabel
feature without throwing an error.
@albertvillanova @mariosasko thank you, with that change now I get
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[<ipython-input-9-eeb68eeb9bec>](https://localhost:8080/#) in <module>()
11 )
12 # You can make this part faster with num_proc=<some int>
---> 13 sentences = sentences.map(lambda ex: features["label"].str2int(ex["label"]) if ex["label"] is not None else None, features=features)
14 sentences = sentences.shuffle()
8 frames
[/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in validate_function_output(processed_inputs, indices)
2193 if processed_inputs is not None and not isinstance(processed_inputs, (Mapping, pa.Table)):
2194 raise TypeError(
-> 2195 f"Provided `function` which is applied to all elements of table returns a variable of type {type(processed_inputs)}. Make sure provided `function` returns a variable of type `dict` (or a pyarrow table) to update the dataset or `None` if you are only interested in side effects."
2196 )
2197 elif isinstance(indices, list) and isinstance(processed_inputs, Mapping):
TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'int'>. Make sure provided `function` returns a variable of type `dict` (or a pyarrow table) to update the dataset or `None` if you are only interested in side effects.
the error is raised by this
[/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in validate_function_output(processed_inputs, indices)
@mariosasko changed it like
sentences = sentences.map(lambda ex: {"label" : features["label"].str2int(ex["label"]) if ex["label"] is not None else None}, features=features)
to avoid the above errorr.
Any update on this? Is this correct ?
@mariosasko changed it like
sentences = sentences.map(lambda ex: {"label" : features["label"].str2int(ex["label"]) if ex["label"] is not None else None}, features=features)
to avoid the above errorr.
System Info
Who can help?
@LysandreJik
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
ERROR:
while loading without
features
it loads without errorsbut the
label
col seems to be wrong (without theClassLabel
object):The dataset was https://huggingface.co/datasets/loretoparisi/tatoeba-sentences
Dataset format is:
Expected behavior