Open mpickard-dataprof opened 6 months ago
Hello!
This is due to how pandas writes NumPy arrays to CSV. To fix this, you can convert them to lists yourself.
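df = ds.to_pandas()
# Convert each NumPy array to a plain Python list so pandas writes it with commas
df['int'] = df['int'].apply(lambda arr: arr.tolist())
# The path is the first positional argument; keyword arguments must come after it
df.to_csv('../output/temp.csv', index=False)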
I think it would be good if datasets did the conversion itself, but it's a breaking change, so I would wait for the green light from someone at HF.
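For reference, here is a minimal, self-contained sketch of that workaround, assuming only pandas and NumPy are installed. The DataFrame is a hypothetical stand-in for ds.to_pandas(), and the CSV is written to a string instead of a file. It calls arr.tolist() so the cells hold plain Python ints (on NumPy 2.x, list(arr) would leave NumPy scalars whose repr is np.int64(...)). Parsing the cell back with ast.literal_eval is just one possible approach, not something datasets does for you.

```python
import ast
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for ds.to_pandas(): list columns arrive as NumPy arrays.
df = pd.DataFrame(
    {
        "pokemon": ["bulbasaur", "squirtle"],
        "int": [
            np.array([ord(c) for c in "bulbasaur"]),
            np.array([ord(c) for c in "squirtle"]),
        ],
    }
)

# Without conversion, each cell is written as str(array): space-separated, no commas.
raw_csv = df.to_csv(index=False)

# Workaround: turn each array into a plain Python list before writing.
fixed = df.copy()
fixed["int"] = fixed["int"].apply(lambda arr: arr.tolist())
fixed_csv = fixed.to_csv(index=False)

# Reading back: the cell is a quoted string, so it has to be parsed explicitly.
restored = pd.read_csv(io.StringIO(fixed_csv))
restored["int"] = restored["int"].apply(ast.literal_eval)

print(raw_csv)
print(fixed_csv)
print(restored["int"].tolist())
```

Because the list text contains commas, the CSV writer wraps each cell in double quotes, which is what keeps the row at the right number of columns.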
Describe the bug
The to_csv() method does not output commas in lists, so when the Dataset is loaded back in, the data structure of the list column is not correct.
Here's an example:
Obviously, it's not as trivial as inserting commas in the list, since it's a comma-separated file. But hopefully there's a way to export the list so that load_dataset() imports it correctly.
Steps to reproduce the bug
Here's some code to reproduce the bug:
temp.csv then contains:
pokemon,type,int
bulbasaur,grass,[ 98 117 108 98 97 115 97 117 114]
squirtle,water,[115 113 117 105 114 116 108 101]

pokemon,type,int
bulbasaur,grass,[98, 117, 108, 98, 97, 115, 97, 117, 114]
squirtle,water,[115, 113, 117, 105, 114, 116, 108, 101]

pokemon,type,int
bulbasaur,grass,"[98, 117, 108, 98, 97, 115, 97, 117, 114]"
squirtle,water,"[115, 113, 117, 105, 114, 116, 108, 101]"
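Of the listings above, only the quoted form survives a standard CSV parser with three columns per row. As a rough sketch of why, using nothing but the Python standard library (this is not how datasets or pandas implement to_csv(), and the rows variable is made up for illustration), each list can be serialized as JSON text and the csv module left to add the quotes:

```python
import csv
import io
import json

# Illustrative rows matching the example data in this issue.
rows = [
    {"pokemon": "bulbasaur", "type": "grass",
     "int": [98, 117, 108, 98, 97, 115, 97, 117, 114]},
    {"pokemon": "squirtle", "type": "water",
     "int": [115, 113, 117, 105, 114, 116, 108, 101]},
]

# Write: encode the list as JSON text; the csv writer quotes any cell
# that contains the delimiter, so the embedded commas stay in one field.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["pokemon", "type", "int"], lineterminator="\n"
)
writer.writeheader()
for row in rows:
    writer.writerow({**row, "int": json.dumps(row["int"])})
csv_text = buffer.getvalue()

# Read: every cell comes back as a string, so decode the list column explicitly.
reader = csv.DictReader(io.StringIO(csv_text))
restored = [{**row, "int": json.loads(row["int"])} for row in reader]

print(csv_text)
print(restored == rows)
```

The reader still has to know which column holds JSON, since CSV itself carries no type information.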