huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Dataset.to_csv() missing commas in columns with lists #6778

Open mpickard-dataprof opened 6 months ago

mpickard-dataprof commented 6 months ago

Describe the bug

The to_csv() method writes list columns without commas between the elements. As a result, when the CSV is loaded back into a Dataset, the column that held lists no longer has the correct data structure.

Here's an example:

Obviously, it's not as trivial as inserting commas into the list, since it's a comma-separated file. But hopefully there's a way to export the list so that load_dataset() imports it correctly.

Steps to reproduce the bug

Here's some code to reproduce the bug:

from datasets import Dataset

ds = Dataset.from_dict(
    {
        "pokemon": ["bulbasaur", "squirtle"],
        "type": ["grass", "water"]
    }
)

def ascii_to_hex(text):
    return [ord(c) for c in text]

ds = ds.map(lambda x: {"int": ascii_to_hex(x['pokemon'])})

ds.to_csv('../output/temp.csv')

temp.csv then contains:


Expected behavior

ACTUAL OUTPUT:

pokemon,type,int
bulbasaur,grass,[ 98 117 108 98 97 115 97 117 114]
squirtle,water,[115 113 117 105 114 116 108 101]


EXPECTED OUTPUT:

pokemon,type,int
bulbasaur,grass,[98, 117, 108, 98, 97, 115, 97, 117, 114]
squirtle,water,[115, 113, 117, 105, 114, 116, 108, 101]


or probably something more like this since it's a CSV file:

pokemon,type,int
bulbasaur,grass,"[98, 117, 108, 98, 97, 115, 97, 117, 114]"
squirtle,water,"[115, 113, 117, 105, 114, 116, 108, 101]"
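The quoted form is what a standard CSV writer emits when a field itself contains commas. Here is a minimal sketch of that, using only Python's csv and json modules rather than datasets. The `rows`, `csv_text` and `restored` names are made up for illustration, and encoding the list with json.dumps is an assumption about one workable format, not what to_csv() does today.

```python
import csv
import io
import json

# Same data as in the reproduction, with the "int" column as plain lists.
rows = [
    {"pokemon": "bulbasaur", "type": "grass", "int": [98, 117, 108, 98, 97, 115, 97, 117, 114]},
    {"pokemon": "squirtle", "type": "water", "int": [115, 113, 117, 105, 114, 116, 108, 101]},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pokemon", "type", "int"], lineterminator="\n")
writer.writeheader()
for row in rows:
    # json.dumps produces "[98, 117, ...]"; the writer wraps the field in
    # double quotes because it contains the delimiter.
    writer.writerow({**row, "int": json.dumps(row["int"])})

csv_text = buffer.getvalue()
print(csv_text)

# Reading back: the quoted field arrives as a single string,
# and json.loads turns it into a list again.
restored = [
    {**row, "int": json.loads(row["int"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]
```

With this encoding the list survives the round trip, at the cost of an explicit parsing step on the reading side.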



Environment info

Package version
Name: datasets
Version: 2.16.1

Python
version: 3.10.12

OS info
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
...
UBUNTU_CODENAME=jammy

Dref360 commented 6 months ago

Hello!

This is due to how pandas writes NumPy arrays to CSV (Source). To fix this, you can convert them to lists yourself.

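df = ds.to_pandas()
# tolist() converts each NumPy array into a list of plain Python ints
df['int'] = df['int'].apply(lambda arr: arr.tolist())
# the path has to come before keyword arguments
df.to_csv('../output/temp.csv', index=False)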

I think it would be good if datasets did the conversion itself, but that would be a breaking change, so I would wait for the green light from someone at HF.
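For anyone who wants to check the round trip, here is a rough sketch of the workaround followed by a read-back step. It makes a few assumptions: the DataFrame is built by hand as a stand-in for ds.to_pandas(), the CSV goes to an in-memory buffer instead of '../output/temp.csv', and the file is read back with pandas rather than load_dataset(), whose handling of such a column I have not verified. Encoding each cell with json.dumps is just one option.

```python
import io
import json

import numpy as np
import pandas as pd

# Stand-in for ds.to_pandas(): the list column holds NumPy arrays.
names = ["bulbasaur", "squirtle"]
df = pd.DataFrame(
    {
        "pokemon": names,
        "type": ["grass", "water"],
        "int": [np.array([ord(c) for c in name]) for name in names],
    }
)

# Serialise each array as a JSON list so the commas are written out.
df["int"] = df["int"].apply(lambda arr: json.dumps(arr.tolist()))

# Write to an in-memory buffer (a file path would work the same way).
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read back, parsing the JSON strings into Python lists.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"int": json.loads})
print(restored["int"].tolist())
```

Without the `converters` argument the "int" column would come back as plain strings, so whichever loader is used needs an equivalent parsing step.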