vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] The right way to add derived column when converting csv to hdf5? #1184

Open r3verser opened 3 years ago

r3verser commented 3 years ago

**Description** Hi, I can't get this to work! I'm converting a huge CSV to HDF5 and I need to add a new column whose value is derived from an existing one. The following error occurred: TypeError: list indices must be integers or slices, not str, on the line df["md5"] = [hashlib.md5(val.encode('utf-8')).hexdigest() for val in df['address']]. Code sample:

import vaex as vs
...
for i, df in enumerate(vs.from_csv('../data.csv', chunk_size=100_000, encoding="ISO-8859-1", usecols=columns, dtype=dtypes)):
    df = df.extract()
    df["md5"] = [hashlib.md5(val.encode('utf-8')).hexdigest() for val in df['address']]
    df.export_hdf5(f'./db_{i:02}.hdf5')

Software information

JovanVeljanoski commented 3 years ago

Hi,

Can you please try installing the latest alpha from pip (pip install vaex==4.0.0a13)?

And try to convert your csv data using this example.

Although I suspect that this line: df["md5"] = [hashlib.md5(val.encode('utf-8')).hexdigest() for val in df['address']] is the problematic one. What does it contain? If you want to store lists, I think you should use the latest alpha and try exporting to an Arrow format.

I hope this helps!
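For reference, here is a minimal sketch of one way the derived column could be added without assigning a plain Python list. It assumes the column really holds strings and that the installed vaex version provides DataFrame.apply with an arguments parameter. The names md5_hex and add_md5_column are hypothetical helpers made up for this illustration and are not part of vaex.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Return the hex MD5 digest of a single string (UTF-8 encoded)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def add_md5_column(df, source="address", target="md5"):
    """Attach a column derived from `source` to a vaex DataFrame.

    Assumption: `df` is a vaex DataFrame whose `apply` method accepts
    a per-row function plus a list of argument expressions. The result
    is an expression, so it is assigned as a column instead of a list.
    """
    df[target] = df.apply(md5_hex, arguments=[df[source]])
    return df
```

In the loop from the question this would be called as add_md5_column(df) on each chunk, right before df.export_hdf5(...). Whether it behaves well on very large chunks has not been checked here, so treat it as a starting point.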