vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Add timezone-aware timestamps to supported type #2359

Open KIC opened 1 year ago

KIC commented 1 year ago

I have more than 500k individual CSV files, each with roughly 1,000 records or more on average, and I want to convert them into one HDF5 file. Plain pandas read_csv and to_hdf with append=True would take a week or so. vaex seems to support this, but then I have 2 problems:

1) I need to add a constant column (the name of the file) to each CSV's DataFrame, so I can't use read_csv(convert=True).
2) I have timestamps from all over the planet and I expect them to keep their timezone, as I will need it again later, so I can't use df.export.
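Until timezone-aware timestamps are supported, one possible workaround for both points is to pre-process the files before handing them to vaex. The following is only a sketch using the Python standard library, not vaex API: it streams each CSV into a single combined CSV, adds the file name as a constant column, and replaces each timezone-aware timestamp with two plain columns (UTC epoch milliseconds and the UTC offset in minutes) so the original local time can be rebuilt later. The names `split_timestamp`, `combine_csv_files`, `source_file`, and the `timestamp` column name are hypothetical, and the sketch assumes ISO 8601 timestamps with an explicit offset and identical columns in every file.

```python
import csv
from datetime import datetime, timedelta, timezone
from pathlib import Path

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_timestamp(text):
    """Return (UTC epoch milliseconds, UTC offset in minutes) for an ISO 8601 string.

    Assumes an explicit offset such as "+02:00"; a trailing "Z" is only
    accepted by datetime.fromisoformat from Python 3.11 on.
    """
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {text!r}")
    # Integer arithmetic avoids floating-point rounding of the epoch value.
    epoch_ms = (parsed - EPOCH) // timedelta(milliseconds=1)
    offset_min = int(parsed.utcoffset().total_seconds() // 60)
    return epoch_ms, offset_min


def combine_csv_files(paths, out_path, time_column="timestamp"):
    """Stream every CSV into one file, tagging each row with its source file name."""
    writer = None
    with open(out_path, "w", newline="") as out_file:
        for path in map(Path, paths):
            with open(path, newline="") as in_file:
                for row in csv.DictReader(in_file):
                    epoch_ms, offset_min = split_timestamp(row.pop(time_column))
                    row["source_file"] = path.name  # the constant column per file
                    row[time_column + "_utc_ms"] = epoch_ms
                    row[time_column + "_offset_min"] = offset_min
                    if writer is None:
                        # The header is taken from the first row seen.
                        writer = csv.DictWriter(out_file, fieldnames=list(row))
                        writer.writeheader()
                    writer.writerow(row)
```

The combined file contains only strings and integers, so it could then be converted in a single pass, for example with the read_csv(convert=True) route mentioned above. Storing the offset rather than a zone name loses daylight-saving rules, which may or may not matter for the later use.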

Discussed in https://github.com/vaexio/vaex/discussions/2008

Originally posted by **Hasham04** April 10, 2022

I am reading a parquet file, and one of the date-time columns is of type timestamp[ms, tz=UTC]. I have tried converting it with `df['time'].astype('datetime64[ms]')` and `df['time'].astype('timestamp')`, but I always get the error `raise NotImplementedError(f'Cannot convert {arrow_type}') NotImplementedError: Cannot convert timestamp[ms, tz=UTC]`. Any suggestions on how to convert this to a supported type? I can print the data frame, but the second I try to interact with this column, for example by calling `df.dtypes`, I get this error. Ideally I want to convert this column to a string so that I can concatenate it with another string column. Thanks for your help.