liuxu77 / LargeST

LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting (NeurIPS 2023 DB Track)
MIT License

Error when reading raw h5 data of ca #2

Closed: tjb-tech closed this issue 1 year ago

tjb-tech commented 1 year ago

Hi, I am trying to use your ca data. When I read the raw h5 data of ca for 2021 with the following code, I get the error ValueError: cannot reindex on an axis with duplicate labels.

import pandas as pd
import h5py
year = '2021'  # please specify the year, our experiments use 2019
ca_his = pd.read_hdf('/root/paddlejob/workspace/env_run/afs/raw_data/d7/'+ 'ca_his_raw_' + year + '.h5')
print(ca_his.head())
### please comment this line if you don't want to do resampling
# ca_his = ca_his.resample('15T').mean().round(0)
###

ca_his = ca_his.fillna(0)
print('check null value number', ca_his.isnull().any().sum())
print(ca_his)

the full error:

Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/data_process/pre_process.py", line 4, in <module>
    ca_his = pd.read_hdf('/root/paddlejob/workspace/env_run/afs/raw_data/d7/'+ 'ca_his_raw_' + year + '.h5')
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/pytables.py", line 446, in read_hdf
    return store.select(
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/pytables.py", line 866, in select
    return it.get_result()
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/pytables.py", line 1937, in get_result
    results = self.func(self.start, self.stop, where)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/pytables.py", line 850, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/pytables.py", line 3200, in read
    out = out.reindex(columns=items, copy=False)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 5055, in reindex
    return super().reindex(
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 5360, in reindex
    return self._reindex_axes(
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 4890, in _reindex_axes
    frame = frame._reindex_columns(
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 4932, in _reindex_columns
    new_columns, indexer = self.columns.reindex(
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py", line 4274, in reindex
    raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels

BTW, when I use similar code to read the raw h5 data of ca for 2019, everything works. Could you look into this? If something is wrong with the uploaded raw data, could you please check it and re-upload a correct copy? Thanks a lot! Looking forward to hearing from you.
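For context on the message itself: the last frames of the traceback show pandas calling reindex on the frame's columns, which fails when those column labels are not unique. The following is a minimal sketch on a hypothetical toy DataFrame (not the LargeST data) showing how that kind of failure arises and how duplicated labels could be listed. The helper name try_reindex is made up for illustration, and the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy frame with a duplicated column label "a".
df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])


def try_reindex(frame, columns):
    """Return the error message if reindexing the columns fails, else None."""
    try:
        frame.reindex(columns=columns)
    except ValueError as exc:
        return str(exc)
    return None


# Reindexing to a different set of columns fails because "a" is ambiguous.
message = try_reindex(df, ["a", "b", "c"])
print(message)

# List the labels that occur more than once.
duplicated_labels = df.columns[df.columns.duplicated()].tolist()
print(duplicated_labels)
```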

tjb-tech commented 1 year ago

Sorry, I found the reason why the error occurred. I had uploaded the original .h5 file directly to the server, and the error was caused by the upload process. After I re-uploaded the .zip data instead, everything went well with the 2021 data. Thanks a lot!
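One way to catch this kind of transfer problem before reading the file is to compare a checksum of the local copy with a checksum of the copy on the server. The sketch below uses only the Python standard library. The function names sha256_of_file and files_match are hypothetical helpers, and the paths in the usage comment are placeholders, not paths from this repository.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large .h5 files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Example usage (placeholder paths, adjust to your setup):
# print(sha256_of_file("ca_his_raw_2021.h5"))
# Run the same function on the server copy and compare the two digests.
```

If the two digests differ, the file changed in transit and should be transferred again before any further debugging.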

liuxu77 commented 1 year ago

Hi, good to know the issue is solved. Thanks for using the data.