Open jackhere-lab opened 3 weeks ago

I want to download the CANSIM table 12-10-0128-01 using the stats_can library. However, I am getting an HDF5 error. I can load smaller tables without issues. Is there any workaround for this?
Hi @jackhere-lab, it's difficult for me to help troubleshoot without the actual error message and some information about your environment. I haven't encountered this error personally; can you provide more details?
I am running the code in Azure Databricks. Here is the full error message:
HDF5ExtError: HDF5 error back trace
File "H5D.c", line 1371, in H5Dwrite can't synchronously write data File "H5D.c", line 1317, in H5Dwrite_api_common can't write data File "H5VLcallback.c", line 2282, in H5VL_dataset_write_direct dataset write failed File "H5VLcallback.c", line 2237, in H5VLdataset_write dataset write failed File "H5VLnative_dataset.c", line 420, in H5VLnative_dataset_write can't write data File "H5Dio.c", line 824, in H5Dwrite can't write data File "H5Dchunk.c", line 3295, in H5Dchunk_write unable to read raw data chunk File "H5Dchunk.c", line 4626, in H5Dchunk_lock unable to preempt chunk(s) from cache File "H5Dchunk.c", line 4286, in H5Dchunk_cache_prune unable to preempt one or more raw data cache entry File "H5Dchunk.c", line 4138, in H5Dchunk_cache_evict cannot flush indexed storage buffer File "H5Dchunk.c", line 4061, in H5Dchunk_flush_entry unable to write raw data to file File "H5Fio.c", line 179, in H5F_shared_block_write write through page buffer failed File "H5PB.c", line 992, in H5PB_write write through metadata accumulator failed File "H5Faccum.c", line 821, in H5Faccum_write file write failed File "H5FDint.c", line 318, in H5FD_write driver write request failed File "H5FDsec2.c", line 808, in H5FD__sec2_write file write failed: time = Tue Nov 12 19:22:38 2024 , filename = '/Workspace/Users/######@###ex.com/stats_can.h5', file descriptor = 90, errno = 27, error message = 'File too large', buf = 0x78770d0, total write size = 259856, bytes this sub-write = 259856, bytes actually written = 18446744073709551615, offset = 0
End of HDF5 error back trace
Problems appending the records.

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/stats_can/sc.py:335, in table_from_h5(table, h5file, path)
    334     with pd.HDFStore(h5, "r") as store:
--> 335         df = pd.read_hdf(store, key=table)
    336 except (KeyError, OSError):

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/tables/tableextension.pyx:542, in tables.tableextension.Table._append_records()
Gotcha. I've been using the library in Databricks recently and the HDF5 backend the library uses for retention and updating really doesn't play nice with it. To be honest, the whole idea of using HDF5 to store things, or handling table retention in the library at all, was a mistake on my part. An upcoming release is going to rip all that out and focus on just retrieving data from the API and getting it into a dataframe, leaving storage and updating to other tools.
I'd recommend just using stats_can.sc.download_tables and stats_can.sc.zip_table_to_dataframe with the path set to a DBFS mount or Unity Catalog volume. That will download the zipped CSV and then extract it into a pandas dataframe. From there you can take over with Spark, write the dataframe out, or whatever else you need to do.
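Roughly like this, as a minimal sketch; the /Volumes path, the keyword arguments, the table name, and the use of the notebook's built-in spark session are illustrative assumptions, so adjust them to your workspace:

```python
from stats_can import sc

# Example path only - point this at your own Unity Catalog volume or DBFS
# mount so the zipped table lands somewhere with enough room.
path = "/Volumes/main/default/statcan_raw"

# Pull down the zipped CSV for the table from the StatsCan API...
sc.download_tables(["12-10-0128-01"], path=path)

# ...then read the downloaded zip into a pandas dataframe.
df = sc.zip_table_to_dataframe("12-10-0128-01", path=path)

# From here it's ordinary pandas/Spark work, e.g. hand it to Spark
# (`spark` is the session Databricks notebooks provide) and write it out
# to an example table name.
spark_df = spark.createDataFrame(df)
spark_df.write.mode("overwrite").saveAsTable("main.default.table_12_10_0128_01")
```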
You could even do just the download_tables part and then unzip it and read it in with Spark directly if you want a more Databricks-native way of doing things.
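If you go that route, it might look roughly like the sketch below. The zip and CSV file names are assumptions here, so check what download_tables actually writes to the path before relying on them:

```python
import zipfile

from stats_can import sc

# Again, an example path - use your own volume or DBFS mount.
raw_path = "/Volumes/main/default/statcan_raw"

# Only download the zipped CSV; skip pandas entirely.
sc.download_tables(["12-10-0128-01"], path=raw_path)

# Assumed naming: the full-table download for 12-10-0128-01 arrives as
# 12100128-eng.zip containing 12100128.csv (plus a metadata CSV).
with zipfile.ZipFile(f"{raw_path}/12100128-eng.zip") as zf:
    zf.extractall(raw_path)

# Let Spark read the extracted CSV directly, so the full table never has
# to fit in driver memory as a pandas dataframe.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(f"{raw_path}/12100128.csv")
)
```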
Hope that helps