nlgranger opened this issue 5 months ago
Hi @nlgranger, I'm also facing problems with slow access times (see #856) when I use the Waymo dataset parquet files directly for training.
Sorting helps, but it has to be done on every parquet load, and it goes without saying that the sorting itself takes time.
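(A rough sketch of that per-load sort, assuming pyarrow; the file name and key column names are placeholders for the actual Waymo ones:)

```python
import pyarrow.parquet as pq

# Re-sorting on every load: read the whole file, then sort it in memory.
table = pq.read_table("lidar_0001.parquet").sort_by(
    [("key.frame_timestamp_micros", "ascending"),
     ("key.laser_name", "ascending")]
)
```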
Short of re-encoding 1.2TB worth of parquet files, do you perhaps have a better trick?
Nope, I just re-encoded.
Ouch, how long did it take you to re-encode a few TB of parquet files?
About a day on a 48-CPU server, I think. It's not that slow. Make sure to enable brotli compression, otherwise sparse data such as lidar return maps will take up a huge amount of space.
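(The snippet referred to below is not reproduced in this thread; a minimal sketch of the idea with pyarrow, using placeholder file names and assumed Waymo key column names, would be:)

```python
import pyarrow.parquet as pq

table = pq.read_table("lidar_0001.parquet")
table = table.sort_by([("key.frame_timestamp_micros", "ascending"),
                       ("key.laser_name", "ascending")])
pq.write_table(table, "lidar_0001_reencoded.parquet",
               row_group_size=4,        # tiny groups keep single-row reads cheap
               compression="brotli")    # far smaller than the default snappy for sparse columns
```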
Hi, I tried your snippet but it's giving me an OOM error despite loads of free RAM.
Do you mind sharing your re-encoding script? Thank you very much in advance. 🙂
What's the rationale behind a row_group_size of 4? This seems quite small and possibly inefficient.
For random access you will only ever need a single row from each group, so you want groups to be as small as possible. For point-cloud data, 4 rows already amount to hundreds of kB, so compression will be close to the maximum ratio it can achieve. Feel free to run your own tests and adjust to your taste.
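(If you want to check what that amounts to on disk, one way is to look at the per-group parquet metadata; a sketch, with an assumed file name:)

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("lidar_0001_reencoded.parquet").metadata
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    compressed = sum(rg.column(c).total_compressed_size
                     for c in range(rg.num_columns))
    print(f"group {i}: {rg.num_rows} rows, {compressed / 1024:.0f} kB compressed")
```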
Alright, I fixed my script.
With the threaded approach, re-sorting a few hundred GB of parquet files finished in just under 2 hours, so I think it's a reasonable tradeoff; of course, this depends on your I/O and CPU count (I happen to have access to fast disks and lots of CPU cores).
This does need a lot of memory though, so I had to limit my thread count (for example, 8 threads seem to work on a cluster node with 300 GB of RAM).
In case anyone needs it:
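(The script itself is not reproduced in this thread; the following is a sketch of the threaded approach described above, assuming pyarrow, with placeholder paths and assumed Waymo key column names.)

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pyarrow.parquet as pq

SRC = Path("waymo_open_dataset_v2")         # original parquet files (placeholder path)
DST = Path("waymo_open_dataset_v2_sorted")  # re-encoded copy (placeholder path)
SORT_KEYS = ["key.frame_timestamp_micros", "key.laser_name"]  # assumed key columns


def reencode(src: Path) -> None:
    dst = DST / src.relative_to(SRC)
    dst.parent.mkdir(parents=True, exist_ok=True)
    table = pq.read_table(src)  # the whole file is held in memory, hence the RAM pressure
    keys = [(k, "ascending") for k in SORT_KEYS if k in table.column_names]
    if keys:
        table = table.sort_by(keys)
    pq.write_table(table, dst,
                   row_group_size=4,       # a lookup only touches one tiny group
                   compression="brotli")   # keeps sparse lidar return maps small


if __name__ == "__main__":
    # Every worker holds a full table in memory, so cap the count to fit your RAM.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(reencode, SRC.rglob("*.parquet")))
```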
Fetching individual rows identified by sensor and timestamp in the parquet files is slow.
Simply re-encoding the files with better options can significantly improve the access time.
Baseline:
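(The original benchmark code and timings are not included here; roughly, the baseline is a filtered single-row read against the unmodified files, something like this sketch with assumed column names and placeholder key values:)

```python
import pyarrow.parquet as pq

ts, sensor = 1510593618340201, 1   # placeholder key values
row = pq.read_table(
    "lidar_0001.parquet",          # placeholder file name
    filters=[("key.frame_timestamp_micros", "=", ts),
             ("key.laser_name", "=", sensor)],
)
```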
Sorted, with a small row group size:
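(Again the original code and timings are missing; the read itself stays the same, only the file changes. Because the rows are sorted and each row group holds only 4 rows, pyarrow should be able to skip almost every group from the min/max statistics and only decompress a tiny slice of the file. A sketch:)

```python
import pyarrow.parquet as pq

ts, sensor = 1510593618340201, 1     # same placeholder key values as above
row = pq.read_table(
    "lidar_0001_reencoded.parquet",  # sorted, row_group_size=4, brotli
    filters=[("key.frame_timestamp_micros", "=", ts),
             ("key.laser_name", "=", sensor)],
)
```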
Note: sorting by timestamp first, then by sensor, is 4 times faster than the opposite order on my computer.