waymo-research / waymo-open-dataset

Waymo Open Dataset
https://www.waymo.com/open

Efficiently loading Waymo raw data? #856

Open aryasenna opened 3 months ago

aryasenna commented 3 months ago

Hello,

I'm using the v2.0.0 dataset and have successfully followed the example on loading Waymo data with Dask.

This is fine for quick testing, but the same method does not scale well in my data loader. Dask is nice, but actually calling Dask's compute() to get the data takes some time, even with a fast disk.

When shuffling is enabled, the data loader samples frames randomly, so I can't eagerly load each parquet file in order.

Example: when the loader happens to sample 10 frames from 10 different parquet files, loading becomes an I/O bottleneck, even with multiple workers.

Preloading the whole dataset is out of the question due to memory constraints.

I have been looking at how other frameworks (e.g. Mmdetect) and 3rd party libraries (e.g. Pytorch Waymo loader) are using Waymo: they pre-convert the training frames (e.g. to pickle) so the access time is fast even when frames are randomly sampled.
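To make the pre-conversion idea concrete, here is a minimal sketch using only the Python standard library. It is not how Mmdetect or the Pytorch Waymo loader actually do it; the helper names (`convert_frames`, `load_frame`), the key format, and the dummy frame contents are all hypothetical. The point is only that one small file per frame makes a random access cost a single small read.

```python
import pickle
import tempfile
from pathlib import Path


def convert_frames(frames, out_dir):
    """Write each (key, frame) pair to its own pickle file; return the sorted keys."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    keys = []
    for key, frame in frames:
        with open(out_dir / f"{key}.pkl", "wb") as f:
            pickle.dump(frame, f, protocol=pickle.HIGHEST_PROTOCOL)
        keys.append(key)
    return sorted(keys)


def load_frame(out_dir, key):
    """Load a single frame by key; only that frame's file is read."""
    with open(Path(out_dir) / f"{key}.pkl", "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for frames extracted once from the parquet files.
frames = [
    (f"ctx0_{ts}", {"timestamp": ts, "image": bytes([ts % 256])})
    for ts in (100, 200, 300)
]

out_dir = tempfile.mkdtemp()
index = convert_frames(frames, out_dir)   # one-off conversion step
sample = load_frame(out_dir, "ctx0_200")  # random access during training
```

The trade-off is extra disk space and a one-off conversion pass in exchange for cheap random reads.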

Is this the recommended way? I had assumed that parquet files + Dask were meant to address this exact issue.

Thanks in advance for the insight.

aryasenna commented 3 months ago

In case anyone is wondering:

One possible solution, depending on your use case, is to use "pushdown filtering".

Too bad that the Waymo v2 example/tutorial never mentioned the use of Dask's filtering.

The filtering should be done when you first read your parquet file:

e.g.

import os
import dask.dataframe as dd

# directories, context_name, timestamp and CameraName are assumed to be defined earlier.
image_df = dd.read_parquet(
    os.path.join(directories['CameraImage'], context_name + '.parquet'),
    columns=['key.frame_timestamp_micros', 'key.camera_name', '[CameraImageComponent].image'],
    filters=[('key.frame_timestamp_micros', '==', timestamp), ('key.camera_name', '==', CameraName.FRONT.value)]
)

This approach works for me because my training loader only expects one timestamp and a certain camera. The idea is to make sure you only load part of the parquet.

I will leave the issue open for visibility in case the Waymo team wants to update their documentation.

JingweiJ commented 3 months ago

Yes, pushdown filtering is a good way to improve efficiency. We mention it briefly in the "A relational database-like structure" section of the example/tutorial, but we should discuss it more from the efficiency angle. Thanks for the advice!

aryasenna commented 3 months ago

@JingweiJ Thanks for checking this issue. Yes, you're correct: pushdown filtering is there in the short comment. So technically, it was "mentioned".

My point is that in the actual example code, which only uses a single frame, it makes sense to use pushdown filtering by default. 🙂