waymo-research / waymo-open-dataset

Waymo Open Dataset
https://www.waymo.com/open

Efficiently loading Waymo raw data? #856

Open aryasenna opened 4 months ago

aryasenna commented 4 months ago

Hello,

I'm using the v2.0.0 dataset and have successfully followed the example on loading Waymo data using Dask.

This is all fine for quick testing, but when I use the same method in my data loader, things do not scale well. Dask is nice, but when I actually call Dask's compute() to get the data, it takes some time even with a fast disk.

When shuffled, the data loader samples frames randomly, so I can't eagerly load each parquet file in order.

Example: when the loader happens to sample 10 different frames from 10 different parquet files, it becomes an I/O bottleneck, even with multiple workers.

Preloading the whole dataset is out of the question due to memory constraints.

I have been looking at how other frameworks (e.g. MMDetection3D) and third-party libraries (e.g. PyTorch Waymo loaders) use Waymo: they pre-convert the training frames (e.g. to pickle files) so that access is fast even when frames are randomly sampled.
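
For concreteness, here is a minimal sketch of that pre-conversion idea; the paths, output layout, and per-frame pickle format are hypothetical illustrations, not code from those libraries:

```python
import os
import pickle

import dask.dataframe as dd


def convert_context(parquet_path: str, out_dir: str) -> None:
    """Dump each frame to its own pickle so a random read is one small file."""
    os.makedirs(out_dir, exist_ok=True)
    # One context's camera_image table; assumed to fit in memory.
    df = dd.read_parquet(parquet_path).compute()
    for timestamp, frame_rows in df.groupby('key.frame_timestamp_micros'):
        with open(os.path.join(out_dir, f'{timestamp}.pkl'), 'wb') as f:
            pickle.dump(frame_rows, f)
```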

Is this the recommended way? I had the impression that the parquet + Dask setup was meant to address this exact issue.

Thanks in advance for the insight.

aryasenna commented 4 months ago

In case anyone is wondering:

One possible solution, depending on your use case, is "pushdown filtering".

Too bad the Waymo v2 example/tutorial never mentions Dask's filtering.

The filtering should be done when you first read the parquet file, e.g.:

```python
import os
import dask.dataframe as dd

# directories, context_name, timestamp, and CameraName are defined elsewhere in my setup.
image_df = dd.read_parquet(
    os.path.join(directories['CameraImage'], context_name + '.parquet'),
    columns=['key.frame_timestamp_micros', 'key.camera_name', '[CameraImageComponent].image'],
    filters=[('key.frame_timestamp_micros', '==', timestamp),
             ('key.camera_name', '==', CameraName.FRONT.value)],
)
```

This approach works for me because my training loader only expects one timestamp and one camera per sample. The idea is to make sure you only load the part of the parquet file you need.
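
To show how this could plug into training, here is a hypothetical sketch of a PyTorch-style dataset built around such a filtered read; the class name and index structure are my own inventions, not part of the Waymo API:

```python
import os

import dask.dataframe as dd
from torch.utils.data import Dataset


class WaymoFrameDataset(Dataset):
    """Hypothetical per-frame dataset using parquet pushdown filters."""

    def __init__(self, camera_image_dir, index):
        self.camera_image_dir = camera_image_dir
        # index: list of (context_name, frame_timestamp_micros) pairs.
        self.index = index

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        context_name, timestamp = self.index[i]
        df = dd.read_parquet(
            os.path.join(self.camera_image_dir, context_name + '.parquet'),
            columns=['[CameraImageComponent].image'],
            filters=[('key.frame_timestamp_micros', '==', timestamp),
                     ('key.camera_name', '==', 1)],  # 1 == FRONT in the Waymo proto enum
        ).compute()
        return df['[CameraImageComponent].image'].iloc[0]  # JPEG-encoded bytes
```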

I will leave the issue open for visibility in case the Waymo team wants to update their documentation.

JingweiJ commented 4 months ago

Yes, pushdown filtering is a good way to improve efficiency. We mention it briefly in the "A relational database-like structure" section of the example/tutorial, but we should discuss it more from the efficiency angle. Thanks for the advice!

aryasenna commented 4 months ago

@JingweiJ Thanks for checking this issue. Yes, you're correct, pushdown filtering is there in a short comment. So technically, it was "mentioned".

My point is that the actual example code only loads a single frame, so it would make sense for it to use pushdown filtering by default. 🙂

nlgranger commented 3 weeks ago

In case it helps anyone, I have written a library that can load any given data sample from Waymo (and also KITTI, nuScenes, or ZOD). As we discussed in https://github.com/waymo-research/waymo-open-dataset/issues/841, one needs to re-encode the parquet files to make random access fast.
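
For illustration only (this is not the library's actual code, just the idea): re-encoding amounts to rewriting each parquet file with small row groups, so that a filtered read touches only the row group holding the requested frame. A minimal sketch with pyarrow:

```python
import pyarrow.parquet as pq


def reencode(src: str, dst: str, rows_per_group: int = 32) -> None:
    """Rewrite a parquet file with small row groups for fast random access."""
    table = pq.read_table(src)
    # Smaller row groups let a pushdown filter skip almost everything,
    # at the cost of more metadata and slightly worse compression.
    pq.write_table(table, dst, row_group_size=rows_per_group)
```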

The library is here: https://github.com/CEA-LIST/tri3d

It is a bit opinionated because I needed to settle on common conventions across the datasets, but I think you'll find it does what you expect most of the time. Notably, it has sane defaults for interpolating poses (ego car, boxes, sensors), so that when you request something at, say, LiDAR frame 12, it will actually overlap well with the point cloud.