snumprlab / cl-alfred

Official Implementation of CL-ALFRED (ICLR'24)
https://bhkim94.github.io/projects/CL-ALFRED/
GNU General Public License v3.0

Data download #1

Closed JACK-Chen-2019 closed 5 months ago

JACK-Chen-2019 commented 5 months ago

Do we need to download the entire 1.6 TB of data? ALFRED gives two options (https://github.com/askforalfred/alfred/tree/master/data):

- Modeling Quickstart (~17GB, recommended): trajectory JSONs and ResNet features
- Full Dataset (~109GB): trajectory JSONs, raw images, PDDL states, videos, full ResNet features

1.6 TB is too large to download easily. Do we need to train on all of the data?

bhkim94 commented 5 months ago

Hi @JACK-Chen-2019,

Thank you for your interest in our work!

Unfortunately, yes, you need the entire dataset for reproduction. The size is large because 1) we use surrounding views (1 $\rightarrow$ 5 views) and 2) for faster training, we cache the features of all these views after randomizing them with the image augmentation used in MOCA.

To avoid this huge dataset size, you may extract the features "on the fly" (e.g., extract the ResNet features whenever models.seq2seq_im_mask.featurize is called). However, we observed that this on-the-fly extraction slows down training significantly.

Alternatively, you can simply omit image augmentation during training, which avoids storing the heavy augmented features. Note that this may decrease the overall performance of the baselines and of our model, resulting in numbers different from those in our paper.

Let me know if you have any further questions. Thanks.

bhkim94 commented 5 months ago

Closing this issue. Feel free to reopen this if you have any questions.