libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

FFCV doesn't work for large datasets #389

Open · richardrl opened this issue 3 weeks ago

richardrl commented 3 weeks ago

I am trying to load a 600GB dataset.

It froze for an hour on np.fromfile in ffcv/ffcv/reader.py (line 70) before I gave up and cancelled it.

I tried to fix this by using np.memmap instead:

        # First attempt: writable mapping over the whole file
        alloc_table = np.memmap(self._fname, dtype=ALLOC_TABLE_TYPE,
                                offset=offset, shape=file_size, mode="r+")
        # alloc_table = np.fromfile(self._fname, dtype=ALLOC_TABLE_TYPE,
        #                           offset=offset)

The first time I did this, the subsequent code somehow grew my 262GB .beton file to 6.2TB.
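
(A likely explanation, assuming standard numpy memmap behavior and that ALLOC_TABLE_TYPE is a 24-byte record, which the roughly 24x growth suggests: shape is interpreted as a number of records rather than bytes, and in "r+" mode np.memmap extends the file on disk whenever the requested mapping is larger than the file. A rough sketch of the arithmetic, using a stand-in dtype:)

    import numpy as np

    # Stand-in for ALLOC_TABLE_TYPE (3 x uint64 = 24 bytes per record); the real dtype lives in ffcv.
    demo_dtype = np.dtype([('f0', '<u8'), ('f1', '<u8'), ('f2', '<u8')])

    file_size = 262 * 10**9                        # ~262GB .beton, in bytes
    requested = file_size * demo_dtype.itemsize    # shape=file_size asks for this many bytes
    print(requested / 10**12)                      # ~6.3TB: in "r+" mode the file is grown to cover the mapping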

I need to recreate the .beton now and try memmap with just the read flag, to see if I can get this working. Otherwise, any tips?

richardrl commented 3 weeks ago

        # Read-only mapping; shape is the record count, not a byte count
        alloc_table = np.memmap(self._fname, dtype=ALLOC_TABLE_TYPE, offset=int(offset),
                                shape=int(file_size / ALLOC_TABLE_TYPE.itemsize), mode="r")

I applied this fix, and it seems to make that part of the code finish immediately. What other changes are needed to support the large-dataset regime?

Also, the length of my dataset can be very large in some cases, up to 100 million frames. I wonder whether there is a bottleneck in FFCV there as well.

richardrl commented 3 weeks ago

I made the dataset length much smaller, and now I'm able to construct my dataloader:

    train_dataloader = Loader(cfg.dataloader.beton_path,
                              batch_size=cfg.dataloader.batch_size,
                              num_workers=cfg.dataloader.num_workers,
                              order=TemporalClipOrder,
                              pipelines=PIPELINES,
                              order_kwargs=dict(
                                  metadata_dict=torch.load(cfg.dataloader.metadata_path),
                                  num_clips=cfg.dataloader.order_kwargs.num_clips,
                                  sequence_length=cfg.horizon,
                                  pad_before=cfg.dataloader.order_kwargs.pad_before,
                                  pad_after=cfg.dataloader.order_kwargs.pad_after,
                                  frame_skip=cfg.dataloader.order_kwargs.frame_skip,
                                  artificial_video_ends=cfg.dataloader.order_kwargs.artificial_video_ends
                              ),
                              os_cache=False,
                              drop_last=True)

However, I get a segfault immediately when I try to access it. I presume this is due to using the memmap. Any suggestions for how to make this whole setup work with a large dataset?

The segfault happens right after the pdb breakpoint:

    with tqdm.tqdm(train_dataloader, desc=f"Training epoch {self.epoch}",
                   leave=True, mininterval=cfg.training.tqdm_interval_sec) as tepoch:
        import pdb
        pdb.set_trace()
        for batch_idx, batch in enumerate(tepoch):
            ...
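
(One low-effort way to get more detail on the crash, and this is just a generic debugging suggestion rather than anything FFCV-specific: enable faulthandler before iterating, so the process prints the Python stack when the fatal signal arrives:)

    import faulthandler

    # Dump a Python-level traceback if the process receives SIGSEGV, SIGBUS, etc.
    faulthandler.enable()

    for batch_idx, batch in enumerate(train_dataloader):
        pass  # the crash should now be accompanied by a traceback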

richardrl commented 3 weeks ago

I'm finding that both the initial .beton creation and the initial dataloader load require the full dataset to fit in memory; I get OOM errors otherwise. This happens even with os_cache=False.

richardrl commented 3 weeks ago

I got things to work by setting num_workers=0 after the .beton was created. I'm not sure why, but this seems related: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/

I guess FFCV is doing something internally that blows up memory in the Loader(...) call when num_workers > 0.

I still haven't tested whether things work during the .beton creation stage; I was getting OOM there unless I had more memory than the dataset size. I was using 60 workers.
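
(For the creation stage, a minimal sketch of what I plan to try next, assuming the standard DatasetWriter API from the FFCV quickstart; the field spec below is a placeholder for my actual fields, and my_dataset stands in for my indexed dataset:)

    from ffcv.writer import DatasetWriter
    from ffcv.fields import RGBImageField, IntField

    # Placeholder field spec; the real dataset uses different fields.
    writer = DatasetWriter(cfg.dataloader.beton_path, {
        'image': RGBImageField(max_resolution=256),
        'label': IntField(),
    }, num_workers=8)  # far fewer than 60, to check whether writer memory scales with worker count

    writer.from_indexed_dataset(my_dataset)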