libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

MemoryError: Not enough memory to fit the whole sample #209

Open yajivunev opened 2 years ago

yajivunev commented 2 years ago

Hello!

I'm having trouble writing my dataset.

The batches are already built as numpy arrays, so I just have to feed them through a dataset class so they can be written to .beton. Each batch consists of 4 numpy arrays (see the field dictionary below), and all batches live in a dictionary called batches with indices as keys, so batches[n] returns a tuple of the 4 numpy arrays. I even stack them by field to resemble the LinearRegressionDataset example in the docs, i.e. I do batches = [np.stack([batches[i][j] for i in range(len(batches))]) for j in range(4)], just in case.
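For concreteness, here is a runnable sketch of that restacking step (the dummy batches dict with two samples is just a placeholder; the shapes match the fields declared below):

import numpy as np

# placeholder for the batches dict: index -> (raw, labels_mask, gt_affs, affs_weights)
batches = {
    i: (np.zeros((1, 48, 196, 196), np.float32),
        np.zeros((1, 28, 104, 104), np.uint8),
        np.zeros((3, 28, 104, 104), np.uint8),
        np.zeros((1, 3, 28, 104, 104), np.float32))
    for i in range(2)
}

# restack by field: one array of shape (n_samples, *field_shape) per field
stacked = [np.stack([batches[i][j] for i in range(len(batches))]) for j in range(4)]
print([a.shape for a in stacked])
# [(2, 1, 48, 196, 196), (2, 1, 28, 104, 104), (2, 3, 28, 104, 104), (2, 1, 3, 28, 104, 104)]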

My dataset class is:

class Dataset:

    def __init__(self, batches):
        # batches is the list of 4 per-field stacked arrays
        self.raw = batches[0]
        self.labels_mask = batches[1]
        self.gt_affs = batches[2]
        self.affs_weights = batches[3]

    def __len__(self):
        return len(self.raw)

    def __getitem__(self, index):
        return (self.raw[index], self.labels_mask[index],
                self.gt_affs[index], self.affs_weights[index])

Here is my writer and write command:

writer = DatasetWriter(os.path.join(data_dir, 'test.beton'), {
    'raw': NDArrayField(shape=(1, 48, 196, 196), dtype=np.dtype('float32')),
    'labels_mask': NDArrayField(shape=(1, 28, 104, 104), dtype=np.dtype('uint8')),
    'gt_affs': NDArrayField(shape=(3, 28, 104, 104), dtype=np.dtype('uint8')),
    'affs_weights': NDArrayField(shape=(1, 3, 28, 104, 104), dtype=np.dtype('float32')),
}, num_workers=1)

writer.from_indexed_dataset(dataset, chunksize=1)

This always results in the following traceback and then hangs until I hit Ctrl+C:

Traceback (most recent call last):
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/writer.py", line 113, in worker_job_indexed_dataset
    handle_sample(sample, dest_ix, field_names, metadata, allocator, fields)
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/writer.py", line 51, in handle_sample
    field.encode(destination, field_value, allocator.malloc)
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/fields/ndarray.py", line 98, in encode
    destination[0], data_region = malloc(self.element_size)
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/memory_allocator.py", line 65, in malloc
    raise MemoryError("Not enough memory to fit the whole sample")
MemoryError: Not enough memory to fit the whole sample

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/writer.py", line 117, in worker_job_indexed_dataset
    done_number.value += len(chunk)
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/memory_allocator.py", line 119, in __exit__
    self.flush_page()
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/memory_allocator.py", line 84, in flush_page
    assert self.page_offset != 0
AssertionError
^CTraceback (most recent call last):
  File "/scratch1/04101/vvenu/autoseg/cremi/02_train/multi_gpu_test/mkdata_copy.py", line 295, in <module>
    writer.from_indexed_dataset(dataset,chunksize=1)
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/writer.py", line 297, in from_indexed_dataset
    self._write_common(len(indices), chunks(indices, chunksize),
  File "/scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/site-packages/ffcv/writer.py", line 255, in _write_common
    sleep(0.1)
KeyboardInterrupt
  0%|                                                                                                                                                                          | 0/10 [04:29<?, ?it/s]^C
(ffcv) c196-011[rtx](1064)$ /scratch1/04101/vvenu/miniconda3/envs/ffcv/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Each batch is about 10 MB.
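For what it's worth, a quick back-of-the-envelope on the per-sample payload, computed from the NDArrayField shapes above (runnable sketch; the MiB rounding is mine):

import numpy as np

# bytes per sample for each field: prod(shape) * itemsize
sizes = {
    'raw':          int(np.prod((1, 48, 196, 196))) * 4,    # float32
    'labels_mask':  int(np.prod((1, 28, 104, 104))) * 1,    # uint8
    'gt_affs':      int(np.prod((3, 28, 104, 104))) * 1,    # uint8
    'affs_weights': int(np.prod((1, 3, 28, 104, 104))) * 4, # float32
}
for name, nbytes in sizes.items():
    print(f'{name}: {nbytes / 2**20:.1f} MiB')
print(f'total: {sum(sizes.values()) / 2**20:.1f} MiB')
# raw alone is ~7.0 MiB and the full sample is ~11.7 MiB, so a single
# sample can overflow a writer page that is smaller than that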

Appreciate any help! Thank you so much!! Cheers on the project, would love to get it working for my case :)

sajanth commented 1 year ago

@yajivunev Were you able to solve this? I'm currently struggling with the same issue. Originally I got a page size error, but after ramping up the page size I hit the same error as you, even though a single sample should easily fit into my memory.

Update: I was able to resolve the issue by increasing the page size even further.
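For anyone who lands here: concretely, that means passing a larger page_size to DatasetWriter. A sketch against the fields from this issue (1 << 25 bytes = 32 MiB is just an example value picked to sit comfortably above the ~12 MB sample; data_dir and dataset are assumed to be defined as in the original post):

import os
import numpy as np
from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField

writer = DatasetWriter(os.path.join(data_dir, 'test.beton'), {
    'raw': NDArrayField(shape=(1, 48, 196, 196), dtype=np.dtype('float32')),
    'labels_mask': NDArrayField(shape=(1, 28, 104, 104), dtype=np.dtype('uint8')),
    'gt_affs': NDArrayField(shape=(3, 28, 104, 104), dtype=np.dtype('uint8')),
    'affs_weights': NDArrayField(shape=(1, 3, 28, 104, 104), dtype=np.dtype('float32')),
}, num_workers=1, page_size=1 << 25)  # each page must fit the largest single sample

writer.from_indexed_dataset(dataset, chunksize=1)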