libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Tried allocating 144000000 but page size is {self.page_size} #135

Closed: bolandih closed this issue 2 years ago

bolandih commented 2 years ago

Hi,

I am trying to convert my dataset into the FFCV format. There are 3 inputs: img, ldx, and ldy. img has shape (2, 600, 600), ldx has shape (100, 1, 600, 600), and ldy has the same shape as ldx. The output has shape (100, 600, 600), and all inputs and outputs are float32. Creating the writer works fine, but writer.from_indexed_dataset(train_dataset) throws the error:

"Tried allocating 144000000 but page size is {self.page_size}".

import numpy as np
from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField
from dataset import dataset

num = 20
train_dataset = dataset(num, is_train=True)

write_path = '/ffcv_test/d.beton'

# Each sample of train_dataset is a tuple (img, ldx, ldy, output);
# the writer only needs to be created once, not per sample.
writer = DatasetWriter(write_path, {
    'img': NDArrayField(shape=(2, 600, 600), dtype=np.dtype('float32')),
    'ldx': NDArrayField(shape=(100, 1, 600, 600), dtype=np.dtype('float32')),
    'ldy': NDArrayField(shape=(100, 1, 600, 600), dtype=np.dtype('float32')),
    'output': NDArrayField(shape=(100, 600, 600), dtype=np.dtype('float32')),
}, num_workers=16)

writer.from_indexed_dataset(train_dataset)
GuillaumeLeclerc commented 2 years ago

Your samples are more than 8 MB, which is the default page size in FFCV. In your case you need to set your page size to a multiple of 416 MB (see the options on the DatasetWriter class).
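
For reference, a rough per-sample size for the fields described above (back-of-the-envelope arithmetic only; float32 = 4 bytes, variable names are just illustrative):

img_bytes    = 2 * 600 * 600 * 4           #   2,880,000
ldx_bytes    = 100 * 1 * 600 * 600 * 4     # 144,000,000
ldy_bytes    = 100 * 1 * 600 * 600 * 4     # 144,000,000
output_bytes = 100 * 600 * 600 * 4         # 144,000,000
total_bytes  = img_bytes + ldx_bytes + ldy_bytes + output_bytes
print(total_bytes)                         # 434,880,000 bytes, roughly 415 MiB per sample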

GuillaumeLeclerc commented 2 years ago

https://docs.ffcv.io/api/writer.html

On Mon, Feb 7, 2022, 10:18 AM H-B-L wrote:

@GuillaumeLeclerc, thanks for your comment. Can you please provide a link to how to change the page size? I could not find it in the repository.


bolandih commented 2 years ago

@GuillaumeLeclerc thanks for your help. I changed MIN_PAGE_SIZE = 1 << 21 to MIN_PAGE_SIZE = 1 << 31 and it worked, but when I try to load the data with the Loader as below:


from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import NDArrayDecoder
from ffcv.transforms import ToTensor

batch_size = 8
num_workers = 8

ORDERING = OrderOption.RANDOM

PIPELINES = {
  'img': [NDArrayDecoder(), ToTensor()],
  'ldx': [NDArrayDecoder(), ToTensor()],
  'ldy': [NDArrayDecoder(), ToTensor()],
  'output': [NDArrayDecoder(), ToTensor()]
}

write_path = '/ffcv_test/d.beton'

loader = Loader(write_path,
                batch_size=batch_size,
                num_workers=num_workers,
                order=ORDERING,
                pipelines=PIPELINES)

it raised the error:

File "/home/user/anaconda3/lib/python3.9/site-packages/ffcv/memory_managers/base.py", line 52, in __init__
    page_size_bit_location = int(np.log2(reader.page_size))
OverflowError: cannot convert float infinity to integer

I printed self.page_size = header['page_size'] in the Reader class and it was zero!
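
For anyone hitting the same thing, a quick way to inspect what was actually written is a sketch like the one below, assuming ffcv.reader.Reader can be constructed from just the .beton path (the reader.page_size attribute is the one referenced in the traceback):

from ffcv.reader import Reader

reader = Reader('/ffcv_test/d.beton')
print(reader.page_size)   # was 0 here, so np.log2(page_size) is -inf and int() overflows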

bolandih commented 2 years ago

I hard-coded the page size in base.py, changing page_size_bit_location = int(np.log2(reader.page_size)) to page_size_bit_location = int(np.log2(8589934592)), and it worked.

GuillaumeLeclerc commented 2 years ago

@H-B-L Why would you change the FFCV code when there is an argument that you can simply pass to the DatasetWriter?

class ffcv.writer.DatasetWriter(fname: str, fields: Mapping[str, Field], page_size: int = 8388608, num_workers: int = -1)
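
A minimal sketch of what that looks like with the fields from the original post (train_dataset is the dataset from the first comment; the 1 << 29 value is just an illustrative choice: a power of two, larger than one ~415 MiB sample, and well under 4 GiB):

import numpy as np
from ffcv.writer import DatasetWriter
from ffcv.fields import NDArrayField

writer = DatasetWriter('/ffcv_test/d.beton', {
    'img':    NDArrayField(shape=(2, 600, 600), dtype=np.dtype('float32')),
    'ldx':    NDArrayField(shape=(100, 1, 600, 600), dtype=np.dtype('float32')),
    'ldy':    NDArrayField(shape=(100, 1, 600, 600), dtype=np.dtype('float32')),
    'output': NDArrayField(shape=(100, 600, 600), dtype=np.dtype('float32')),
}, num_workers=16, page_size=1 << 29)   # page_size passed explicitly instead of patching FFCV

writer.from_indexed_dataset(train_dataset)
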
bolandih commented 2 years ago

@GuillaumeLeclerc Thanks for mentioning that, later I used the argument.

bolandih commented 2 years ago

@GuillaumeLeclerc Since my dataset is large I use os_cache=False and QUASI_RANDOM, and it raised the error below:

/home/user/anaconda3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/manager.py:36: RuntimeWarning: divide by zero encountered in log2
  page_size_log2 = np.uint32(np.log2(page_size))

Please note that when I am not passing these arguments, the Loader works fine for a small part of the data, but if I increase the number of samples it freezes after epoch 0.

GuillaumeLeclerc commented 2 years ago

Which arguments? There should be a single one. Make sure you don't change the global variables like you suggested earlier. It seems that your page size is zero here. Are you sure the value you passed is a multiple of 2 MB?

QUASI_RANDOM shouldn't matter in your case; you can use regular RANDOM. Your samples are so big that it doesn't matter. The goal of quasi-random ordering is to group the samples a little so that you don't need to keep too many pages in RAM, but in your case each sample fills a whole page, so as soon as you have trained on it the page can be discarded and you have no memory overhead.
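
One way to pick a value that satisfies both constraints, as a sketch: round the per-sample size up to the next power of two, which is automatically a multiple of 2 MB once it is at least the 2 MiB minimum (the sample_bytes figure is the rough total computed earlier in the thread):

import math

sample_bytes = 434_880_000                  # rough per-sample size from above
MIN_PAGE_SIZE = 1 << 21                     # FFCV's 2 MiB minimum

page_size = max(MIN_PAGE_SIZE, 1 << math.ceil(math.log2(sample_bytes)))
print(page_size)                            # 536870912 = 1 << 29 (512 MiB)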


bolandih commented 2 years ago

Thanks for your feedback! I changed back any global variables I had modified earlier. I just passed page_size=4294967269 to the DatasetWriter and regenerated the dataset. When I start training, it throws the error below.

File "/home/user/anaconda3/lib/python3.9/site-packages/ffcv/memory_managers/base.py", line 53, in __init__
    page_size_bit_location = int(np.log2(reader.page_size))
OverflowError: cannot convert float infinity to integer

If I hard-code the page size directly in base.py as I mentioned earlier, the Loader works without any error, but the process gets killed after epoch zero. The size of dataset_train.beton is 194 GB.

You can see how I arranged the loader in PyTorch Lightning below:

def train_dataloader(self):
    PIPELINES = {
        'img': [NDArrayDecoder(), ToTensor()],
        'ldx': [NDArrayDecoder(), ToTensor()],
        'ldy': [NDArrayDecoder(), ToTensor()],
        'output': [NDArrayDecoder(), ToTensor()]}
    path0 = os.getcwd()
    path = os.path.join(path0, 'ffcv_test', 'dataset_train.beton')
    ORDERING = OrderOption.RANDOM
    train_loader = Loader(path,
                          batch_size=batch_size,
                          num_workers=num_workers,
                          order=ORDERING,
                          drop_last=False,
                          distributed=False,
                          os_cache=False,
                          pipelines=PIPELINES)
    return train_loader
GuillaumeLeclerc commented 2 years ago

It's hard for me to investigate without a script that I can run.

Could you add a breakpoint there and tell us what the values of the variables are: File "/home/user/anaconda3/lib/python3.9/site-packages/ffcv/memory_managers/base.py", line 53, in __init__

Otherwise, could you provide a minimal complete script (with the DatasetWriter using dummy data) so that we can see what is wrong on our end?

bolandih commented 2 years ago

@GuillaumeLeclerc You can download a minimal complete script and dummy input and output data from this repository. I changed the size of the data from (100, 600, 600) to (10, 60, 60) to be able to push it to the repository. For me the FFCV Loader throws this error:

File "/home/user/anaconda3/lib/python3.9/site-packages/ffcv/memory_managers/base.py", line 52, in __init__
    page_size_bit_location = int(np.log2(reader.page_size))
OverflowError: cannot convert float infinity to integer

GuillaumeLeclerc commented 2 years ago

Sorry for the delay. I'm looking at the bug right now

GuillaumeLeclerc commented 2 years ago

Hello, your issue is that your page size is 4 GiB, not 409 MiB as your comment suggests. The maximum page size for FFCV is currently 4290772992 bytes (< 4 GiB). I will add an error in the next version for page sizes bigger than this threshold.
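
So as a quick check against that limit (the 4290772992 figure comes straight from the comment above; the rest is only illustrative):

MAX_PAGE_SIZE = 4290772992         # current FFCV maximum, per the comment above

print(4294967269 > MAX_PAGE_SIZE)  # True: the value passed earlier exceeds the limit
print((1 << 29) <= MAX_PAGE_SIZE)  # True: 512 MiB would be a safe choice for ~415 MiB samples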

realliyifei commented 1 year ago

After I define the page_size in the dataset writer and generate the .beton data, the loader stops in the middle.

Is there anything corresponding I need to add to the pipeline after changing the page_size of the dataset writer?