mboudiaf / pytorch-meta-dataset

A non-official 100% PyTorch implementation of META-DATASET benchmark for few-shot classification

ResNet structure #13

Open chmxu opened 2 years ago

chmxu commented 2 years ago

Hi,

The original TensorFlow implementation uses the standard structure for the first convolution layer, i.e., a 7x7 kernel with stride 2 and padding 3, followed by a 3x3 max pooling layer (link), while your implementation uses a 3x3 kernel for this layer and no max pooling (link). As a result the feature maps are much larger and cost more memory. I also noticed that in the PAMI version of TIM the authors claim the PyTorch versions of the baselines are much better than the original ones, and I wonder if this performance boost comes from this modification. The 'larger' version of the ResNet also seems impractical for Meta-Dataset, since it leads to OOM when trained with ProtoNet or other episodic methods. Please let me know if I have misunderstood the code.
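For concreteness, here is a minimal comparison of the two stems in PyTorch (assuming 84x84 inputs and 64 output channels purely for illustration; the exact settings in either codebase may differ):

    import torch
    import torch.nn as nn

    # Standard ImageNet-style ResNet stem (as in the original TF implementation):
    # 7x7 conv, stride 2, padding 3, followed by a 3x3 max pooling layer.
    standard_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )

    # Stem with a 3x3 kernel and no max pooling, as I understand this repo's version.
    small_kernel_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )

    x = torch.randn(1, 3, 84, 84)
    print(standard_stem(x).shape)      # torch.Size([1, 64, 21, 21])
    print(small_kernel_stem(x).shape)  # torch.Size([1, 64, 84, 84]), 16x more spatial positions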

Thanks.

chmxu commented 2 years ago

Also, when I use the default script for episodic training, the RAM usage increases dramatically during training: the process uses about 100 GB of RAM after about 300 iterations. I don't know if this is expected.

mboudiaf commented 2 years ago

Hi,

Thanks for raising this issue. Let me investigate both problems and get back to you ASAP.

Update: 1) Could you try again and let me know if the RAM problem is solved? 2) As for the ResNet structure, there is indeed some discrepancy in the literature between the ResNet-18 implemented in my code and the custom ResNet-12 used in several few-shot works. I will add the latter architecture soon.
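For reference, a minimal sketch of that custom ResNet-12, following common few-shot implementations (three 3x3 conv-BN layers per residual block, widths 64-160-320-640, 2x2 max pooling after each block); the exact details of what I add to the repo may differ:

    import torch
    import torch.nn as nn

    class Res12Block(nn.Module):
        """Residual block of the few-shot ResNet-12: three 3x3 conv-BN layers,
        a 1x1 projection shortcut, LeakyReLU activations and a 2x2 max pool."""

        def __init__(self, in_channels: int, out_channels: int) -> None:
            super().__init__()

            def conv_bn(cin: int, cout: int) -> nn.Sequential:
                return nn.Sequential(
                    nn.Conv2d(cin, cout, kernel_size=3, padding=1, bias=False),
                    nn.BatchNorm2d(cout),
                )

            self.conv1 = conv_bn(in_channels, out_channels)
            self.conv2 = conv_bn(out_channels, out_channels)
            self.conv3 = conv_bn(out_channels, out_channels)
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_channels),
            )
            self.relu = nn.LeakyReLU(0.1)
            self.pool = nn.MaxPool2d(2)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = self.relu(self.conv1(x))
            out = self.relu(self.conv2(out))
            out = self.conv3(out)
            out = self.relu(out + self.shortcut(x))
            return self.pool(out)

    class ResNet12(nn.Module):
        """Four residual blocks followed by global average pooling."""

        def __init__(self) -> None:
            super().__init__()
            widths, cin, blocks = [64, 160, 320, 640], 3, []
            for cout in widths:
                blocks.append(Res12Block(cin, cout))
                cin = cout
            self.blocks = nn.Sequential(*blocks)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            feats = self.blocks(x)         # (B, 640, H/16, W/16)
            return feats.mean(dim=(2, 3))  # (B, 640) embeddings

    print(ResNet12()(torch.randn(2, 3, 84, 84)).shape)  # torch.Size([2, 640])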

chmxu commented 2 years ago

Hi, thank you for your reply! I modified the training script based on your new version so that it skips the model forward and backward passes and only iterates over the dataloader, printing the memory usage as follows:

    import psutil

    # Only iterate the dataloader (no forward/backward pass) and report system RAM usage.
    for i, data in enumerate(tqdm_bar):
        if i >= args.num_updates:
            break

        print("PERCENTAGE RAM USED", psutil.virtual_memory().percent)
        continue

In my trial the percentage of used memory keeps increasing, so I think there may be a memory leak when reading the tfrecord files, but I cannot pin it down. My PyTorch version is 1.9.0, with CUDA 11.1. Maybe you can try my code and see if you can reproduce the problem.
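If it helps narrow things down, one idea would be to report the resident memory of the main process and of its dataloader workers separately, to see whether the growth sits in the workers reading the tfrecords or in the main process (a rough sketch; report_rss is just a hypothetical helper I would call once per iteration):

    import os
    import psutil

    def rss_gb(pid: int) -> float:
        """Resident set size of one process, in GB."""
        return psutil.Process(pid).memory_info().rss / 1024 ** 3

    def report_rss() -> None:
        """Print the RSS of the main process and of its children (e.g. DataLoader workers)."""
        main = psutil.Process(os.getpid())
        print(f"main {main.pid}: {rss_gb(main.pid):.2f} GB")
        for child in main.children(recursive=True):
            print(f"  worker {child.pid}: {rss_gb(child.pid):.2f} GB")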

mboudiaf commented 2 years ago

I tried my new code before pushing it and had no memory leakage: when using my PyTorch loader, the RAM capped at 16.5 GB. Can you please confirm that, by running my original code:

bash scripts/train.sh protonet resnet18 ilsvrc_2012

you don't see any leakage? Thanks.

chmxu commented 2 years ago

I re-cloned the repo and trained the ProtoNet with your original code. After 1400 iterations, 23 GB of RAM is used. When I train the model with 4 GPUs (by modifying the GPU configuration in base.yaml), about 80 GB of RAM is used at 1100 iterations, and the usage keeps increasing slowly in both cases.

I assume the RAM usage is correlated with the number of GPUs (since DDP is used) and with the size of an episode. So when the episodes are large, which is exactly the case in Meta-Dataset where the largest support set can contain 500 images, and I want to use multiple GPUs, the code may require an enormous amount of RAM. I wonder if there is any solution to this problem.
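A rough back-of-envelope check with the numbers above is at least consistent with each DDP process duplicating its own data pipeline:

    # Back-of-envelope check using the figures reported above (GB):
    single_gpu_ram = 23      # 1 GPU, after ~1400 iterations
    four_gpu_ram = 80        # 4 GPUs (DDP), after ~1100 iterations

    # If every DDP process holds its own copy of the dataloader pipeline,
    # total RAM should scale roughly linearly with the number of processes.
    per_process = four_gpu_ram / 4
    print(per_process)       # 20.0 GB per process, close to the single-GPU figure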