mmasana / FACIL

Framework for Analysis of Class-Incremental Learning with 12 state-of-the-art methods and 3 baselines.
https://arxiv.org/pdf/2010.15277.pdf
MIT License

Getting "RuntimeError: Expected object of scalar type Long but got scalar type Int for argument #2" #4

Closed snokh closed 3 years ago

snokh commented 3 years ago

While executing the python3 -u src/main_incremental.py script, the code gives the error below:

(error screenshot attached as an image)

mmasana commented 3 years ago

There is no line 63 in finetuning.py (see code), and there is no final_target variable in the corresponding criterion() function. I would need more insight into which modifications you have added to be able to guess where the error comes from.

snokh commented 3 years ago

The two lines and 'final_target' were added for debugging purposes only.

I have set up my virtual environment and cloned the repo again, and the error still persists. Please find the attached JPG (FACIL bug).

mmasana commented 3 years ago

I just cloned the repository from scratch on a fresh machine and didn't get that error (output pasted below). So I'm not sure how to reproduce your error. Some possibilities that come to mind could be:

- could this be related to your path system? I noticed that your download of CIFAR-100 goes to ../data\cifar100\cifar-100-python.tar.gz, which uses both / and \. You can change the data path in dataset_config.py, and the results path in main_incremental.py with the --results-path argument.
- is the dataset empty? It seems that the vector of targets contains a single value instead of the LongTensor that is expected by the CE loss. You could check that by setting a breakpoint on line 170 of incremental_learning.py and checking whether the `targets` variable has a list of labels for the batch.

(base) mmasana@XXX:~/libraries$ git clone https://github.com/mmasana/FACIL.git
Cloning into 'FACIL'...
remote: Enumerating objects: 101, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 101 (delta 0), reused 0 (delta 0), pack-reused 98
Receiving objects: 100% (101/101), 7.62 MiB | 14.58 MiB/s, done.
Resolving deltas: 100% (29/29), done.
(base) mmasana@XXX:~/libraries$ cd FACIL/
(base) mmasana@XXX:~/libraries/FACIL$ ls
docs  environment.yml  LICENSE  README.md  requirements.txt  scripts  src
(base) mmasana@XXX:~/libraries/FACIL$ python3 -u src/main_incremental.py
=========================================================
Arguments =
        approach: finetuning
        batch_size: 64
        clipping: 10000
        datasets: ['cifar100']
        eval_on_train: False
        exp_name: None
        fix_bn: False
        gpu: 0
        gridsearch_tasks: -1
        keep_existing_head: False
        last_layer_analysis: False
        log: ['disk']
        lr: 0.1
        lr_factor: 3
        lr_min: 0.0001
        lr_patience: 5
        momentum: 0.0
        multi_softmax: False
        nc_first_task: None
        nepochs: 200
        network: resnet32
        no_cudnn_deterministic: False
        num_tasks: 4
        num_workers: 4
        pin_memory: False
        pretrained: False
        results_path: ../results
        save_models: False
        seed: 0
        stop_at_task: 0
        use_valid_only: False
        warmup_lr_factor: 1.0
        warmup_nepochs: 0
        weight_decay: 0.0
==========================================================
Approach arguments =
        all_outputs: False
==========================================================
Exemplars dataset arguments =
        exemplar_selection: random
        num_exemplars: 0
        num_exemplars_per_class: 0
==========================================================
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ../data/cifar100/cifar-100-python.tar.gz
100%|#########9| 168468480/169001437 [00:16<00:00, 6700375.69it/s]Extracting ../data/cifar100/cifar-100-python.tar.gz to ../data/cifar100
Files already downloaded and verified
[(0, 25), (1, 25), (2, 25), (3, 25)]
************************************************************************************************************
Task  0
************************************************************************************************************
| Epoch   1, time=  6.3s | Train: skip eval | Valid: time=  0.5s loss=2.688, TAw acc= 19.4% | *
_(program continues on until completion)_
wzjoriv commented 3 years ago

I am getting the same error. All I did was:

python -u src/main_incremental.py --approach finetuning --network resnet34

I am running this on Windows 10 PowerShell.

Output:

(LIGN_test) PS F:\dev\Projects\LIGN\.rug\FACIL> python -u src/main_incremental.py --approach finetuning --network resnet34       
============================================================================================================
Arguments =
        approach: finetuning
        batch_size: 64
        clipping: 10000
        datasets: ['cifar100']
        eval_on_train: False
        exp_name: None
        fix_bn: False
        gpu: 0
        gridsearch_tasks: -1
        keep_existing_head: False
        last_layer_analysis: False
        log: ['disk']
        lr: 0.1
        lr_factor: 3
        lr_min: 0.0001
        lr_patience: 5
        momentum: 0.0
        multi_softmax: False
        nc_first_task: None
        nepochs: 200
        network: resnet34
        no_cudnn_deterministic: False
        num_tasks: 4
        num_workers: 4
        pin_memory: False
        pretrained: False
        results_path: ../results
        save_models: False
        seed: 0
        stop_at_task: 0
        use_valid_only: False
        warmup_lr_factor: 1.0
        warmup_nepochs: 0
        weight_decay: 0.0
============================================================================================================
Approach arguments =
        all_outputs: False
============================================================================================================
Exemplars dataset arguments =
        exemplar_selection: random
        num_exemplars: 0
        num_exemplars_per_class: 0
============================================================================================================
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ../data\cifar100\cifar-100-python.tar.gz
100.0%
Extracting ../data\cifar100\cifar-100-python.tar.gz to ../data\cifar100
Files already downloaded and verified
[(0, 25), (1, 25), (2, 25), (3, 25)]
************************************************************************************************************
Task  0
************************************************************************************************************
C:\Users\josue\anaconda3\envs\LIGN_test\lib\site-packages\torch\nn\functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ..\c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
  File "src/main_incremental.py", line 316, in <module>
    main()
  File "src/main_incremental.py", line 264, in main
    appr.train(t, trn_loader[t], val_loader[t])
  File "F:\dev\Projects\LIGN\.rug\FACIL\src\approach\incremental_learning.py", line 56, in train
    self.train_loop(t, trn_loader, val_loader)
  File "F:\dev\Projects\LIGN\.rug\FACIL\src\approach\finetuning.py", line 52, in train_loop
    super().train_loop(t, trn_loader, val_loader)
  File "F:\dev\Projects\LIGN\.rug\FACIL\src\approach\incremental_learning.py", line 111, in train_loop
    self.train_epoch(t, trn_loader)
  File "F:\dev\Projects\LIGN\.rug\FACIL\src\approach\incremental_learning.py", line 171, in train_epoch
    loss = self.criterion(t, outputs, targets.to(self.device))
  File "F:\dev\Projects\LIGN\.rug\FACIL\src\approach\finetuning.py", line 61, in criterion
    return torch.nn.functional.cross_entropy(outputs[t], targets - self.model.task_offset[t])
  File "C:\Users\josue\anaconda3\envs\LIGN_test\lib\site-packages\torch\nn\functional.py", line 2824, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected object of scalar type Long but got scalar type Int for argument #2 'target' in call to _thnn_nll_loss_forward
wzjoriv commented 3 years ago

One difference I noticed between your output and mine is the following:

Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ../data/cifar100/cifar-100-python.tar.gz

vs

Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ../data\cifar100\cifar-100-python.tar.gz

File path format is different.

mmasana commented 3 years ago

yes, that's what I commented earlier:

could this be related to your path system? I noticed that your download of CIFAR-100 goes to ../data\cifar100\cifar-100-python.tar.gz, which uses both / and \. You can change the data path in dataset_config.py, and the results path in main_incremental.py with the --results-path argument.

It seems like it might be an issue with Windows; the path seems to be wrong. But the system also doesn't complain, so maybe it loads an empty dataset? You could check that with this:

is the dataset empty? It seems that the vector of targets contains a single value instead of the LongTensor that is expected by the CE loss. You could check that by setting a breakpoint on line 170 of incremental_learning.py and checking whether the `targets` variable has a list of labels for the batch.

Could you try it and let me know what you get? Also, --network resnet34 should be --network resnet32 if you use small input datasets such as CIFAR-100.
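
If it helps, here is a minimal sketch of that check (my suggestion, not code from the repo), assuming the batch labels arrive as a tensor named targets as in the traceback above. Add a temporary print right before the criterion call in train_epoch() of incremental_learning.py:

    # temporary debug print, placed just before loss = self.criterion(t, outputs, targets.to(self.device))
    print(targets.dtype, targets.shape)  # expected: torch.int64 (Long) and a full batch dimension
    print(targets[:8])                   # first few labels of the batch
    # torch.int32 targets, or a single scalar value, would point to the cross_entropy error above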

wzjoriv commented 3 years ago

Got it working on Linux with no issues. I will check on Windows in the future and let you know.

Btw, I had a few questions about your implementations:

1. Are the exemplars defined as the number of data used to retrain the model during rehearsal?
2. Is the exemplar size enforced during initial training for the encountered labels?
3. For fixed memory, will the number of data per class depend on how many labels have been encountered or all the labels that will ever be encountered?
4. How do you set the initial number of classes and step size between tasks?
5. Could you elaborate on the role of grid search?
6. How can one set the scenario in which, for CIFAR-100, we start with 50 classes, have a step size of 10, and there is either fixed or growing memory, or there is access to the full dataset during rehearsal (i.e. retraining)?

Feel free to let me know if you would like me to elaborate on any of the questions.

mmasana commented 3 years ago

Yeah, in Linux it should work fine. Answering your questions:

Are the exemplars defined as the number of data used to retrain the model during rehearsal?

Yes, the exemplars are the number of data/images that will be used during rehearsal.

Is the exemplar size enforced during initial training for the encountered labels?

I'm not sure if I understand the question. The exemplars are selected from the training data of that task at the end of its training session.

For fixed memory, will the number of data per class depend on how many labels have been encountered or all the labels that will ever be encountered?

It depends on the labels encountered. The framework's main comparison strength is that it enforces incremental learning without knowledge of or access to future tasks/labels, as in a realistic scenario. In the case of fixed memory, you have a buffer of X images that is updated after learning each task. Since it is fixed, as more classes are learned, fewer exemplars per class are available.
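
As a rough illustration (my own example; the buffer size of 2000 and the 4-task split are arbitrary choices, not defaults of the framework):

    # fixed buffer of 2000 exemplars, CIFAR-100 split into 4 tasks of 25 classes each
    buffer_size = 2000
    for task, seen_classes in enumerate([25, 50, 75, 100], start=1):
        per_class = buffer_size // seen_classes
        print(f"task {task}: {seen_classes} classes seen -> {per_class} exemplars per class")
    # task 1: 80 per class, task 2: 40, task 3: 26, task 4: 20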

How do you set the initial number of classes and step size between tasks?

Short answer: since most scenarios divide the number of classes equally among tasks, that is the default setting. Longer answer: the arg --nc-first-task allows you to define a larger first task, and providing a list of datasets allows each of them to be learned one after the other. If you want another partition of a dataset, you can either define the partitions as separate datasets of the desired length or modify the corresponding dataset code. I recommend the first, since it can be defined entirely in dataset_config.py by making use of the class_order entry.
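
A rough sketch of that first option (my assumption of how such an entry could look; check src/datasets/dataset_config.py for the actual dict name and fields):

    # hypothetical entry that reuses the stock cifar100 settings but fixes a custom class order
    dataset_config['cifar100_myorder'] = dict(
        dataset_config['cifar100'],    # copy path/transform fields from the existing entry
        class_order=list(range(100)),  # reorder so e.g. the first 50 ids form the large first task
    )

It would then be selected with --datasets cifar100_myorder on the command line.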

Could you elaborate on the role of grid search?

GridSearch was the name we gave it at the beginning; we later adapted it to the Continual Hyperparameter Search defined in "Class-incremental learning: survey and performance evaluation on image classification" and in "A continual learning survey: Defying forgetting in classification tasks". We plan on changing the name since I agree that it is confusing. In short, it allows choosing the main hyperparameter related to stability-plasticity (aka intransigence-forgetting) at each task without knowledge of future tasks.

How can one set the scenario in which, for CIFAR-100, we start with 50 classes, have a step size of 10, and there is either fixed or growing memory, or there is access to the full dataset during rehearsal (i.e. retraining)?

You would use --datasets cifar100 with --nc-first-task 50 --num-tasks 6 (instead of a step size, you define the number of tasks: 50-10-10-10-10-10). For fixed memory you would use --num-exemplars X, and for growing memory --num-exemplars-per-class X. To have access to all data, you can check the joint.py baseline.
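
For example (the exemplar counts 2000 and 20 are placeholder values, not recommendations), fixed memory:

python3 -u src/main_incremental.py --datasets cifar100 --nc-first-task 50 --num-tasks 6 --num-exemplars 2000

or growing memory:

python3 -u src/main_incremental.py --datasets cifar100 --nc-first-task 50 --num-tasks 6 --num-exemplars-per-class 20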

wzjoriv commented 3 years ago

Thanks for your answers

snokh commented 3 years ago

I am successfully able to run your code on WSL2 on Windows 11. Thanks!

VicenteAlex commented 1 year ago

In case this is useful to anyone: I found the same error on Windows, and it was fixed by forcing the targets to be .long().
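
A minimal sketch of that fix, based on the traceback above (the exact line may differ in newer versions of the repo), in train_epoch() of src/approach/incremental_learning.py:

    # before (line 171 in the traceback above):
    #     loss = self.criterion(t, outputs, targets.to(self.device))
    # after: cast the labels to int64 so cross_entropy receives the LongTensor it expects on Windows
    loss = self.criterion(t, outputs, targets.to(self.device).long())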