I guess I have to work on datasets/__init__.py and add all the corresponding lines, like https://github.com/SsnL/dataset-distillation/blob/24e6958cf23bc98de20a7b2714fc76fa7d38b874/datasets/__init__.py#L12. Also, based on my architecture, I'll have to customize the data transformations.
Yep, you'll need to modify datasets/__init__.py exactly as you described. To add an architecture, you'll need to modify networks/networks.py, and if you have any layer types that aren't already covered by the existing architectures in the repo, you will also need to update the init_weights function in networks/utils.py. I think that covers everything, but let me know if you run into any problems and I can try to help debug.
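To make that concrete, here is a rough sketch of the pieces a new dataset ends up supplying: a root directory, per-channel normalization, labels, and a transform pipeline. The names here (NEWSCI_ROOT, get_newsci, ...) are made up for illustration rather than the exact structure of datasets/__init__.py (see the linked line for that), and the values mirror the options main.py prints later in this thread:

import os
from torchvision import datasets, transforms

# Illustrative constants; in the repo these live in per-dataset dicts
# (compare dataset_root, dataset_normalization, and dataset_labels in the
# options dump below).
NEWSCI_ROOT = './data/newsci'
NEWSCI_NORMALIZATION = ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # (mean, std) per channel
NEWSCI_LABELS = ('cat', 'dog', 'face')

def get_newsci(phase='train'):
    # Resize, then center-crop to the 224x224 input AlexNet expects,
    # and normalize with the per-channel statistics above.
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(*NEWSCI_NORMALIZATION),
    ])
    # ImageFolder maps each subdirectory of NEWSCI_ROOT/phase to one class
    # index, so the layout must contain exactly one folder per label.
    return datasets.ImageFolder(os.path.join(NEWSCI_ROOT, phase),
                                transform=transform)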
:) @ilia10000 is correct. Feel free to let us know if you run into issues.
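For the architecture side, the init_weights update amounts to adding a branch for any layer type the existing code doesn't handle. A hypothetical sketch, not the repo's actual function; the 'xavier' name and the 1.0 gain correspond to the init and init_param options main.py prints:

import torch.nn as nn

def init_weights_for(module, init='xavier', init_param=1.0):
    # Weight layers get the chosen initializer; xavier_normal_ with
    # gain=init_param matches the arch(AlexNet,xavier,1.0) naming below.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        if init == 'xavier':
            nn.init.xavier_normal_(module.weight, gain=init_param)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    # Normalization layers are typically reset to identity.
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# Usage: net.apply(init_weights_for) walks every submodule of a network.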
So, I have RGB images, which I crop to 224 by 224 and feed to AlexNet, but I got RuntimeError: CUDA error: device-side assert triggered.
python main.py --mode distill_basic --dataset newsci --arch AlexNet
INFO:root:Logging to ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/output.log
WARNING:root:Log file already exists, will append
2020-11-25 21:22:02 [INFO ] ======================================== 2020-11-25 21:22:02 ========================================
2020-11-25 21:22:02 [INFO ] Base directory is ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)
2020-11-25 21:22:02 [INFO ] Options:
2020-11-25 21:22:02 [INFO ] arch: AlexNet
2020-11-25 21:22:02 [INFO ] attack_class: 0
2020-11-25 21:22:02 [INFO ] base_seed: 1
2020-11-25 21:22:02 [INFO ] batch_size: 1024
2020-11-25 21:22:02 [INFO ] checkpoint_interval: 10
2020-11-25 21:22:02 [INFO ] dataset: newsci
2020-11-25 21:22:02 [INFO ] dataset_labels: !!python/tuple
2020-11-25 21:22:02 [INFO ] - cat
2020-11-25 21:22:02 [INFO ] - dog
2020-11-25 21:22:02 [INFO ] - face
2020-11-25 21:22:02 [INFO ] dataset_normalization: !!python/tuple
2020-11-25 21:22:02 [INFO ] - !!python/tuple
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - !!python/tuple
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] dataset_root: ./data/newsci
2020-11-25 21:22:02 [INFO ] decay_epochs: 40
2020-11-25 21:22:02 [INFO ] decay_factor: 0.5
2020-11-25 21:22:02 [INFO ] device_id: 0
2020-11-25 21:22:02 [INFO ] distill_epochs: 3
2020-11-25 21:22:02 [INFO ] distill_lr: 0.02
2020-11-25 21:22:02 [INFO ] distill_steps: 10
2020-11-25 21:22:02 [INFO ] distilled_images_per_class_per_step: 1
2020-11-25 21:22:02 [INFO ] distributed: false
2020-11-25 21:22:02 [INFO ] dropout: false
2020-11-25 21:22:02 [INFO ] epochs: 400
2020-11-25 21:22:02 [INFO ] expr_name_format: null
2020-11-25 21:22:02 [INFO ] image_dpi: 80
2020-11-25 21:22:02 [INFO ] init: xavier
2020-11-25 21:22:02 [INFO ] init_param: 1.0
2020-11-25 21:22:02 [INFO ] input_size: 224
2020-11-25 21:22:02 [INFO ] log_file: ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/output.log
2020-11-25 21:22:02 [INFO ] log_interval: 100
2020-11-25 21:22:02 [INFO ] log_level: INFO
2020-11-25 21:22:02 [INFO ] lr: 0.01
2020-11-25 21:22:02 [INFO ] mode: distill_basic
2020-11-25 21:22:02 [INFO ] model_dir: ./models/
2020-11-25 21:22:02 [INFO ] model_subdir_format: null
2020-11-25 21:22:02 [INFO ] n_nets: 1
2020-11-25 21:22:02 [INFO ] nc: 3
2020-11-25 21:22:02 [INFO ] no_log: false
2020-11-25 21:22:02 [INFO ] num_classes: 3
2020-11-25 21:22:02 [INFO ] num_workers: 8
2020-11-25 21:22:02 [INFO ] phase: train
2020-11-25 21:22:02 [INFO ] results_dir: ./results/
2020-11-25 21:22:02 [INFO ] sample_n_nets: 1
2020-11-25 21:22:02 [INFO ] source_dataset: null
2020-11-25 21:22:02 [INFO ] start_time: '2020-11-25 21:22:02'
2020-11-25 21:22:02 [INFO ] target_class: 1
2020-11-25 21:22:02 [INFO ] test_batch_size: 1024
2020-11-25 21:22:02 [INFO ] test_distill_epochs: null
2020-11-25 21:22:02 [INFO ] test_distilled_images: loaded
2020-11-25 21:22:02 [INFO ] test_distilled_lrs:
2020-11-25 21:22:02 [INFO ] - loaded
2020-11-25 21:22:02 [INFO ] test_n_nets: 1
2020-11-25 21:22:02 [INFO ] test_n_runs: 1
2020-11-25 21:22:02 [INFO ] test_name_format: null
2020-11-25 21:22:02 [INFO ] test_nets_type: unknown_init
2020-11-25 21:22:02 [INFO ] test_niter: 1
2020-11-25 21:22:02 [INFO ] test_optimize_n_nets: 20
2020-11-25 21:22:02 [INFO ] test_optimize_n_runs: null
2020-11-25 21:22:02 [INFO ] train_nets_type: unknown_init
2020-11-25 21:22:02 [INFO ] world_rank: 0
2020-11-25 21:22:02 [INFO ] world_size: 1
2020-11-25 21:22:02 [INFO ]
/home/ericd/dataset-distillation/base_options.py:422: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
old_yaml = yaml.load(f) # this is a dict
2020-11-25 21:22:02 [WARNING] ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/opt.yaml already exists, moved to ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/old_opts/opt_2020_11_25__21_12_22.yaml
2020-11-25 21:22:05 [INFO ] train dataset size: 2308
2020-11-25 21:22:05 [INFO ] test dataset size: 601
2020-11-25 21:22:05 [INFO ] datasets built!
2020-11-25 21:22:05 [INFO ] mode: distill_basic, phase: train
2020-11-25 21:22:05 [INFO ] Build 1 AlexNet network(s) with [xavier(1.0)] init
2020-11-25 21:22:31 [INFO ] Build 1 AlexNet network(s) with [xavier(1.0)] init
2020-11-25 21:22:31 [INFO ] Train 10 steps iterated for 3 epochs
/home/ericd/anaconda3/envs/py37/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
2020-11-25 21:24:25 [INFO ] Results saved to ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/checkpoints/epoch0000/results.pth
2020-11-25 21:24:25 [INFO ]
2020-11-25 21:24:25 [INFO ] Begin of epoch 0 :
Begin of epoch 0 (1 unknown_init nets): 0%| | 0/2 [00:00<?, ?it/s]/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
2020-11-25 21:25:29 [ERROR] Fatal error:
2020-11-25 21:25:29 [ERROR] Traceback (most recent call last):
2020-11-25 21:25:29 [ERROR] File "main.py", line 402, in <module>
2020-11-25 21:25:29 [ERROR] main(options.get_state())
2020-11-25 21:25:29 [ERROR] File "main.py", line 131, in main
2020-11-25 21:25:29 [ERROR] steps = train_distilled_image.distill(state, state.models)
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 290, in distill
2020-11-25 21:25:29 [ERROR] return Trainer(state, models).train()
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 221, in train
2020-11-25 21:25:29 [ERROR] evaluate_steps(state, steps, 'Begin of epoch {}'.format(epoch))
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 288, in evaluate_steps
2020-11-25 21:25:29 [ERROR] res = _evaluate_steps(test_nets_desc, reset=(state.test_nets_type == 'unknown_init'))
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 276, in _evaluate_steps
2020-11-25 21:25:29 [ERROR] params = train_steps_inplace(state, models, steps, params, callback=test_callback)
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 65, in train_steps_inplace
2020-11-25 21:25:29 [ERROR] callback(i, params)
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 268, in test_callback
2020-11-25 21:25:29 [ERROR] test_loader_iter=test_loader_iter)
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 138, in evaluate_models
2020-11-25 21:25:29 [ERROR] losses[k] += task_loss(state, output, target, reduction='sum').item() # sum up batch loss
2020-11-25 21:25:29 [ERROR] RuntimeError: CUDA error: device-side assert triggered
Begin of epoch 0 (1 unknown_init nets): 0%| | 0/2 [01:04<?, ?it/s]Traceback (most recent call last):
File "main.py", line 402, in <module>
main(options.get_state())
File "main.py", line 131, in main
steps = train_distilled_image.distill(state, state.models)
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 290, in distill
return Trainer(state, models).train()
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 221, in train
evaluate_steps(state, steps, 'Begin of epoch {}'.format(epoch))
File "/home/ericd/dataset-distillation/basics.py", line 288, in evaluate_steps
res = _evaluate_steps(test_nets_desc, reset=(state.test_nets_type == 'unknown_init'))
File "/home/ericd/dataset-distillation/basics.py", line 276, in _evaluate_steps
params = train_steps_inplace(state, models, steps, params, callback=test_callback)
File "/home/ericd/dataset-distillation/basics.py", line 65, in train_steps_inplace
callback(i, params)
File "/home/ericd/dataset-distillation/basics.py", line 268, in test_callback
test_loader_iter=test_loader_iter)
File "/home/ericd/dataset-distillation/basics.py", line 138, in evaluate_models
losses[k] += task_loss(state, output, target, reduction='sum').item() # sum up batch loss
RuntimeError: CUDA error: device-side assert triggered
Never mind, it was a .ipynb_checkpoints folder created by JupyterLab.
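For anyone who hits the same t >= 0 && t < n_classes assert: ImageFolder treats every subdirectory as a class, so a stray hidden folder shifts the label indices out of range and the loss kernel aborts. A quick way to catch it, with the path taken from the dataset_root in the log above (adjust for your split layout):

from torchvision import datasets

# ImageFolder turns every subdirectory into a class, so .ipynb_checkpoints
# silently becomes a fourth class and pushes labels outside [0, num_classes).
ds = datasets.ImageFolder('./data/newsci/train')
print(ds.classes)  # expect ['cat', 'dog', 'face'] and nothing else
hidden = [c for c in ds.classes if c.startswith('.')]
if hidden:
    raise RuntimeError('hidden folders picked up as classes: %r' % hidden)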
I may end up answering this one myself too, but now I get:
2020-11-25 22:07:03 [INFO ] Begin of epoch 0 (1 unknown_init nets) test results:
2020-11-25 22:07:03 [INFO ] STEP ACCURACY LOSS
2020-11-25 22:07:03 [INFO ] before steps 100.0000 ± nan% 1.0770 ± nan
2020-11-25 22:07:03 [INFO ] step 30 (lr=0.0200) 100.0000 ± nan% 1.0414 ± nan
2020-11-25 22:07:03 [INFO ]
2020-11-25 22:07:04 [ERROR] Fatal error:
2020-11-25 22:07:04 [ERROR] Traceback (most recent call last):
2020-11-25 22:07:04 [ERROR] File "main.py", line 402, in <module>
2020-11-25 22:07:04 [ERROR] main(options.get_state())
2020-11-25 22:07:04 [ERROR] File "main.py", line 131, in main
2020-11-25 22:07:04 [ERROR] steps = train_distilled_image.distill(state, state.models)
2020-11-25 22:07:04 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 290, in distill
2020-11-25 22:07:04 [ERROR] return Trainer(state, models).train()
2020-11-25 22:07:04 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 246, in train
2020-11-25 22:07:04 [ERROR] grad_infos.append(self.backward(model, rdata, rlabel, steps, saved))
2020-11-25 22:07:04 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 118, in backward
2020-11-25 22:07:04 [ERROR] dw, = torch.autograd.grad(l, (params[-1],))
2020-11-25 22:07:04 [ERROR] File "/home/ericd/anaconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 157, in grad
2020-11-25 22:07:04 [ERROR] inputs, allow_unused)
2020-11-25 22:07:04 [ERROR] RuntimeError: CUDA out of memory. Tried to allocate 218.00 MiB (GPU 0; 14.73 GiB total capacity; 13.88 GiB already allocated; 113.88 MiB free; 13.90 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "main.py", line 402, in <module>
main(options.get_state())
File "main.py", line 131, in main
steps = train_distilled_image.distill(state, state.models)
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 290, in distill
return Trainer(state, models).train()
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 246, in train
grad_infos.append(self.backward(model, rdata, rlabel, steps, saved))
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 118, in backward
dw, = torch.autograd.grad(l, (params[-1],))
File "/home/ericd/anaconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 157, in grad
inputs, allow_unused)
RuntimeError: CUDA out of memory. Tried to allocate 218.00 MiB (GPU 0; 14.73 GiB total capacity; 13.88 GiB already allocated; 113.88 MiB free; 13.90 GiB reserved in total by PyTorch)
Sorry @SsnL, I somehow got mixed up and thought this was my fork of your repo! @Rubiel1 I ran into the out-of-memory problems too, and reducing the number of distilled images usually fixed it for me (e.g. fewer steps, or fewer images per class per step).
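For example, something like this (hypothetical numbers, and the flag names are inferred from the option names in the log above; tune to your GPU):

python main.py --mode distill_basic --dataset newsci --arch AlexNet --distill_steps 5 --distill_epochs 1

Peak memory grows roughly with distill_steps × distill_epochs × distilled_images_per_class_per_step, because the backward pass through the distilled images has to keep the intermediate model weights of every unrolled training step around.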
OK, I will read the definitions and play with the dimensions. Thanks!
Hey, I am very interested in this work. Could you make my job easier and indicate which lines I need to customize to distill my own dataset with Xavier initialization ("random initialization" in your paper) and a particular architecture not on your list?