I guess I have to work on datasets/__init__.py and add all the corresponding lines, like https://github.com/SsnL/dataset-distillation/blob/24e6958cf23bc98de20a7b2714fc76fa7d38b874/datasets/__init__.py#L12. Also, based on my architecture, I'll have to customize the data transformations.
Yep, you'll need to modify datasets/__init__.py exactly as you described. To add an architecture, you'll need to modify networks/networks.py, and if you have any layer types that aren't already covered by the existing architectures in the repo, you will also need to update the init_weights function in networks/utils.py. I think that covers everything, but let me know if you run into any problems and I can try to help debug.
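To make that concrete, here is a rough sketch of the pieces a new dataset ends up supplying: a root directory, per-channel normalization, labels, and a transform pipeline. The names here (NEWSCI_ROOT, get_newsci, ...) are made up for illustration rather than the exact structure of datasets/__init__.py (see the linked line for that), and the values mirror the options main.py prints later in this thread:

import os
from torchvision import datasets, transforms

# Illustrative constants; in the repo these live in per-dataset dicts
# (compare dataset_root, dataset_normalization, and dataset_labels in the
# options dump below).
NEWSCI_ROOT = './data/newsci'
NEWSCI_NORMALIZATION = ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # (mean, std) per channel
NEWSCI_LABELS = ('cat', 'dog', 'face')

def get_newsci(phase='train'):
    # Resize, then center-crop to the 224x224 input AlexNet expects,
    # and normalize with the per-channel statistics above.
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(*NEWSCI_NORMALIZATION),
    ])
    # ImageFolder maps each subdirectory of NEWSCI_ROOT/phase to one class
    # index, so the layout must contain exactly one folder per label.
    return datasets.ImageFolder(os.path.join(NEWSCI_ROOT, phase),
                                transform=transform)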
:) @ilia10000 is correct. Feel free to let us know if you run into issues.
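For the architecture side, the init_weights update amounts to adding a branch for any layer type the existing code doesn't handle. A hypothetical sketch, not the repo's actual function; the 'xavier' name and the 1.0 gain correspond to the init and init_param options main.py prints:

import torch.nn as nn

def init_weights_for(module, init='xavier', init_param=1.0):
    # Weight layers get the chosen initializer; xavier_normal_ with
    # gain=init_param matches the arch(AlexNet,xavier,1.0) naming below.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        if init == 'xavier':
            nn.init.xavier_normal_(module.weight, gain=init_param)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    # Normalization layers are typically reset to identity.
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# Usage: net.apply(init_weights_for) walks every submodule of a network.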
So, I have RGB images, which I crop to 224 by 224 and feed to AlexNet, but I got RuntimeError: CUDA error: device-side assert triggered.
python main.py --mode distill_basic --dataset newsci --arch AlexNet
INFO:root:Logging to ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/output.log
WARNING:root:Log file already exists, will append
2020-11-25 21:22:02 [INFO ] ======================================== 2020-11-25 21:22:02 ========================================
2020-11-25 21:22:02 [INFO ] Base directory is ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)
2020-11-25 21:22:02 [INFO ] Options:
2020-11-25 21:22:02 [INFO ] arch: AlexNet
2020-11-25 21:22:02 [INFO ] attack_class: 0
2020-11-25 21:22:02 [INFO ] base_seed: 1
2020-11-25 21:22:02 [INFO ] batch_size: 1024
2020-11-25 21:22:02 [INFO ] checkpoint_interval: 10
2020-11-25 21:22:02 [INFO ] dataset: newsci
2020-11-25 21:22:02 [INFO ] dataset_labels: !!python/tuple
2020-11-25 21:22:02 [INFO ] - cat
2020-11-25 21:22:02 [INFO ] - dog
2020-11-25 21:22:02 [INFO ] - face
2020-11-25 21:22:02 [INFO ] dataset_normalization: !!python/tuple
2020-11-25 21:22:02 [INFO ] - !!python/tuple
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - !!python/tuple
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] - 0.5
2020-11-25 21:22:02 [INFO ] dataset_root: ./data/newsci
2020-11-25 21:22:02 [INFO ] decay_epochs: 40
2020-11-25 21:22:02 [INFO ] decay_factor: 0.5
2020-11-25 21:22:02 [INFO ] device_id: 0
2020-11-25 21:22:02 [INFO ] distill_epochs: 3
2020-11-25 21:22:02 [INFO ] distill_lr: 0.02
2020-11-25 21:22:02 [INFO ] distill_steps: 10
2020-11-25 21:22:02 [INFO ] distilled_images_per_class_per_step: 1
2020-11-25 21:22:02 [INFO ] distributed: false
2020-11-25 21:22:02 [INFO ] dropout: false
2020-11-25 21:22:02 [INFO ] epochs: 400
2020-11-25 21:22:02 [INFO ] expr_name_format: null
2020-11-25 21:22:02 [INFO ] image_dpi: 80
2020-11-25 21:22:02 [INFO ] init: xavier
2020-11-25 21:22:02 [INFO ] init_param: 1.0
2020-11-25 21:22:02 [INFO ] input_size: 224
2020-11-25 21:22:02 [INFO ] log_file: ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/output.log
2020-11-25 21:22:02 [INFO ] log_interval: 100
2020-11-25 21:22:02 [INFO ] log_level: INFO
2020-11-25 21:22:02 [INFO ] lr: 0.01
2020-11-25 21:22:02 [INFO ] mode: distill_basic
2020-11-25 21:22:02 [INFO ] model_dir: ./models/
2020-11-25 21:22:02 [INFO ] model_subdir_format: null
2020-11-25 21:22:02 [INFO ] n_nets: 1
2020-11-25 21:22:02 [INFO ] nc: 3
2020-11-25 21:22:02 [INFO ] no_log: false
2020-11-25 21:22:02 [INFO ] num_classes: 3
2020-11-25 21:22:02 [INFO ] num_workers: 8
2020-11-25 21:22:02 [INFO ] phase: train
2020-11-25 21:22:02 [INFO ] results_dir: ./results/
2020-11-25 21:22:02 [INFO ] sample_n_nets: 1
2020-11-25 21:22:02 [INFO ] source_dataset: null
2020-11-25 21:22:02 [INFO ] start_time: '2020-11-25 21:22:02'
2020-11-25 21:22:02 [INFO ] target_class: 1
2020-11-25 21:22:02 [INFO ] test_batch_size: 1024
2020-11-25 21:22:02 [INFO ] test_distill_epochs: null
2020-11-25 21:22:02 [INFO ] test_distilled_images: loaded
2020-11-25 21:22:02 [INFO ] test_distilled_lrs:
2020-11-25 21:22:02 [INFO ] - loaded
2020-11-25 21:22:02 [INFO ] test_n_nets: 1
2020-11-25 21:22:02 [INFO ] test_n_runs: 1
2020-11-25 21:22:02 [INFO ] test_name_format: null
2020-11-25 21:22:02 [INFO ] test_nets_type: unknown_init
2020-11-25 21:22:02 [INFO ] test_niter: 1
2020-11-25 21:22:02 [INFO ] test_optimize_n_nets: 20
2020-11-25 21:22:02 [INFO ] test_optimize_n_runs: null
2020-11-25 21:22:02 [INFO ] train_nets_type: unknown_init
2020-11-25 21:22:02 [INFO ] world_rank: 0
2020-11-25 21:22:02 [INFO ] world_size: 1
2020-11-25 21:22:02 [INFO ]
/home/ericd/dataset-distillation/base_options.py:422: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
old_yaml = yaml.load(f) # this is a dict
2020-11-25 21:22:02 [WARNING] ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/opt.yaml already exists, moved to ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/old_opts/opt_2020_11_25__21_12_22.yaml
2020-11-25 21:22:05 [INFO ] train dataset size: 2308
2020-11-25 21:22:05 [INFO ] test dataset size: 601
2020-11-25 21:22:05 [INFO ] datasets built!
2020-11-25 21:22:05 [INFO ] mode: distill_basic, phase: train
2020-11-25 21:22:05 [INFO ] Build 1 AlexNet network(s) with [xavier(1.0)] init
2020-11-25 21:22:31 [INFO ] Build 1 AlexNet network(s) with [xavier(1.0)] init
2020-11-25 21:22:31 [INFO ] Train 10 steps iterated for 3 epochs
/home/ericd/anaconda3/envs/py37/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
2020-11-25 21:24:25 [INFO ] Results saved to ./results/distill_basic/newsci/arch(AlexNet,xavier,1.0)_distillLR0.02_E(400,40,0.5)_lr0.01_B1x10x3_train(unknown_init)/checkpoints/epoch0000/results.pth
2020-11-25 21:24:25 [INFO ]
2020-11-25 21:24:25 [INFO ] Begin of epoch 0 :
Begin of epoch 0 (1 unknown_init nets): 0%| | 0/2 [00:00<?, ?it/s]/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
2020-11-25 21:25:29 [ERROR] Fatal error:
2020-11-25 21:25:29 [ERROR] Traceback (most recent call last):
2020-11-25 21:25:29 [ERROR] File "main.py", line 402, in <module>
2020-11-25 21:25:29 [ERROR] main(options.get_state())
2020-11-25 21:25:29 [ERROR] File "main.py", line 131, in main
2020-11-25 21:25:29 [ERROR] steps = train_distilled_image.distill(state, state.models)
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 290, in distill
2020-11-25 21:25:29 [ERROR] return Trainer(state, models).train()
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 221, in train
2020-11-25 21:25:29 [ERROR] evaluate_steps(state, steps, 'Begin of epoch {}'.format(epoch))
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 288, in evaluate_steps
2020-11-25 21:25:29 [ERROR] res = _evaluate_steps(test_nets_desc, reset=(state.test_nets_type == 'unknown_init'))
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 276, in _evaluate_steps
2020-11-25 21:25:29 [ERROR] params = train_steps_inplace(state, models, steps, params, callback=test_callback)
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 65, in train_steps_inplace
2020-11-25 21:25:29 [ERROR] callback(i, params)
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 268, in test_callback
2020-11-25 21:25:29 [ERROR] test_loader_iter=test_loader_iter)
2020-11-25 21:25:29 [ERROR] File "/home/ericd/dataset-distillation/basics.py", line 138, in evaluate_models
2020-11-25 21:25:29 [ERROR] losses[k] += task_loss(state, output, target, reduction='sum').item() # sum up batch loss
2020-11-25 21:25:29 [ERROR] RuntimeError: CUDA error: device-side assert triggered
Begin of epoch 0 (1 unknown_init nets): 0%| | 0/2 [01:04<?, ?it/s]Traceback (most recent call last):
File "main.py", line 402, in <module>
main(options.get_state())
File "main.py", line 131, in main
steps = train_distilled_image.distill(state, state.models)
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 290, in distill
return Trainer(state, models).train()
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 221, in train
evaluate_steps(state, steps, 'Begin of epoch {}'.format(epoch))
File "/home/ericd/dataset-distillation/basics.py", line 288, in evaluate_steps
res = _evaluate_steps(test_nets_desc, reset=(state.test_nets_type == 'unknown_init'))
File "/home/ericd/dataset-distillation/basics.py", line 276, in _evaluate_steps
params = train_steps_inplace(state, models, steps, params, callback=test_callback)
File "/home/ericd/dataset-distillation/basics.py", line 65, in train_steps_inplace
callback(i, params)
File "/home/ericd/dataset-distillation/basics.py", line 268, in test_callback
test_loader_iter=test_loader_iter)
File "/home/ericd/dataset-distillation/basics.py", line 138, in evaluate_models
losses[k] += task_loss(state, output, target, reduction='sum').item() # sum up batch loss
RuntimeError: CUDA error: device-side assert triggered
Never mind, it was a .ipynb_checkpoints folder created by JupyterLab.
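For anyone who hits the same t >= 0 && t < n_classes assert: ImageFolder treats every subdirectory as a class, so a stray hidden folder shifts the label indices out of range and the loss kernel aborts. A quick way to catch it, with the path taken from the dataset_root in the log above (adjust for your split layout):

from torchvision import datasets

# ImageFolder turns every subdirectory into a class, so .ipynb_checkpoints
# silently becomes a fourth class and pushes labels outside [0, num_classes).
ds = datasets.ImageFolder('./data/newsci/train')
print(ds.classes)  # expect ['cat', 'dog', 'face'] and nothing else
hidden = [c for c in ds.classes if c.startswith('.')]
if hidden:
    raise RuntimeError('hidden folders picked up as classes: %r' % hidden)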
I may end up answering this one myself too, but now I get:
2020-11-25 22:07:03 [INFO ] Begin of epoch 0 (1 unknown_init nets) test results:
2020-11-25 22:07:03 [INFO ] STEP ACCURACY LOSS
2020-11-25 22:07:03 [INFO ] before steps 100.0000 ± nan% 1.0770 ± nan
2020-11-25 22:07:03 [INFO ] step 30 (lr=0.0200) 100.0000 ± nan% 1.0414 ± nan
2020-11-25 22:07:03 [INFO ]
2020-11-25 22:07:04 [ERROR] Fatal error:
2020-11-25 22:07:04 [ERROR] Traceback (most recent call last):
2020-11-25 22:07:04 [ERROR] File "main.py", line 402, in <module>
2020-11-25 22:07:04 [ERROR] main(options.get_state())
2020-11-25 22:07:04 [ERROR] File "main.py", line 131, in main
2020-11-25 22:07:04 [ERROR] steps = train_distilled_image.distill(state, state.models)
2020-11-25 22:07:04 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 290, in distill
2020-11-25 22:07:04 [ERROR] return Trainer(state, models).train()
2020-11-25 22:07:04 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 246, in train
2020-11-25 22:07:04 [ERROR] grad_infos.append(self.backward(model, rdata, rlabel, steps, saved))
2020-11-25 22:07:04 [ERROR] File "/home/ericd/dataset-distillation/train_distilled_image.py", line 118, in backward
2020-11-25 22:07:04 [ERROR] dw, = torch.autograd.grad(l, (params[-1],))
2020-11-25 22:07:04 [ERROR] File "/home/ericd/anaconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 157, in grad
2020-11-25 22:07:04 [ERROR] inputs, allow_unused)
2020-11-25 22:07:04 [ERROR] RuntimeError: CUDA out of memory. Tried to allocate 218.00 MiB (GPU 0; 14.73 GiB total capacity; 13.88 GiB already allocated; 113.88 MiB free; 13.90 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "main.py", line 402, in <module>
main(options.get_state())
File "main.py", line 131, in main
steps = train_distilled_image.distill(state, state.models)
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 290, in distill
return Trainer(state, models).train()
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 246, in train
grad_infos.append(self.backward(model, rdata, rlabel, steps, saved))
File "/home/ericd/dataset-distillation/train_distilled_image.py", line 118, in backward
dw, = torch.autograd.grad(l, (params[-1],))
File "/home/ericd/anaconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 157, in grad
inputs, allow_unused)
RuntimeError: CUDA out of memory. Tried to allocate 218.00 MiB (GPU 0; 14.73 GiB total capacity; 13.88 GiB already allocated; 113.88 MiB free; 13.90 GiB reserved in total by PyTorch)
Sorry @SsnL, I somehow got mixed up and thought this was my fork of your repo! @Rubiel1 I ran into the out-of-memory problems too, and reducing the number of distilled images usually fixed it for me (e.g. fewer steps, or fewer images per class per step).
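For example, something like this (hypothetical numbers, and the flag names are inferred from the option names in the log above; tune to your GPU):

python main.py --mode distill_basic --dataset newsci --arch AlexNet --distill_steps 5 --distill_epochs 1

Peak memory grows roughly with distill_steps × distill_epochs × distilled_images_per_class_per_step, because the backward pass through the distilled images has to keep the intermediate model weights of every unrolled training step around.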
OK, I will read the definitions and play with the dimensions. Thanks!
Hey, I am very interested in this work. Could you make my job easier and indicate which lines I need to customize to distill my own dataset with Xavier initialization ("random initialization" in your paper) and a particular architecture not on your list?