thulas / dac-label-noise

Label de-noising for deep learning

some questions about code #5

Open Wongcheukwai opened 4 years ago

Wongcheukwai commented 4 years ago

Hi Sunil,

I really love your work. It's novel, relatively simple, and effective. After running your code, I have a few questions:

  1. I can't figure out why the number of abstained samples is always zero before learn_epochs. The outputs on line 419 in train_dac.py have size [128, 11] even during the learning epochs, so how come the argmax is never 10 then? After the learning epochs (20), the argmax can be 10. That's really interesting.

  2. I saw your previous discussion with @pingqingsheng about how to remove all the abstained data. Can you tell me how to get their indices and remove them? Here is the command I used: `python train_dac.py --datadir ../dataset --dataset cifar10 --nesterov --net_type resnet --depth 34 -use-gpu --epochs 165 --loss_fn dac_loss --learn_epochs 20 --rand_labels 0.2 -cuda_device 0 --abst_rate 0.2 --save_train_scores`. After that, I got 165 .npy files. Which epoch's train_score should I use? And how should I deal with the [50000, 11] tensor? Take the argmax?

  3. How did you get the parenthetical numbers (especially the remaining noise level) in Table 1? For example, for CIFAR-10 with 80% symmetric noise you report a remaining noise level of just 0.16, but after using my step 2 to remove the noisy data, the fraction of correct labels in the remaining (supposedly clean) data is just 0.28, which makes the remaining noise level really high. I am really confused by that. Is there something wrong with my step 2 (the way I remove the noisy data)?

  4. Do you know of any other novel approaches to distinguishing noisy from clean data? I have tried many of them and found that a GMM works almost the best.

thulas commented 4 years ago


  1. I can't figure out why the number of abstained samples is always zero before learn_epochs. The outputs on line 419 in train_dac.py have size [128, 11] even during the learning epochs, so how come the argmax is never 10 then? After the learning epochs (20), the argmax can be 10. That's really interesting.

This is correct behavior. learn_epochs is a warm-up phase during which we train with regular cross-entropy; the abstention loss only kicks in after that, i.e., for all epochs > learn_epochs we use the abstention loss. See https://github.com/thulas/dac-label-noise/blob/678e468464daf483410480f87870226330ccd241/dac_loss.py#L52-L54

Since the ground-truth labels never include the abstention class, the argmax of the output during the warm-up phase is almost never the abstention class: cross-entropy pushes all the probability mass onto the 10 real classes, so the abstention logit stays suppressed.

Also, learn_epochs is a hyperparameter, which we set to 20 in all our experiments.
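For intuition, here is a minimal sketch of that epoch-based switch. The function and variable names are hypothetical stand-ins; the linked lines in dac_loss.py are the actual implementation:

```python
# Hedged sketch of the warm-up switch described above; names are
# hypothetical, see dac_loss.py (linked above) for the real code.
import torch.nn.functional as F

def dac_loss(outputs, labels, epoch, learn_epochs=20):
    if epoch <= learn_epochs:
        # Warm-up: plain cross-entropy over all 11 logits. Targets only take
        # values 0-9, so the abstention logit (index 10) is always pushed
        # down and is essentially never the argmax.
        return F.cross_entropy(outputs, labels)
    # epochs > learn_epochs: the abstention loss takes over (omitted here).
    raise NotImplementedError("abstention loss not sketched")
```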

  2. I saw your previous discussion with @pingqingsheng about how to remove all the abstained data. Can you tell me how to get their indices and remove them? Here is the command I used: `python train_dac.py --datadir ../dataset --dataset cifar10 --nesterov --net_type resnet --depth 34 -use-gpu --epochs 165 --loss_fn dac_loss --learn_epochs 20 --rand_labels 0.2 -cuda_device 0 --abst_rate 0.2 --save_train_scores`. After that, I got 165 .npy files. Which epoch's train_score should I use? And how should I deal with the [50000, 11] tensor? Take the argmax?

You should eliminate all the data points for which the argmax was 10 (i.e., the abstention class) at a selected epoch. The final epoch (165) might itself be a good choice, but it's difficult to say which epoch is best, since that depends, among other things, on the actual noise rate, which is not known in advance.
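Concretely, something like the following would do it (a minimal sketch; the file name is hypothetical, and I'm assuming each .npy file holds the [50000, 11] per-sample outputs for one epoch):

```python
import numpy as np

# Load the saved train scores for the chosen epoch (file name hypothetical).
scores = np.load("train_scores_epoch_165.npy")  # shape [50000, 11]
preds = scores.argmax(axis=1)                   # argmax over the 11 classes

abstained_idx = np.where(preds == 10)[0]        # samples the DAC abstained on
kept_idx = np.where(preds != 10)[0]             # samples to keep for retraining
print(f"abstained on {abstained_idx.size} of {preds.size} samples")
```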

  3. How did you get the parenthetical numbers (especially the remaining noise level) in Table 1? For example, for CIFAR-10 with 80% symmetric noise you report a remaining noise level of just 0.16, but after using my step 2 to remove the noisy data, the fraction of correct labels in the remaining (supposedly clean) data is just 0.28, which makes the remaining noise level really high. I am really confused by that. Is there something wrong with my step 2 (the way I remove the noisy data)?

See above: remove the abstained points and retrain (with regular cross-entropy) on the cleaner set. But also see the discussion here about a few important details: https://github.com/thulas/dac-label-noise/issues/1#issuecomment-531067297
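To sanity-check your step 2 against the table, you can measure the remaining noise level directly on the kept set, assuming you still have the true labels around for evaluation (a sketch with hypothetical names):

```python
import numpy as np

def remaining_noise_level(noisy_labels, true_labels, kept_idx):
    """Fraction of kept samples whose training label is still wrong."""
    noisy_labels = np.asarray(noisy_labels)
    true_labels = np.asarray(true_labels)
    still_noisy = noisy_labels[kept_idx] != true_labels[kept_idx]
    return still_noisy.mean()
```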

  4. Do you know of any other novel approaches to distinguishing noisy from clean data? I have tried many of them and found that a GMM works almost the best.

A GMM and other mixture models (such as a beta mixture) work well when the noise model is symmetric, but real-world noise is seldom symmetric and is usually feature dependent. In that setting we find the DAC performs especially well. See Section 3 of our ICML paper for details.
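For reference, the GMM approach you mention typically fits a two-component mixture on per-sample losses and treats the low-loss mode as clean, along these lines (a sketch, not code from this repo):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probability(per_sample_losses):
    """Probability that each sample is clean, from a 2-component GMM on losses."""
    losses = np.asarray(per_sample_losses).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(losses)
    clean_component = gmm.means_.argmin()       # low-loss component = clean
    return gmm.predict_proba(losses)[:, clean_component]
```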

Wongcheukwai commented 4 years ago

Thank you for your detailed reply. Can you tell me how to run asymmetric CIFAR-10? Which args should I set?

thulas commented 4 years ago

Use the label_flip.py script inside the utils directory to generate a class-dependent label corruption. See Appendix C in our ICML paper (https://arxiv.org/pdf/1905.10964.pdf) for additional details. Once you have generated the flipped labels, re-run the experiment with the --label_noise_info argument, as in: `python train_dac.py --label_noise_info <flipped_label_pickle_file> [other arguments as before]`
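In case it helps, class-dependent flipping usually looks something like the sketch below. The flip map is the common asymmetric CIFAR-10 choice from the label-noise literature, not necessarily what label_flip.py uses, and all names are hypothetical:

```python
import numpy as np

# Common asymmetric CIFAR-10 flips from the literature (may differ from label_flip.py):
# truck->automobile, bird->airplane, deer->horse, cat<->dog
FLIP_MAP = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}

def flip_labels(labels, noise_rate, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    flipped = labels.copy()
    for src, dst in FLIP_MAP.items():
        idx = np.where(labels == src)[0]          # index into the ORIGINAL labels
        mask = rng.random(idx.size) < noise_rate  # flip each with prob. noise_rate
        flipped[idx[mask]] = dst
    return flipped
```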