kamenbliznashki / chexpert

CheXpert competition models -- attention augmented convolutions on DenseNet, ResNet; EfficientNet
MIT License

Trying to reproduce results (Densenet121) #1

Open drcdr opened 4 years ago

drcdr commented 4 years ago

I'm trying to recreate the results you achieved on the CheXpert dataset.
I ran the following command line 3 times (I hard-coded the data path but didn't change any other defaults): `$ python chexpert.py --train --model densenet121 --pretrained --cuda 0`

and got the following results.

| Run | AUC 0 | AUC 1 | AUC 2 | AUC 3 | AUC 4 | Loss 0 | Loss 1 | Loss 2 | Loss 3 | Loss 4 |
|-----|-------|-------|-------|-------|-------|--------|--------|--------|--------|--------|
| 1   | 0.8055 | 0.8198 | 0.9071 | 0.9151 | 0.9435 | 0.5102 | 0.5803 | 0.3080 | 0.2988 | 0.3024 |
| 2   | 0.8035 | 0.8007 | 0.8774 | 0.9286 | 0.9277 | 0.5155 | 0.6254 | 0.3222 | 0.2837 | 0.3206 |
| 3   | 0.8076 | 0.7927 | 0.8788 | 0.9193 | 0.9407 | 0.5195 | 0.6134 | 0.3152 | 0.2882 | 0.3107 |
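
For reference, the per-column AUCs presumably correspond to the five competition pathologies listed in the README table further down. A minimal sketch of how such per-pathology AUROCs can be computed with scikit-learn (not the repo's actual evaluation code; `probs` and `labels` are placeholder arrays):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative only -- not chexpert.py's evaluation code.
# labels/probs: (n_samples, 5) arrays of binary targets and sigmoid outputs,
# one column per competition pathology.
PATHOLOGIES = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Pleural Effusion"]

def per_class_auc(labels: np.ndarray, probs: np.ndarray) -> dict:
    return {name: roc_auc_score(labels[:, i], probs[:, i])
            for i, name in enumerate(PATHOLOGIES)}
```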

To do ensemble evaluation, I then ran a command like this for each of the 3 runs: `$ python chexpert.py --evaluate_ensemble --model densenet121 --output_dir ./results --restore ./results/2020-02-13_02-37-23/ --plot_roc`

and got the following results (the three rows in the bottom table are from the README.md, included for ease of comparison):

| Model | AUC 0 | AUC 1 | AUC 2 | AUC 3 | AUC 4 | Loss 0 | Loss 1 | Loss 2 | Loss 3 | Loss 4 |
|-------|-------|-------|-------|-------|-------|--------|--------|--------|--------|--------|
| d121_pre | 0.8191 | 0.8003 | 0.8821 | 0.8943 | 0.9281 | 0.5117 | 0.5911 | 0.3207 | 0.3241 | 0.3149 |
| d121_pre | 0.7784 | 0.8580 | 0.8902 | 0.9093 | 0.9289 | 0.5315 | 0.5610 | 0.3174 | 0.3376 | 0.3123 |
| d121_pre | 0.8077 | 0.7580 | 0.9068 | 0.9228 | 0.9432 | 0.5125 | 0.5728 | 0.3094 | 0.2886 | 0.2886 |

| Model | Atel. | Cardo. | Consol. | Edema | PlEff |
|-------|-------|--------|---------|-------|-------|
| d121_base | 0.8470 | 0.8450 | 0.9120 | 0.9050 | 0.9380 |
| d121_pre | 0.8470 | 0.8590 | 0.9000 | 0.9360 | 0.9400 |
| d121_ataug | 0.8530 | 0.8380 | 0.9150 | 0.8690 | 0.9130 |

My ensemble evaluation results are generally lower than those on the README.md page for densenet121_pretrained, or at least the variation in the results seems high. For instance, I never got close to 0.85 for Atelectasis.

Did you use any special non-default parameters for training? Did you report the results of a single training run, or e.g. the best of multiple training runs? Did you also see high variation across different training/evaluation runs?
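
For what it's worth, the usual way I know to pin run-to-run randomness in a PyTorch script looks roughly like the sketch below (not necessarily what chexpert.py does; `set_seed` is just an illustrative helper):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Pin the usual sources of randomness in a PyTorch run (sketch only)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade some speed for reproducibility.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```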

kamenbliznashki commented 4 years ago

Hi - Sorry for the late reply.

I have saved the eval-results json files, but for space reasons I only kept the best checkpoints on my computer, so I cannot rerun the ensemble results now.

To your questions -- I don't recall running non-default params on the densenet baseline and pretrained models; I looked at the config.json and it matches the default args. I only reported the results of a single run (fixed seed), ensembled over the best 10 checkpoints along the training. I do remember seeing high variation and being far from the original Stanford results, which could come down to data augmentation and preprocessing, ensembling over seeds, batch generation and within-batch label distribution, etc.
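
To make "ensemble over the best 10 checkpoints" concrete, a rough sketch of checkpoint ensembling (assuming hypothetical `build_model` and `predict_probs` helpers standing in for the repo's model constructor and eval loop, and checkpoints saved as plain state dicts):

```python
import torch

def ensemble_predict(checkpoint_paths, build_model, predict_probs, loader, device="cuda"):
    """Average sigmoid outputs over the best-k checkpoints (illustrative sketch).

    build_model / predict_probs are hypothetical stand-ins for the repo's
    model constructor and evaluation loop.
    """
    avg_probs = None
    for path in checkpoint_paths:
        model = build_model().to(device)
        state = torch.load(path, map_location=device)
        model.load_state_dict(state)  # assumes the file stores a plain state_dict
        model.eval()
        with torch.no_grad():
            probs = predict_probs(model, loader)  # (n_samples, 5) sigmoid outputs
        avg_probs = probs if avg_probs is None else avg_probs + probs
    return avg_probs / len(checkpoint_paths)
```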

I am attaching the json files for the baseline (bl) and pretrained densenet121 models -- specifically the checkpoint tracker and the eval results over training. You can compare the checkpoint-tracker loss and average AUC against your results and see whether the ensembling is using very different underlying checkpoints. You can also compare the eval results over the course of training against yours.
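
For instance, the attached json can be inspected with something as simple as the sketch below (the file name here is a guess at what's inside the zip; check the actual names and structure first):

```python
import json

# Sketch for inspecting the attached json files; the file name below is a
# guess -- check the actual names inside the zip.
with open("checkpoint_tracker.json") as f:
    tracker = json.load(f)

# Pretty-print whatever structure the tracker uses.
print(json.dumps(tracker, indent=2))
```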

Sorry I am not able to diagnose this better; it's been a while and I ran a ton of models on this, so it's hard to remember the variations.

Best,
Kamen

densenet121_baseline_and_pretrained.zip
