Can't reproduce results reported in the paper

caiopetruccirosa commented 1 month ago

Hello! First of all, thank you very much for the paper. I found it highly interesting and relevant for the future of this field, especially since there aren't many standardized benchmarks for age estimation today.

I have been trying to reproduce the results reported in Table 6 of the paper, but I can't achieve similar performance. I ran the 5 benchmarks defined in the facebase/configs directory, and my results are as follows:

AgeDB_256x256: 10.74 MAE on AgeDB test splits, when the reported result is 7.20 MAE.
CACD2000_256x256: 8.43 MAE on CACD2000 test splits, when the reported result is 4.59 MAE.
CLAP2016_256x256: 11.08 MAE on CLAP2016 test splits, when the reported result is 5.96 MAE.
MORPH_256x256: 7.04 MAE on MORPH test splits, when the reported result is 2.96 MAE.
UTKFace_256x256: 11.90 MAE on UTKFace test splits, when the reported result is 4.75 MAE.
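
The size of each gap can be tabulated with a short script. This is a minimal sketch: the MAE values are copied from the list above, while the dictionary layout and the `mae_gaps` helper are illustrative names, not part of the repository.

```python
# MAE values copied from the list above: (reproduced, reported in Table 6).
results = {
    "AgeDB_256x256": (10.74, 7.20),
    "CACD2000_256x256": (8.43, 4.59),
    "CLAP2016_256x256": (11.08, 5.96),
    "MORPH_256x256": (7.04, 2.96),
    "UTKFace_256x256": (11.90, 4.75),
}


def mae_gaps(results):
    """Return {benchmark: (absolute gap in MAE, relative gap in percent)}."""
    gaps = {}
    for name, (reproduced, reported) in results.items():
        absolute = reproduced - reported
        relative = 100.0 * absolute / reported
        gaps[name] = (round(absolute, 2), round(relative, 1))
    return gaps


for name, (absolute, relative) in mae_gaps(results).items():
    print(f"{name}: +{absolute:.2f} MAE ({relative:.1f}% above reported)")
```

A gap of this size on every benchmark points to a systematic difference (for example, in the configuration) rather than run-to-run noise.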

Obs 1: I am assuming that each benchmark defined in the facebase/configs directory corresponds to the "Cross-Entropy" sub-row and "pre-trained on ImageNet" sub-column of Table 6.

Obs 2: I am running the environment inside a Docker container, and kept almost everything as described inside the environment.yaml file.

You can find all the evaluation reports for the benchmarks in this Google Drive folder. Also, to ensure reproducibility, I’ve been working with a fork I made.

From my analysis of the train/validation loss curves and test results, the models defined in the configs seem to be overfitting on all benchmarks. This led me to suspect that some hyperparameters, such as the learning rate, might not be correct. I created and ran an alternative benchmark (facebase/configs/MORPH_256x256_lr1em4), in which I changed the learning rate from 1e-3 to 1e-4. This adjustment improved the MAE on the MORPH test split from 7.04 to 5.87, but that is still far from the reported 2.96.

@paplhjak, could you please provide any insights or suggestions to help resolve this?

Thanks in advance!

paplhjak commented 1 month ago

That is strange. I will re-run some of the experiments and get back to you with what I get.

Could you provide training and validation error graphs over the course of the training?

caiopetruccirosa commented 1 month ago

Thank you very much!

Here are the training graphs from WandB that I obtained by running the MORPH benchmark:

[Two WandB screenshots, taken 2024-09-20 at 14:53:16 and 14:54:10]

I will try to export all data from WandB and put it in a Google Drive folder, so that I can share it with you here.

paplhjak commented 1 month ago

I have some more pressing things to work on this week, but if all goes well, you can expect me to post my results here next week.

caiopetruccirosa commented 1 month ago

Ok! Thanks a lot. Last Friday I noticed that the Google Drive folder I shared here was somehow empty. I have now put all the evaluation reports there.

paplhjak commented 1 month ago

The provided configuration files serve as an example of how to use the repository but are not the ones for which we report results in the paper.

I apologize, this should have been caught before. Thank you for noticing this.

Once all the experiments are finished, I will upload the training runs here and provide the correct configuration files.

caiopetruccirosa commented 1 month ago

Ok. Thanks a lot for looking into this!

I am looking forward to running further experiments with these new configuration files. :)

paplhjak commented 1 month ago

Please see the attached evaluation of the experiments. The file does not include the trained models because GitHub enforces a limit on attachment size. I will update the repository to include the correct configuration files and add a link to download the weights of a model pretrained on IMDB wiki.

evaluation.zip configs.zip

AFAD imagenet:

Reproduced: 3.16 +- (0.03)
Reported: 3.17

AgeDB imagenet:

Reproduced: 7.06 +- (0.23)
Reported: 7.20

ChaLearn random:

Reproduced: 9.68
Reported: 8.73

ChaLearn imagenet:

Reproduced: 6.12
Reported: 5.96

ChaLearn pretrained (resnet50 pretrained on IMDB):

Reproduced: 4.53
Reported: 4.49

MORPH imagenet:

Reproduced: 2.96 +- (0.05)
Reported: 2.96

UTKFace imagenet:

Reproduced: 4.72 +- (0.09)
Reported: 4.75
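
The comparison above can be checked mechanically. This is a minimal sketch under one assumption that the comment does not state: the value in parentheses after "+-" is treated as a standard deviation across runs or splits. The `compare` helper and the data layout are illustrative, not part of the repository.

```python
# (reproduced MAE, spread or None when not given, reported MAE),
# copied from the comment above.
runs = {
    "AFAD imagenet": (3.16, 0.03, 3.17),
    "AgeDB imagenet": (7.06, 0.23, 7.20),
    "ChaLearn random": (9.68, None, 8.73),
    "ChaLearn imagenet": (6.12, None, 5.96),
    "ChaLearn pretrained": (4.53, None, 4.49),
    "MORPH imagenet": (2.96, 0.05, 2.96),
    "UTKFace imagenet": (4.72, 0.09, 4.75),
}


def compare(runs):
    """Return {name: (reproduced - reported, within-one-spread flag or None)}."""
    rows = {}
    for name, (reproduced, spread, reported) in runs.items():
        diff = reproduced - reported
        # Without a spread there is nothing to compare the difference against.
        within = None if spread is None else abs(diff) <= spread
        rows[name] = (round(diff, 2), within)
    return rows


for name, (diff, within) in compare(runs).items():
    print(f"{name}: diff={diff:+.2f}, within one spread: {within}")
```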

paplhjak commented 1 month ago

AFAD:

AgeDB:

CLAP2016:

MORPH:

UTKFace:

caiopetruccirosa commented 1 month ago

Thank you very much!

I will try to reproduce all these results now. If anything pops up, I will make a comment here.

caiopetruccirosa commented 2 weeks ago

Hi @paplhjak,

I successfully reproduced the results for the "ResNet-50 with Cross-Entropy" method on some of the benchmarks. Thank you so much for updating the repository! :)

I'm now working on reproducing the results for the other methods evaluated in the paper. However, I still need the additional config files I requested in issue #21. Could you please provide them? It would be greatly appreciated!

The results I obtained for the UTKFace, MORPH, and CLAP2016 benchmarks, using these datasets as training data, are as follows:

Trained on UTKFace:

Test set	Random	ImageNet	IMDB
UTKFace	5.33(0.16)	4.77(0.12)	4.37(0.03)
MORPH	7.22(0.45)	6.74(0.43)	5.02(0.11)
CLAP2016	7.52(0.26)	7.80(0.41)	4.74(0.16)
AgeDB	9.31(0.20)	9.00(0.20)	6.57(0.08)
CACD2000	8.85(0.26)	9.47(0.21)	6.52(0.09)
AFAD	6.38(0.27)	6.89(0.35)	5.43(0.13)
FG-NET	7.71(0.98)	6.63(0.50)	4.94(0.19)

Trained on MORPH:

Test set	Random	ImageNet	IMDB
MORPH	3.02(0.05)	2.97(0.06)	2.81(0.02)
CLAP2016	10.53(0.15)	9.07(0.34)	6.89(0.12)
AgeDB	12.71(0.15)	11.89(0.21)	9.61(0.29)
CACD2000	10.10(0.34)	11.17(0.40)	8.60(0.34)
AFAD	9.79(0.94)	7.79(0.69)	6.64(0.31)
UTKFace	12.07(0.29)	10.85(0.41)	8.95(0.08)
FG-NET	15.47(0.95)	11.35(0.40)	9.45(0.38)

Trained on CLAP2016:

Test set	Random	ImageNet	IMDB
CLAP2016	8.30(0.00)	6.36(0.00)	4.51(0.00)
MORPH	7.52(0.00)	6.40(0.00)	4.99(0.00)
AgeDB	12.04(0.00)	11.02(0.00)	7.50(0.00)
CACD2000	9.93(0.00)	8.69(0.00)	6.80(0.00)
AFAD	6.24(0.00)	7.15(0.00)	5.89(0.00)
UTKFace	8.51(0.00)	7.50(0.00)	5.91(0.00)
FG-NET	10.18(0.00)	8.49(0.00)	5.48(0.00)
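
Tables in this format can be loaded for further comparison with a few lines of Python. This is a minimal sketch that assumes tab-separated rows and cells of the form `value(spread)`; the sample contains only the first two data rows of the UTKFace table, and `parse_cell` and `parse_table` are illustrative helpers, not part of the repository.

```python
import re

# Matches cells such as "5.33(0.16)".
CELL = re.compile(r"^\s*([0-9.]+)\(([0-9.]+)\)\s*$")


def parse_cell(cell):
    """Split 'value(spread)' into a (value, spread) pair of floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def parse_table(text):
    """Return {row label: {column name: (value, spread)}}."""
    lines = [line for line in text.split("\n") if line.strip()]
    # The first header cell (the row-label column) is skipped.
    columns = lines[0].split("\t")[1:]
    table = {}
    for line in lines[1:]:
        label, *cells = line.split("\t")
        table[label] = {
            column: parse_cell(cell) for column, cell in zip(columns, cells)
        }
    return table


sample = (
    "\tRandom\tImageNet\tIMDB\n"
    "UTKFace\t5.33(0.16)\t4.77(0.12)\t4.37(0.03)\n"
    "MORPH\t7.22(0.45)\t6.74(0.43)\t5.02(0.11)\n"
)
parsed = parse_table(sample)
print(parsed["MORPH"]["IMDB"])
```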

paplhjak commented 2 weeks ago

Hi @caiopetruccirosa, I will add them today or tomorrow.

Most of them just amount to changing the configuration file to contain a 'type' specification for the head.

E.g.:

heads:
  - tag: "age"
    type: "dldl"
    attribute: "age"
    ...

The supported types are: 'classification', 'dldl', 'dldl_v2', 'unimodal_concentrated', 'soft_labels', 'mean_variance', 'regression', 'megaage', 'orcnn', 'extended_binary_classification', 'coral'.
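
Until the config files are available, one variant per head type could be derived from a base config programmatically. This is a minimal sketch that assumes the config has already been loaded into a plain Python dict; only the keys shown in the example above are used, the fields elided by "..." are left out, and `with_head_type` is a hypothetical helper, not a function from the repository.

```python
import copy

# Head types listed in the comment above.
SUPPORTED_HEAD_TYPES = [
    "classification", "dldl", "dldl_v2", "unimodal_concentrated",
    "soft_labels", "mean_variance", "regression", "megaage", "orcnn",
    "extended_binary_classification", "coral",
]

# Assumed in-memory form of the "heads" section; the base head has no
# 'type' key, and the fields elided in the original snippet are omitted.
base_config = {"heads": [{"tag": "age", "attribute": "age"}]}


def with_head_type(config, head_type):
    """Return a copy of config with 'type' set on every head."""
    if head_type not in SUPPORTED_HEAD_TYPES:
        raise ValueError(f"unsupported head type: {head_type!r}")
    new_config = copy.deepcopy(config)  # leave the base config untouched
    for head in new_config["heads"]:
        head["type"] = head_type
    return new_config


variants = {t: with_head_type(base_config, t) for t in SUPPORTED_HEAD_TYPES}
print(variants["dldl"])
```

Deep-copying keeps the base config unchanged, so each variant can be written to its own file without affecting the others.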

paplhjak commented 2 weeks ago

Please check out #22.

caiopetruccirosa commented 2 weeks ago

Just checked #22. Thank you very much!

I will try to reproduce the rest of the results and if anything pops up, I will open another issue :)

paplhjak / Facial-Age-Estimation-Benchmark