Problems replicating RRG

Hello!

I am studying radiology report generation and find this repository fascinating. I have carefully read the paper "Improving the Factual Correctness of Radiology Report Generation with Semantic Reward" and have been trying to replicate its results for a few months now, but RL training fails to improve the radgraph and bertscore values as much as reported in the paper or in the ViLMedic documentation. After 30 epochs, the highest values I obtain are:

RougeL value = tensor(0.2585)
f1cXb value = tensor(0.5800)
bertscore value = tensor(0.5421)
radgraph value = tensor(0.2658)

I suspect the problem lies in the RL training. I follow the workflow proposed in the paper and recommended in your documentation. First, I train with NLL, which gives the following metrics after 30 epochs:

RougeL value = tensor(0.2504)
f1cXb value = tensor(0.5442)

Then I start the RL training, where the metrics begin at:

bertscore value = tensor(0.5181)
radgraph value = tensor(0.2337)

And they improve until they reach:

RougeL value = tensor(0.2585)
f1cXb value = tensor(0.5800)
bertscore value = tensor(0.5421)
radgraph value = tensor(0.2658)

I have not been able to get better results. Note that all of these values are on the TEST SET. I would like to know where I may be going wrong, since your final model reaches:

RougeL value = 26.5
f1cXb value = 62.2
bertscore value = 58.5
radgraph value = 34.7
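To make the gap concrete, here is a minimal sketch that converts my logged fractions to percentages and subtracts them from the reported scores. The dictionary and function names are my own (they are not ViLMedic identifiers), and the sketch assumes the reported numbers are percentages of the same metrics.

```python
# Test-set scores after RL training, as logged (fractions in [0, 1]).
logged = {"RougeL": 0.2585, "f1cXb": 0.5800, "bertscore": 0.5421, "radgraph": 0.2658}

# Scores reported for the final model (assumed to be percentages).
reported = {"RougeL": 26.5, "f1cXb": 62.2, "bertscore": 58.5, "radgraph": 34.7}


def metric_gaps(logged_scores, reported_scores):
    """Return reported minus logged, in percentage points, per metric."""
    return {
        name: round(reported_scores[name] - 100 * logged_scores[name], 2)
        for name in reported_scores
    }


gaps = metric_gaps(logged, reported)
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points below the reported value")
```

By this calculation RougeL is nearly matched, while radgraph shows by far the largest shortfall, which is why I suspect the RL stage rather than the NLL stage.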

Any advice on what I might be doing wrong? Two details that may matter: I do not use data augmentation, and the learning rate of 5e-5 that you propose for RL training does not work well for me, so I have used 5e-6 instead.

I would be very grateful for your help.

Best regards, and thank you very much,

Daniel

jbdel / vilmedic #18