ml4bio / RNA-FM

RNA foundation model
https://ml4bio.github.io/RNA-FM/
MIT License

Clarification on evaluation #10

Closed Sazan-Mahbub closed 8 months ago

Sazan-Mahbub commented 9 months ago

Hi,

Great work!

I am trying to reproduce the SS prediction results (attached image) for ArchiveII600 (3911 sequences) and TS0 (1305 sequences).

While I could exactly reproduce UFold's scores, I could not reproduce RNA-FM's scores in the same way. I used the RNA-FM model weights from here.

I got an F1 score of 0.666 for TS0 using "RNA-FM-ResNet_bpRNA.pth"; the paper reported 0.704. For ArchiveII600, I got 0.933 using "RNA-FM-ResNet_RNAStralign.pth"; the paper reported 0.941.

I was wondering if the evaluation in your paper was done differently from how UFold did it.

I'd really appreciate any help. Thank you!

[attached image: reported SS prediction results]
mydkzgj commented 8 months ago

Hi, @Sazan-Mahbub. Thank you for your interest in our work. The performances reported in this section of our preprint correspond to our fine-tuned models, whose backbone (RNA-FM) was trained together with the downstream task. However, we later replaced them with feature-based models, in which RNA-FM was frozen during training, which explains the degraded performance here. I have checked that your results are similar to, though slightly lower than, ours; I think the difference may come from threshold selection. So don't worry about your metric computation, it is correct.
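For readers unfamiliar with the distinction, the two setups can be sketched in PyTorch. This is a minimal illustration with stand-in modules, not RNA-FM's actual training code; `backbone` and `head` are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Stand-ins: "backbone" plays the role of the pretrained language model
# (e.g. RNA-FM) and "head" the downstream SS-prediction head.
backbone = nn.Linear(8, 8)
head = nn.Linear(8, 2)

# Feature-based setup: freeze the backbone so only the head is trained.
# (In the fine-tuned setup, this loop is simply omitted.)
for p in backbone.parameters():
    p.requires_grad = False

# Only the head's (trainable) parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```

In the feature-based setup the backbone acts as a fixed feature extractor, which typically trades some downstream accuracy for much cheaper training.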

Sazan-Mahbub commented 8 months ago

Hi @mydkzgj,

Thank you for your reply and clarifications! This has been really helpful for us.

Using a sigmoid on the output and thresholding at 0.5, I am now getting F1 = 0.672 for TS0 and F1 = 0.934 for ArchiveII600 (with the same checkpoints mentioned before). I hope these are closer to the actual ones.
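For anyone else reproducing this, the sigmoid-then-threshold evaluation can be sketched roughly like so. This is a simplified illustration, not the actual UFold/RNA-FM evaluation code; `pair_f1` and its dict/set inputs are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_f1(logits, true_pairs, threshold=0.5):
    """F1 over predicted base pairs.

    logits: dict mapping (i, j) position pairs (i < j) to raw model scores.
    true_pairs: set of ground-truth (i, j) base pairs.
    A pair is predicted when sigmoid(logit) >= threshold.
    """
    pred_pairs = {ij for ij, z in logits.items() if sigmoid(z) >= threshold}
    tp = len(pred_pairs & true_pairs)  # true positives
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with `logits = {(0, 5): 3.0, (1, 4): -2.0, (2, 3): 1.0}` and `true_pairs = {(0, 5), (1, 4)}`, the pairs (0, 5) and (2, 3) clear the 0.5 threshold, giving precision = recall = 0.5 and F1 = 0.5. Raising or lowering the threshold shifts the precision/recall trade-off, which is why threshold selection alone can move the reported F1 by a point or two.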

mydkzgj commented 8 months ago

Yeah, they're pretty much the same.