nanoporetech / remora

Methylation/modified base calling separated from basecalling.
https://nanoporetech.com

80% accuracy threshold interpretation #104

Closed Yijun-Tian closed 1 year ago

Yijun-Tian commented 1 year ago

Hello, when training a customized base modification model, is the 80% accuracy threshold arbitrary? Would a model with an accuracy of 79.6% still have some correctness?

marcus1487 commented 1 year ago

I'm not sure which 80% threshold you are referring to. Is this the "breach threshold" found in the training script? If so, this is intended to flag runs that have entered a broken state of training (which happens with neural networks and may indicate the need to adjust some hyperparameters). If not, can you point to the location of this threshold?

In general, I would say there are certainly use cases where a model with less than 80% accuracy can provide biological insight, if that is the core of the question here.

Yijun-Tian commented 1 year ago

Hi @marcus1487, the threshold notification appears in the log.txt file during model training:

DEBUG [10:26:26:MainProcess:MainThread:train_model.py:443] 80.0% accuracy threshold surpassed
DEBUG [10:26:26:MainProcess:MainThread:train_model.py:450] Saving best model after 1 epochs with val_acc 0.930271992835942
DEBUG [11:44:42:MainProcess:MainThread:train_model.py:450] Saving best model after 2 epochs with val_acc 0.9331920110958944
DEBUG [13:22:18:MainProcess:MainThread:train_model.py:450] Saving best model after 3 epochs with val_acc 0.9348291102820673
DEBUG [14:40:23:MainProcess:MainThread:train_model.py:450] Saving best model after 4 epochs with val_acc 0.9356327930556212
DEBUG [16:08:15:MainProcess:MainThread:train_model.py:450] Saving best model after 5 epochs with val_acc 0.9358208145967681
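As an aside, the per-epoch val_acc values in log.txt are easy to pull out programmatically if you want to track training progress. This is a minimal sketch (not part of Remora itself); the regex is built from the log lines quoted above, so adjust it if your Remora version formats log.txt differently:

```python
import re

# Matches the "Saving best model" lines quoted above; the log format
# is taken from this thread and may differ between Remora versions.
VAL_ACC_RE = re.compile(r"Saving best model after (\d+) epochs with val_acc ([0-9.]+)")

def parse_val_acc(log_lines):
    """Return (epoch, val_acc) tuples from Remora log.txt lines."""
    results = []
    for line in log_lines:
        m = VAL_ACC_RE.search(line)
        if m:
            results.append((int(m.group(1)), float(m.group(2))))
    return results

log = [
    "DEBUG [10:26:26:MainProcess:MainThread:train_model.py:443] 80.0% accuracy threshold surpassed",
    "DEBUG [10:26:26:MainProcess:MainThread:train_model.py:450] Saving best model after 1 epochs with val_acc 0.930271992835942",
    "DEBUG [11:44:42:MainProcess:MainThread:train_model.py:450] Saving best model after 2 epochs with val_acc 0.9331920110958944",
]
print(parse_val_acc(log))  # [(1, 0.930271992835942), (2, 0.9331920110958944)]
```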

In my case, one of my all-context 6mA models stops at a val_acc of 0.796, no matter how I try to improve it with the options in the prep steps, such as context length, chunk size, or the refine table. As you suggested, the 0.796 model does shed some light on my characterization procedure, so that is good news. But a val_acc lower than 0.8 is not ideal, since I assume a higher val_acc would mean more true positive and true negative calls. Could you suggest any options in the training process that I could try to see what happens?
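On the assumption behind that last point: overall accuracy on a label-balanced validation set is the mean of the true-positive and true-negative rates, so the same val_acc can hide very different error profiles. A toy illustration (not Remora code, just standard classification arithmetic):

```python
# Toy example: the same overall accuracy can come from balanced or
# skewed TPR/TNR. Labels: 1 = modified chunk, 0 = canonical chunk.
def rates(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return tp / pos, tn / neg, (tp + tn) / len(y_true)

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]  # all mods caught, 2 false positives
tpr, tnr, acc = rates(y_true, y_pred)
print(tpr, tnr, acc)  # 1.0 0.6 0.8 -- 80% accuracy, but a skewed split
```

So a 0.796 model may be much weaker on one label than the other, which is worth checking before tuning hyperparameters.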

marcus1487 commented 1 year ago

This 80% threshold is quite arbitrary; it was not designed with more complex modified bases in mind. The fact that you are seeing this with 6mA (assuming in DNA) is a bit surprising. There are many possibilities that may cause lower accuracy in custom models. Internally, we have often found that some point in the data preparation can be the source of lower accuracy. Assessing the confidence in the ground truth labels for the Remora training data would be the first item I would consider. Can you share more details on your training data? We have sometimes found marginal gains by increasing the chunk size or other hyperparameters (which will be much easier in the next release), but these are generally much lower impact compared to a focus on the raw data itself. I hope this helps!
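For readers trying the chunk-size and refine-table suggestions, a data-preparation invocation along these lines is where those knobs live. This is a sketch only: the flag names follow my reading of the Remora CLI and may differ between versions (verify against `remora dataset prepare --help`), and all file paths and values are placeholders:

```shell
# Hedged sketch of re-preparing 6mA training chunks with a wider chunk
# context and a k-mer level refine table. Flag names are assumptions
# based on the Remora CLI docs; check --help for your installed version.
remora dataset prepare \
  mod_reads.pod5 mod_mappings.bam \
  --output-path mod_chunks \
  --motif A 0 \
  --mod-base a 6mA \
  --chunk-context 100 100 \
  --kmer-context-bases 6 6 \
  --refine-kmer-level-table levels.txt \
  --refine-rough-rescale
```

As the maintainer notes above, though, these settings tend to yield only marginal gains compared with improving confidence in the ground-truth labels themselves.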

Yijun-Tian commented 1 year ago

Thank you @marcus1487. The training data I used was generated from whole-genome-amplified DNA treated with a methyltransferase to mimic positive samples. Since I need to detect 6mA and 5mCG modifications, I used one sample treated with both EcoGII and M.SssI as the positive sample and one treated with M.SssI only as the negative sample. I will watch for the next release for more information about hyperparameter specification.
