nanoporetech / remora

Methylation/modified base calling separated from basecalling.
https://nanoporetech.com

model training with one sample #96

Closed kenneditodd closed 1 year ago

kenneditodd commented 1 year ago

Hello,

I want to detect RNA modifications (R9.4.1_70bps) for a single sample. In the data preparation section of the training documentation, the control and treatment datasets are merged. Does model training require at least two samples? I only have one sample, as this is our test/pilot run.

marcus1487 commented 1 year ago

In order to train a model, Remora needs a dataset with at least two labels. It is possible to extract canonical and modified training chunks from the same sample with two runs of the remora dataset prepare command. This would most likely be done by specifying two separate BED files, one with the locations of the canonical bases in the reference and one with the locations of the modified bases. If this does not fit your training sample, then please provide more information about your sample and we can help guide your efforts.
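As a rough illustration of the two-BED-file approach, the Python sketch below splits a list of labelled reference sites into one BED file of canonical positions and one of modified positions. The site list, label names, file names, and helper functions are all hypothetical, positions are assumed to be 0-based, and the sketch does not call Remora itself; each file would then be supplied to its own run of remora dataset prepare.

```python
from collections import defaultdict


def split_sites_by_label(sites):
    """Group (contig, position, strand, label) tuples by their label."""
    by_label = defaultdict(list)
    for contig, pos, strand, label in sites:
        by_label[label].append((contig, pos, strand))
    return by_label


def bed_lines(positions):
    """Format 0-based positions as six-column BED lines (end = start + 1)."""
    return [
        f"{contig}\t{pos}\t{pos + 1}\t.\t0\t{strand}"
        for contig, pos, strand in sorted(positions)
    ]


def write_bed(path, positions):
    """Write one BED line per position to the given path."""
    with open(path, "w") as fh:
        for line in bed_lines(positions):
            fh.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical ground truth: where modified and canonical bases are known.
    sites = [
        ("transcript_1", 104, "+", "modified"),
        ("transcript_1", 250, "+", "canonical"),
        ("transcript_2", 17, "+", "canonical"),
    ]
    by_label = split_sites_by_label(sites)
    write_bed("canonical_sites.bed", by_label["canonical"])
    write_bed("modified_sites.bed", by_label["modified"])
```

Keeping the two position sets in separate files means each run extracts chunks for exactly one label, which is what allows a single sample to contribute both labels.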

I would note that training from a single sample is not likely to produce a robust training set. We are actually working on a major upgrade in remora 3.0 to make much larger and more diverse training sets possible. I would warn against assuming that results on a smaller sample will translate to a scaled up experiment. Happy to elaborate further if you have specific questions here.

kenneditodd commented 1 year ago

@marcus1487 This was very helpful. Thank you for this information. We are going to run more test samples so we aren't limited by tools we hope to use.

  1. I see you need two labels when training a model, control and treatment. How does including a treatment sample affect training data? Is it beneficial to include the treatment sample so you aren't biasing modification detection over a control when you actually test data? I'm wondering why one wouldn't favor using all control samples.

  2. When I actually go to use the model, should I make sure my test data is not a part of my training data?

marcus1487 commented 1 year ago

I'm not sure I am understanding your questions. Remora models predict whether a base in the canonical sequence is a modified alternative or a canonical base. Thus the training data must have examples of both canonical and modified locations in order to train the model. A dataset containing only control labels would give the model nothing to predict.
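The requirement for both labels can be checked before training with a small sanity test. The sketch below is plain Python, not part of the Remora API; the label names "canonical" and "modified" and the `check_labels` helper are assumptions made for illustration.

```python
from collections import Counter


def check_labels(labels, required=("canonical", "modified")):
    """Return per-label counts, raising if any required label has no examples."""
    counts = Counter(labels)
    missing = [name for name in required if counts[name] == 0]
    if missing:
        raise ValueError(f"no training examples for label(s): {', '.join(missing)}")
    return counts


if __name__ == "__main__":
    # A control-only dataset fails the check: there is nothing to discriminate.
    try:
        check_labels(["canonical"] * 1000)
    except ValueError as err:
        print(err)
```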

In terms of samples, you'd have to give a bit more background in order to decide whether one sample would be sufficient, but in general we use many samples to train very robust models. For very specific tasks you may be ok training from a more limited set of samples.

In terms of validation, this is up to the target application, but it is highly recommended to test on held-out data. Remora does this as part of the training procedure by holding a set fraction of the training data out and evaluating on it after each epoch of training. It is even better to test on an orthogonal data type. We generally synthesize oligonucleotides with known canonical sequence and modified base locations for validation and never include these samples in training. Native samples are often the final target, but generally have poorer ground truth. Thus reasonable results on biological samples (e.g. low modified base content in a control sample) are also a good benchmark.
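Two of these checks can be sketched in a few lines of Python. This is an illustration only and not how Remora implements its split: the function names, the 10% held-out fraction, and the 0.5 probability threshold are assumptions. The first function sets aside a fixed fraction of examples for evaluation, and the second reports the fraction of sites called modified, which should be low when run on a control sample.

```python
import random


def split_held_out(items, held_out_fraction=0.1, seed=0):
    """Shuffle deterministically and return (training, held_out) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


def modified_fraction(mod_probs, threshold=0.5):
    """Fraction of sites whose modified-base probability exceeds the threshold."""
    if not mod_probs:
        raise ValueError("no probabilities supplied")
    return sum(p > threshold for p in mod_probs) / len(mod_probs)


if __name__ == "__main__":
    train, held_out = split_held_out(range(1000))
    print(len(train), len(held_out))

    # Hypothetical per-site probabilities from a control (unmodified) sample.
    control_probs = [0.02, 0.10, 0.91, 0.05]
    print(modified_fraction(control_probs))
```

Splitting by read or by sample rather than by individual chunk is usually the safer choice, since chunks from the same read are not independent of each other.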

marcus1487 commented 1 year ago

Hopefully the information provided has helped resolve this issue. If you have further questions, please re-open this issue.