nanoporetech / remora

Methylation/modified base calling separated from basecalling.
https://nanoporetech.com

Several issues on remora usage #2

Closed PanZiwei closed 2 years ago

PanZiwei commented 2 years ago

Hi, Thanks for this amazing tool. I have several questions and would really appreciate it if you can help.

  1. What's the difference between the pre-trained models dna_r9.4.1_e8 and dna_r9.4.1_e8.1?

  2. What's the relationship of the remora pre-trained models and models in rerio repo? In the latest Megalodon it seems that Megalodon will call the remora model, but in the previous one Megalodon is using rerio model. A little confused here.

  3. Is remora independent from Megalodon and Taiyaki? Will remora replace Megalodon somehow in the future? Can you provide more information on its usages?

  4. Is remora a new methylation calling tool or not? If so, how can I use the remora to call methylations? Any plan on a detailed tutorial like Megalodon?

  5. Do you have any plan to release the training datasets for remora shown on NCM2021?

  6. It seems that in default remora ont-pyguppy-client-lib==5.1.9, however, the latest version of Guppy is only 5.0.16 in the community. Is there a delay for the Guppy release? Or is it possible to use remora on the older version of Guppy?

Thank you so much for your help!

Best, Ziwei

marcus1487 commented 2 years ago

These are all really great questions.

  1. e8 and e8.1 refer to the motor enzyme (part of the sequencing kit). These are equivalent to kit10 and kit12. These exact strings match the basecalling model names in bonito.

  2. I completely appreciate this confusion. Remora is a bit of a paradigm shift in nanopore modbase calling. Previously, Megalodon/Rerio modbase models were integrated into flip-flop basecalling models. This caused issues in training and degraded canonical basecalling performance. Remora models are now separated from the basecalling models. As mentioned in my 2021 NCM talk, this has a number of benefits.

Practically this means that Remora models can be applied on top of basecalling models. Remora models are also much smaller than the basecalling models, so they are included with the remora repository. When megalodon calls Remora, the specified remora model is loaded and used to call modified bases on top of the selected canonical basecalling model. Hopefully this sheds some light on the role Remora plays in the new software stack, but please reach out with any further questions.

  3. Yes, remora is (at its core) independent of megalodon and taiyaki. Megalodon and taiyaki are used to prepare training data for remora, but this may be swapped out at a later date. The Remora API example in the README shows how the core Remora functionality works (with a pretrained model); a hedged sketch of that style of call appears after this list. This is the interface that megalodon and bonito will use to call modified bases going forward.

Remora will not replace megalodon, but will replace the core Nanopore modbase calling software stack. Remora will be called from core basecalling software (megalodon, bonito and soon guppy).

  4. Moving forward, remora will be the core modified base calling logic/API. Megalodon and bonito (and guppy) will use the remora framework to produce modified base predictions. Remora will not be a standalone modified base caller as it relies on the outputs of the basecall processing step.

  5. We do not have immediate plans to release the training datasets presented at NCM 2021, but will look into this in the medium term.

  6. I’m not sure where this pyguppy version pin is coming from, but it is not part of the remora repo to my knowledge. The first release of Remora models was trained against guppy 5.0.16, but future remora model versions will be pinned to released bonito model versions. We hope to unify the basecalling model versioning across ONT software to avoid such confusion in the future.
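To make item 3 above concrete, below is a rough, hypothetical sketch of the kind of API-level call the README example describes. The module paths, class and function names (load_onnx_model, RemoraRead, call_read_mods) are assumptions recalled from memory rather than a verbatim copy of the README, so please defer to the repository's own example:

# Hypothetical sketch of Remora API-level usage; the names below are
# assumptions, not verified against the repository. See the README example.
import numpy as np
from remora.model_util import load_onnx_model   # assumed loader for .onnx models
from remora.data_chunks import RemoraRead       # assumed per-read container
from remora.inference import call_read_mods     # assumed inference entry point

# Load a pre-trained model shipped with the repository (placeholder filename).
model, model_metadata = load_onnx_model("remora_model.onnx")

# A Remora read bundles the signal, the basecalled sequence and the mapping
# from each base to its span in the signal (derived from the move table).
signal = np.zeros(4000, dtype=np.float32)    # placeholder normalised signal
seq = "ACGT" * 100                           # placeholder basecalls
seq_to_sig_map = np.arange(0, 4010, 10)      # placeholder base->signal boundaries
read = RemoraRead(signal, seq_to_sig_map, str_seq=seq)

# Per-read modified base probabilities at each motif position.
mod_probs = call_read_mods(read, model, model_metadata, return_mod_probs=True)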

PanZiwei commented 2 years ago

Hi, I tried to summarize my understanding of the answers based on your reply. Correct me if the following statements are wrong.

  1. Bonito serves as a separate basecaller under development apart from Guppy. So for the basecalling step, I can either use the default Guppy or Bonito (if accuracy matters more).

  2. In the older version, Megalodon calls a single model to do the basecalling and methylation calling steps at the same time, though theoretically these two steps are separated from each other in the backend. In the latest one, megalodon will first call the basecalling model to get an intermediate result, then call the modified bases model, which uses the outputs of the basecalling step to produce the final methylation calling result.

  3. Megalodon and taiyaki are only needed for data preparation if I want to use remora to train a new model. I can use remora alone with a pre-trained model to do methylation calling via its logic/API. Is there any reason that the model is saved in .onnx format?

  4. The logic of remora is somewhat similar to megalodon, because both of them utilize the outputs of the neural network after the basecalling step. So the whole architecture is: input nanopore fast5 files -> basecalling framework (Guppy or Bonito) -> methylation calling framework (Megalodon or remora) -> output: base information + methylation calling results. Is it possible to extract the intermediate results or framework after base calling? If so, how? Right now it seems that only the final output is available.

  5. I would really appreciate it if the data could be shared in the near future for community usage.

  6. I am trying to install remora and megalodon together in a virtual environment; maybe that is the reason for the pyguppy version incompatibility, since the current ont-pyguppy-client-lib has been updated to 5.1.9 (https://pypi.org/project/ont-pyguppy-client-lib/).

Thank you so much for your help!

PanZiwei commented 2 years ago

Also, do you have an idea about the running speed and resource requirements for remora, since it is claimed that remora is a small model? I tried Megalodon v2.4.0 with remora 0.1.1 on a GPU node and after 6h it is still ongoing. For comparison, I used Megalodon v2.3.4 before on the same datasets and it only took ~2h to get the results. And I got the error below (but the job is not paused because of the error):

[08:59:29] Running Megalodon version 2.4.0
[08:59:29] Loading guppy basecalling backend
[08:59:34] CRF models are not fully supported.

I also checked the guppy log file and it seems that Guppy is stuck with a disconnect issue. I didn't encounter this issue before with the older version. I am not sure whether the issue is coming from the Guppy side or the remora side.

2021-12-03 08:59:30.733174 [guppy/info] crashpad_handler not supported on this platform.
2021-12-03 08:59:30.733940 [guppy/info] Listening on port 44018.
2021-12-03 08:59:30.971271 [guppy/info] CUDA device 0 (compute 6.0) initialised, memory limit 17071734784B (16804216832B free)
2021-12-03 08:59:34.610117 [guppy/message]
Config loaded:
config file:               data/dna_r9.4.1_450bps_hac.cfg
model file:                data/template_r9.4.1_450bps_hac.jsn
model version id           2021-05-17_dna_r9.4.1_minion_384_d37a2ab9
adapter scaler model file: None
2021-12-03 08:59:34.614397 [guppy/message] Starting server on port: 44018
2021-12-03 08:59:34.679019 [guppy/info] client connection request. ["dna_r9.4.1_450bps_hac:>timeout_interval=15000>client_name=>alignment_type=auto:::"]
2021-12-03 08:59:34.679228 [guppy/info] New client connected Client 1 anonymous_client_1 id: 3a06644b-453a-421c-9bb6-d86bda1899b5 (connection string = 'dna_r9.4.1_450bps_hac:>timeout_interval=15000>client_name=>alignment_type=auto:::').
2021-12-03 08:59:34.831218 [guppy/info] Client 1 anonymous_client_1 id: 3a06644b-453a-421c-9bb6-d86bda1899b5 has disconnected.

Thank you so much for your help!

marcus1487 commented 2 years ago

Responses to comments here:

Bonito serves as a separate basecaller under development apart from Guppy. So for the basecalling step, I can either use the default Guppy or Bonito (if accuracy matters more).

Bonito is currently the research basecaller, so updates to models and code will be pushed to bonito more quickly, while Guppy will maintain production basecalling capabilities. Currently Remora is only runnable from within Bonito (and megalodon). Guppy support is coming in January.

though theoretically these two steps are separated from each other in the backend

The previous flip-flop modbase models used in megalodon did actually call canonical and modified bases at the same time. This framework did not allow for the separation of these two tasks. This was part of the motivation for the shift to the Remora framework for modified bases.

I can use remora alone with a pre-trained model to do methylation calling via its logic/API

Yes. Remora can be used independently from Taiyaki and Megalodon with a pre-trained model.

Is there any reason that the model is saved in .onnx format?

ONNX provides a reasonable format for the transfer and execution of Remora models and allows metadata to be packaged along with the model. Alternatives were considered and there was no perfect solution (pytorch checkpoints, for example, would require separate model releases for CPU and GPU, which is not suitable).
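As an illustration of the metadata point, here is a small sketch (assuming onnxruntime is installed; "remora_model.onnx" is a placeholder path, not a file named in this thread) of how metadata packaged inside an ONNX file can be inspected:

# Sketch: inspecting the metadata embedded in an ONNX model with onnxruntime.
import onnxruntime as ort

sess = ort.InferenceSession("remora_model.onnx")

# ONNX models can carry arbitrary key/value metadata alongside the graph,
# which is how details such as motif and modified base codes can travel
# with the model itself.
print(sess.get_modelmeta().custom_metadata_map)

# The graph also records the input names, shapes and types it expects.
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)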

The logic of remora is somewhat similar to megalodon, because both of them utilize the outputs of the neural network after the basecalling step

This is not quite true. Remora only uses the signal+basecalls+move table. Megalodon with flip-flop modbase models used the neural network output directly.
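As a rough illustration of what "signal+basecalls+move table" means, here is a minimal sketch (not Remora or Guppy code; the stride and trimming handling in the real tools may differ) of how a move table maps each called base back to a position in the raw signal:

# Illustrative only: converting a basecaller move table into per-base signal
# coordinates. Real pipelines also account for signal trimming and scaling.
import numpy as np

def moves_to_sig_map(moves, stride, trimmed_samples=0):
    # moves: 0/1 array with one entry per neural network output block;
    # stride: number of raw samples covered by each block.
    # Returns the signal index at which each called base starts.
    block_index = np.flatnonzero(np.asarray(moves))
    return trimmed_samples + block_index * stride

print(moves_to_sig_map([1, 0, 1, 1, 0, 0, 1], stride=5))  # -> [ 0 10 15 30]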

Is it possible to extract the intermediate results or framework after base calling? If so, how?

It is possible to use the pyguppy API for this purpose. The intermediate output is too large to store on disk, so saving it is not an option. See the docs here: https://pypi.org/project/ont-pyguppy-client-lib/
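A hedged sketch of that route, assuming a guppy_basecall_server is already running locally; the address, config name, placeholder signal values and argument order are assumptions, so check the ont-pyguppy-client-lib documentation linked above for the exact signatures and returned fields:

# Hedged sketch: requesting basecalls (including the move table) back through
# the pyguppy client API.
import time
import numpy as np
from pyguppy_client_lib.pyclient import PyGuppyClient
from pyguppy_client_lib.helper_functions import package_read

client = PyGuppyClient("127.0.0.1:5555", "dna_r9.4.1_450bps_hac")
client.connect()

# Submit one read's raw signal (placeholder data; real offset/scaling come
# from the fast5 channel metadata). Argument order assumed:
# read_id, raw signal, daq offset, daq scaling.
raw = np.zeros(4000, dtype=np.int16)
client.pass_read(package_read("read_0001", raw, 0.0, 1.0))

# Poll until the server returns the completed read, which carries the
# basecalls and per-read datasets such as the move table.
completed = []
while not completed:
    completed = client.get_completed_reads()
    time.sleep(0.1)

client.disconnect()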

I would really appreciate it if the data could be shared in the near future for community usage.

This is under consideration internally.

For comparison, I used Megalodon v2.3.4 before on the same datasets and it only took ~2h to get the results.

I would suggest running megalodon with the same guppy model for a comparison of the Remora impact (the impact will be larger with fast models, but should be minimal for sup models). The basecalling models have been updated within guppy, so a comparison with an older guppy version is not applicable.

PanZiwei commented 2 years ago

@marcus1487 Sorry for the following questions related to model performance, since I saw your reply in https://github.com/nanoporetech/megalodon/issues/239#issuecomment-1023427379

  1. How do you determine whether a remora model's result is correct or not at this moment, since the actual ground truth is a mixed state?
  2. What cutoff did you use to get the per-site methylation level from the per-read predictions? I remember the default cut-off in megalodon alone is 0.8; is that parameter applicable to all remora models as well?
marcus1487 commented 2 years ago

We use several different samples to verify the results of each Remora model. These samples include native data (generally matched to bisulfite or another technique), oligonucleotides with modified bases and in vitro modified PCR samples. We are working on making these validation tasks easier to replicate within Remora.

The 20%/80% cut-off seems to work quite well for Remora models, as it did with the previous flip-flop modified base models (run from megalodon).
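For readers wondering how such a cut-off translates into a per-site methylation level, here is an illustrative sketch (not megalodon's actual aggregation code): per-read calls at or above 0.8 count as modified, at or below 0.2 as canonical, and calls in between are treated as ambiguous and dropped before computing the site-level fraction.

# Illustrative only: aggregating per-read modified base probabilities into a
# per-site methylation level with the 20%/80% cut-offs discussed above.
from collections import defaultdict

def per_site_levels(per_read_calls, low=0.2, high=0.8):
    # per_read_calls: iterable of (site, modified_probability) pairs.
    counts = defaultdict(lambda: [0, 0])  # site -> [n_modified, n_canonical]
    for site, prob in per_read_calls:
        if prob >= high:
            counts[site][0] += 1
        elif prob <= low:
            counts[site][1] += 1
        # calls between the thresholds are ambiguous and ignored
    return {site: mod / (mod + canon)
            for site, (mod, canon) in counts.items() if mod + canon}

calls = [("chr1:1000", 0.95), ("chr1:1000", 0.10), ("chr1:1000", 0.55)]
print(per_site_levels(calls))  # {'chr1:1000': 0.5}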