PanZiwei closed this issue 2 years ago.
These are all really great questions.
e8 and e8.1 refer to the motor enzyme (part of the sequencing kit); they are equivalent to kit10 and kit12 respectively. These exact strings match the basecalling model names in bonito.
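To make the naming scheme above concrete, here is a small illustrative parser for model name strings like `dna_r9.4.1_e8.1`. This function is not part of any ONT tool; it simply splits a name into the components described above (sample type, pore version, motor enzyme version):

```python
def parse_model_name(name):
    """Split a model name like 'dna_r9.4.1_e8.1' into its parts.
    Naming scheme as described above: sample type, pore version,
    then the motor enzyme version (e8 ~ kit10, e8.1 ~ kit12)."""
    sample_type, pore, motor = name.split("_", 2)
    return {"sample_type": sample_type, "pore": pore, "motor": motor}

info = parse_model_name("dna_r9.4.1_e8.1")
# info["motor"] -> "e8.1"
```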
I completely appreciate this confusion. Remora is a bit of a paradigm shift in nanopore modbase calling. Previously Megalodon/Rerio modbase models were integrated into flip flop models. This caused issues both in training and degraded canonical basecalling performance. Remora models are now separated from the basecalling models. As mentioned in my 2021 NCM talk, this has a number of benefits.
Practically this means that Remora models can be applied on top of basecalling models. Remora models are also much smaller than the basecalling models, so Remora models are included with the remora repository. When megalodon calls Remora, the specified remora model is loaded and used to call modified bases on top of the selected canonical basecalling model. Hopefully this sheds some light on the role Remora plays in the new software stack, but please reach out with any further questions.
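The layering described above can be sketched in a few lines: a canonical basecaller turns signal into basecalls plus a move table, and a separate modified-base caller consumes those outputs. All class and method names here are illustrative only and do not mirror the real Remora/bonito APIs:

```python
# Illustrative sketch of the decoupled architecture described above.
# These classes do NOT reflect the real Remora or bonito APIs; they
# only show how a modbase caller layers on top of a canonical one.

class CanonicalBasecaller:
    """Stands in for a bonito/guppy model: signal -> basecalls + moves."""
    def call(self, signal):
        # A real model runs a neural network; here we fake a fixed output.
        basecalls = "ACGT" * (len(signal) // 4)
        moves = [1] * len(basecalls)  # one base emitted per stride block
        return basecalls, moves

class ModbaseCaller:
    """Stands in for a Remora model, which consumes the basecaller's
    outputs (signal + basecalls + move table) rather than raw signal."""
    def call(self, signal, basecalls, moves):
        # Toy rule: flag every C in a CG context as a candidate 5mC site.
        # A real model would use `signal` and `moves` to extract the
        # signal chunk anchored at each candidate site.
        return [i for i in range(len(basecalls) - 1)
                if basecalls[i:i + 2] == "CG"]

signal = list(range(16))
basecalls, moves = CanonicalBasecaller().call(signal)
sites = ModbaseCaller().call(signal, basecalls, moves)
```

The point of the separation is visible in the interfaces: the modbase caller can be swapped or retrained without touching the canonical model, which is what the flip-flop framework could not do.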
Remora will not replace megalodon, but will replace the core Nanopore modbase calling software stack. Remora will be called from core basecalling software (megalodon, bonito and soon guppy).
Moving forward, remora will be the core modified base calling logic/API. Megalodon and bonito (and guppy) will use the remora framework to produce modified base predictions. Remora will not be a standalone modified base caller as it relies on the outputs of the basecall processing step.
We do not have immediate plans to release the training datasets presented at NCM 2021, but will look into this in the medium term.
I’m not sure where this pyguppy version pin is coming from, but this is not part of the remora repo to my knowledge. The first release of remora models were trained against guppy 5.0.16, but future remora model versions will be pinned to released bonito model versions. We hope to unify the basecalling model versioning across ONT software to avoid such confusions in the future.
Hi, I tried to clarify the answers based on your reply. Correct me if the following statements are wrong.
Bonito serves as a standalone basecaller under development apart from Guppy. So for the basecalling step, I can use either the default Guppy or Bonito (if accuracy matters more).
In the older version, Megalodon calls a single model to do the basecalling and methylation calling steps at the same time, though theoretically these two steps are separated from each other in the backend. In the latest one, megalodon will first call the basecalling model to get an intermediate result, then call the modified bases model, which uses the outputs of the basecall processing step to produce the final methylation calling result.
Megalodon and taiyaki are needed for data processing only if I want to use remora to train a new model. I can use remora alone with a pre-trained model to do methylation calling via its logic/API. Is there any reason that the model is saved in .onnx format?
The logic of remora is somewhat similar to megalodon because both utilize the outputs of the neural network after the basecalling step. So the whole architecture is: Input: nanopore fast5 files -> basecalling by Guppy or Bonito -> methylation calling by Megalodon or remora -> Output: base information + methylation calling results. Is it possible to extract the intermediate results after basecalling? If so, how? Right now it seems that only the final output is available.
Would really appreciate it if the data can be shared in the near future for community usage.
I am trying to install remora and megalodon together in a virtual environment; maybe that's the reason for the pyguppy version incompatibility, since the current ont-pyguppy-client-lib is updated to 5.1.9 (https://pypi.org/project/ont-pyguppy-client-lib/).
Thank you so much for your help!
Also, do you have an idea of the running speed and resource requirements for remora, since it is claimed that remora is a small model? I tried Megalodon v2.4.0 with remora 0.1.1 on a GPU node and after 6h it is still running. For comparison, I previously used Megalodon v2.3.4 on the same dataset and it only took ~2h to get the results. I also got the error below (but the job is not paused because of the error):
[08:59:29] Running Megalodon version 2.4.0
[08:59:29] Loading guppy basecalling backend
[08:59:34] CRF models are not fully supported.
I also checked the guppy log file and it seems that Guppy is stuck with a disconnect issue? I didn't encounter this issue before with the older version. Not sure whether the issue is coming from the Guppy side or the remora side.
2021-12-03 08:59:30.733174 [guppy/info] crashpad_handler not supported on this platform.
2021-12-03 08:59:30.733940 [guppy/info] Listening on port 44018.
2021-12-03 08:59:30.971271 [guppy/info] CUDA device 0 (compute 6.0) initialised, memory limit 17071734784B (16804216832B free)
2021-12-03 08:59:34.610117 [guppy/message]
Config loaded:
config file: data/dna_r9.4.1_450bps_hac.cfg
model file: data/template_r9.4.1_450bps_hac.jsn
model version id 2021-05-17_dna_r9.4.1_minion_384_d37a2ab9
adapter scaler model file: None
2021-12-03 08:59:34.614397 [guppy/message] Starting server on port: 44018
2021-12-03 08:59:34.679019 [guppy/info] client connection request. ["dna_r9.4.1_450bps_hac:>timeout_interval=15000>client_name=>alignment_type=auto:::"]
2021-12-03 08:59:34.679228 [guppy/info] New client connected Client 1 anonymous_client_1 id: 3a06644b-453a-421c-9bb6-d86bda1899b5 (connection string = 'dna_r9.4.1_450bps_hac:>timeout_interval=15000>client_name=>alignment_type=auto:::').
2021-12-03 08:59:34.831218 [guppy/info] Client 1 anonymous_client_1 id: 3a06644b-453a-421c-9bb6-d86bda1899b5 has disconnected.
Thank you so much for your help!
Responses to comments here:
Bonito serves as a standalone basecaller under development apart from Guppy. So for the basecalling step, I can use either the default Guppy or Bonito (if accuracy matters more).
Bonito is currently the research basecaller, so updates to models and code will be pushed to bonito more quickly, while Guppy will maintain production basecalling capabilities. Currently Remora is only runnable from within Bonito (and megalodon). Guppy support is coming in January.
though theoretically these two steps are separated from each other in the backend
The previous flip-flop modbase models used in megalodon did actually call canonical and modified bases at the same time. This framework did not allow for the separation of these two tasks. This was part of the motivation for the shift to the Remora framework for modified bases.
I can use remora alone with the pre-trained model to do methylation calling via logic/API
Yes. Remora can be used independently from Taiyaki and Megalodon with a pre-trained model.
Is there any reason that the model is saved in .onnx format?
ONNX provides a reasonable format for the transfer and execution of Remora models and allows for the packaging of metadata along with the model. Alternatives were considered and there was no perfect solution (pytorch checkpoints, for example, would require separate model releases for CPU and GPU, which is not suitable).
The logic of remora is somewhat similar to megalodon because both utilize the outputs of the neural network after the basecalling step
This is not quite true. Remora only uses the signal+basecalls+move table. Megalodon with flip-flop modbase models used the neural network output directly.
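The move table mentioned above is what anchors basecalls back to raw signal: it is a per-stride-block 0/1 vector where a 1 means a new base was emitted at that block, so the signal coordinate of a base is its block index times the model stride. A small sketch with toy numbers (not real model output):

```python
def base_to_signal_positions(moves, stride):
    """Map each emitted base to the raw-signal index where its
    stride block begins. moves[i] == 1 means a new base was
    emitted at stride block i; stride is the model's downsampling
    factor (samples of signal per neural-network output step)."""
    return [i * stride for i, m in enumerate(moves) if m == 1]

# Toy example: 8 stride blocks, stride of 5 samples, 4 bases emitted.
moves = [1, 0, 1, 1, 0, 0, 1, 0]
positions = base_to_signal_positions(moves, stride=5)
# positions -> [0, 10, 15, 30]
```

This mapping is why Remora only needs signal + basecalls + moves: the move table tells it which signal chunk surrounds each candidate modified-base site, with no access to the basecaller's internal network outputs.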
Is it possible to extract the intermediate results or framework after base calling? If so, how?
It is possible to use the pyguppy API for this purpose. The intermediate output is too large to store on disk, so a file-based option is not available. See the docs here: https://pypi.org/project/ont-pyguppy-client-lib/
Would really appreciate it if the data can be shared in the near future for community usage.
This is under consideration internally.
For comparison, I used Megalodon v2.3.4 before on the same datasets and it only took ~2h to get the results.
I would suggest running megalodon with the same guppy model for a comparison of the Remora impact (the impact will be larger with fast models, but should be minimal for sup models). The basecalling models have been updated within guppy, so a comparison against an older guppy version is not applicable.
@marcus1487 Sorry for the following questions about model performance; I saw your reply in https://github.com/nanoporetech/megalodon/issues/239#issuecomment-1023427379
We use several different samples to verify the results of each Remora model. These samples include native data (generally matched to bisulfite or another technique), oligonucleotides with modified bases and in vitro modified PCR samples. We are working on making these validation tasks easier to replicate within Remora.
The 20%/80% cutoff seems to work quite well for Remora models, as it did for the previous flip-flop modified base models (run from megalodon).
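The 20%/80% cutoff above amounts to a simple per-call filter on the modified-base probability: calls confidently canonical or confidently modified are kept, and the ambiguous middle band is discarded. A minimal sketch of that rule (illustrative only, not Remora/megalodon code):

```python
def classify_site(mod_prob, lower=0.2, upper=0.8):
    """Apply the 20%/80% confidence cutoff discussed above:
    calls with a modified-base probability between the two
    thresholds are filtered out as ambiguous."""
    if mod_prob <= lower:
        return "canonical"
    if mod_prob >= upper:
        return "modified"
    return "filtered"

calls = [classify_site(p) for p in (0.05, 0.5, 0.95)]
# calls -> ["canonical", "filtered", "modified"]
```

Tightening the thresholds trades coverage (more sites filtered) for per-call accuracy, which is the usual knob when aggregating per-read calls into per-site methylation fractions.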
Hi, Thanks for this amazing tool. I have several questions and would really appreciate it if you can help.
What's the difference between the pre-trained models dna_r9.4.1_e8 and dna_r9.4.1_e8.1? What's the relationship between the remora pre-trained models and the models in the rerio repo? In the latest Megalodon it seems that Megalodon will call the remora model, but in the previous one Megalodon used the rerio model. A little confused here.
Is remora independent of Megalodon and Taiyaki? Will remora replace Megalodon somehow in the future? Can you provide more information on its usage?
Is remora a new methylation calling tool or not? If so, how can I use remora to call methylation? Any plan for a detailed tutorial like Megalodon's?
Do you have any plan to release the training datasets for remora shown at NCM2021?
It seems that by default remora requires ont-pyguppy-client-lib==5.1.9; however, the latest version of Guppy available in the community is only 5.0.16. Is there a delay in the Guppy release? Or is it possible to use remora with an older version of Guppy? Thank you so much for your help!
Best, Ziwei