rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Parse medaka model from raw Nanopore reads #97

Closed jmtsuji closed 4 months ago

jmtsuji commented 9 months ago

Newer Nanopore FastQ files (generated by Dorado?) seem to have the basecalling model information built into the FastQ header. We could try to parse this into the basecalling model format used by medaka so that the user doesn't need to input it manually. Different strategies are possible:

The last "hybrid" option is probably best -- what do you think @LeeBergstrand ?

LeeBergstrand commented 8 months ago

@jmtsuji Short term, let's do it config-wise: if you're not providing a custom config, the model is taken from the first long-read FastQ file. I'm not sure how likely it is that a batch would span flow cell or basecaller versions. Long term, let's do the hybrid option. One good option would be to add a new column to sample.tsv with this info, but that might require edits to its parsing and generation code. I'll keep a note of that for the snakemake utility library. You would have file path columns and metadata columns.
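To make the "file path columns and metadata columns" idea concrete, here is a rough sketch of how such a sample sheet could be read. The column names (`identifier`, `long`, `medaka_model`) and the `read_samples` helper are hypothetical and are not taken from rotary's actual parsing code.

```python
import csv
import io


def read_samples(handle, metadata_columns=("medaka_model",)):
    """Split each TSV row into file path columns and metadata columns.

    Assumes a hypothetical 'identifier' column naming each sample; every
    other column is treated as a file path unless it is listed in
    metadata_columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    samples = {}
    for row in reader:
        identifier = row["identifier"]
        # Keep only metadata values that are present and non-empty.
        metadata = {k: row[k] for k in metadata_columns if row.get(k)}
        # Everything else (non-empty) is treated as a file path column.
        paths = {
            k: v
            for k, v in row.items()
            if k is not None
            and k != "identifier"
            and k not in metadata_columns
            and v
        }
        samples[identifier] = {"paths": paths, "metadata": metadata}
    return samples


# Hypothetical sample sheet with one metadata column.
example = io.StringIO(
    "identifier\tlong\tmedaka_model\n"
    "sample1\tsample1_long.fastq.gz\tr1041_e82_400bps_sup_v4.2.0\n"
    "sample2\tsample2_long.fastq.gz\t\n"
)
samples = read_samples(example)
```

A sample with an empty `medaka_model` cell simply ends up with no metadata entry, so the workflow could fall back to the config value or to auto-detection for that sample.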

jmtsuji commented 8 months ago

OK, this sounds good as a roadmap. In the short term, let's just parse the model from the first FastQ file. Long term, adding the medaka model to the samples.tsv file as per-sample metadata could work well. I agree, it would be nice to keep this kind of future functionality in mind when you are writing the utility library.
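For the short-term option, a minimal sketch of reading the model from the first FastQ record might look like the following. It assumes the header carries a `basecall_model_version_id=<model>` key-value pair; the exact tag may differ between basecaller versions, and `parse_basecall_model` is a hypothetical helper name. Converting the basecaller model name into a medaka model name is not handled here.

```python
import gzip
import re
from typing import Optional

# Assumed header tag; the exact key may differ between basecaller versions.
MODEL_TAG = re.compile(r"basecall_model_version_id=(\S+)")


def parse_basecall_model(fastq_path: str) -> Optional[str]:
    """Return the basecalling model named in the first FastQ header, or None."""
    # Support both plain and gzip-compressed FastQ files.
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        header = handle.readline()
    match = MODEL_TAG.search(header)
    return match.group(1) if match else None
```

Returning `None` when the tag is missing leaves room for older files, where the model would still need to come from the config.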

LeeBergstrand commented 7 months ago

@jmtsuji https://github.com/nanoporetech/medaka#models

Medaka now has built-in support for reading the basecaller model from data in the FASTQ file. You still need to specify the model on the CLI for older files.

jmtsuji commented 6 months ago

This is excellent news! We might be able to use this in a few different ways:

What are your thoughts about the best approach? The "easy way" could be implemented with minimal edits to the current code, even if we want to aim for the more advanced approach in future.

LeeBergstrand commented 4 months ago

Pursuing the easy way in https://github.com/rotary-genomics/rotary/pull/153

LeeBergstrand commented 4 months ago

Completed.