Open warthmann opened 1 month ago
If you have pod5 input data you should be able to view the sequencing kit and flowcell code with:
pod5 inspect debug <data.pod5> | grep -E "flow_cell_product_code|sequencing_kit"
Dorado will auto select models for your chemistry if you provide a model complex as the model argument and the data has the required metadata to resolve the chemistry.
dorado basecaller hac data/ > calls.bam
Kind regards, Rich
No pod5. It is fast5 files from 2019 with v9 chemistry, possibly barcoded with either the rapid or the native, but we don't know. In the past, I have demultiplexed for both and checked for the more plausible output. I am now hoping there exists a less pedestrian diagnostic tool.
You could convert one of the fast5 files to pod5 and run the above command to get the answer (assuming your dataset is from the same run).
Or convert them all and get better basecalling performance too.
I am now hoping there exists a less pedestrian diagnostic tool.
If your data is in pod5 format and is from a supported chemistry then dorado will do the model selection for you, but I think that only partly addresses your question.
Does this reworded feature request define what you want?
Given some input data can dorado;
Best regards, Rich
Great! Thank you for your assistance! YES! Your rephrased feature request is spot on.
Meanwhile: as suggested, I converted all fast5 files to pod5 and ran the 'pod5 inspect debug' command above on one of them. This is the output:
context_tags: {'experiment_duration_set': '2880', 'experiment_type': 'genomic_dna', 'fast5_output_fastq_in_hdf': '1', 'fast5_raw': '1', 'fast5_reads_per_folder': '4000', 'fastq_enabled': '1', 'fastq_reads_per_file': '4000', 'filename': 'pbgl_linux_1_20190314_faj07153_mn25910_sequencing_run_sorg1_61692', 'flowcell_type': 'flo-min106', 'kit_classification': 'none', 'local_basecalling': '0', 'sample_frequency': '4000', 'sequencing_kit': 'sqk-lsk109', 'user_filename_input': 'sorg1'}
flow_cell_product_code:
sequencing_kit: sqk-lsk109
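For anyone scripting this check across many runs, here is a minimal Python sketch that pulls the two fields out of text shaped like the output above. The function name `parse_inspect_fields` and the trimmed `sample` string are made up for illustration; the sketch assumes the plain `key: value` layout shown in this thread and is not part of pod5 or dorado.

```python
def parse_inspect_fields(text, wanted=("flow_cell_product_code", "sequencing_kit")):
    """Return {field: value} for top-level 'key: value' lines named in `wanted`."""
    found = {}
    for line in text.splitlines():
        # Split on the first colon only, so the nested context_tags dict is
        # treated as a single value and its inner keys are not picked up.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
    return found


# Trimmed copy of the output posted above (context_tags shortened).
sample = """\
context_tags: {'flowcell_type': 'flo-min106', 'sequencing_kit': 'sqk-lsk109'}
flow_cell_product_code:
sequencing_kit: sqk-lsk109
"""

fields = parse_inspect_fields(sample)
# Fields that are present but empty are the ones that block auto selection.
missing = [name for name, value in fields.items() if not value]
print(fields)
print("missing:", missing)
```

An empty `flow_cell_product_code` shows up in `missing`, which makes the gap easy to spot in a batch of files.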
I take it that no mention of a barcode kit means that none was used?
On a similar note: I have basecalled with dorado from fast5 as well as from pod5, and only from pod5 do I get the following warning:
[warning] Could not determine sequencing Chemistry from read data - some features might be disabled
Wouldn't chemistry follow from the sequencing kit that was used and which is indeed recorded in the pod5 files? Also, I am giving an explicit model, which, admittedly, was a guess.
It looks like the flow_cell_product_code hasn't been populated during the conversion, likely because the fast5 files are quite old. This means that dorado can't auto-select a model for you using the model complex (because we need both the flow cell code and the sequencing kit), so you'll need to download a model and use its path (as you've already done).
We can see from the context tags that 'flowcell_type': 'flo-min106' and the sequencing_kit: sqk-lsk109. Therefore, this data is dna_r9.4.1_e8 and sampled at 4kHz.
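The same reasoning can be sketched in Python as a fallback for old files where only the context tags survive. The lookup table below is purely illustrative: it contains only the one flow cell and kit pairing mentioned in this thread and is not dorado's internal chemistry table. The function name `chemistry_from_context_tags` is likewise hypothetical.

```python
import ast

# Illustrative lookup only: the single pairing stated in this thread,
# NOT dorado's internal chemistry table.
CHEMISTRY_BY_FLOWCELL_AND_KIT = {
    ("flo-min106", "sqk-lsk109"): "dna_r9.4.1_e8",
}


def chemistry_from_context_tags(line):
    """Parse a 'context_tags: {...}' line; return (chemistry or None, sample_frequency)."""
    _, _, payload = line.partition(":")
    # The payload is printed as a Python dict literal, so literal_eval can read it safely.
    tags = ast.literal_eval(payload.strip())
    key = (
        tags.get("flowcell_type", "").lower(),
        tags.get("sequencing_kit", "").lower(),
    )
    return CHEMISTRY_BY_FLOWCELL_AND_KIT.get(key), tags.get("sample_frequency")


# Trimmed copy of the context_tags line posted earlier in the thread.
line = (
    "context_tags: {'flowcell_type': 'flo-min106', "
    "'sample_frequency': '4000', 'sequencing_kit': 'sqk-lsk109'}"
)
result = chemistry_from_context_tags(line)
print(result)
```

A pairing that is not in the table returns `None` for the chemistry, so an unknown combination is reported rather than guessed.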
You can use any of these models:
> dorado download --list
...
[2024-09-18 13:40:53.707] [info] > simplex models
[2024-09-18 13:40:53.707] [info] - dna_r9.4.1_e8_fast@v3.4
[2024-09-18 13:40:53.707] [info] - dna_r9.4.1_e8_hac@v3.3
[2024-09-18 13:40:53.707] [info] - dna_r9.4.1_e8_sup@v3.3
[2024-09-18 13:40:53.707] [info] - dna_r9.4.1_e8_sup@v3.6
...
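If you want to pick from that listing in a script, the following Python sketch filters the log lines for one chemistry and keeps the newest version per speed tier. It assumes the `<chemistry>_<fast|hac|sup>@v<version>` naming visible in the excerpt above; the helper name `newest_models` is hypothetical, and the `listing` string is just the excerpt copied from this post.

```python
import re

# Matches lines such as "... [info] - dna_r9.4.1_e8_sup@v3.6".
MODEL_RE = re.compile(
    r"-\s+(?P<chem>\S+)_(?P<tier>fast|hac|sup)@v(?P<ver>\d+(?:\.\d+)*)\s*$"
)


def newest_models(listing, chemistry):
    """Return {tier: model_name} with the highest version per tier for `chemistry`."""
    best = {}
    for line in listing.splitlines():
        match = MODEL_RE.search(line)
        if not match or match.group("chem") != chemistry:
            continue
        # Compare versions numerically, so that 3.10 would rank above 3.6.
        version = tuple(int(part) for part in match.group("ver").split("."))
        tier = match.group("tier")
        if tier not in best or version > best[tier][0]:
            best[tier] = (version, f"{chemistry}_{tier}@v{match.group('ver')}")
    return {tier: name for tier, (_, name) in best.items()}


listing = """\
[2024-09-18 13:40:53.707] [info] > simplex models
[2024-09-18 13:40:53.707] [info] - dna_r9.4.1_e8_fast@v3.4
[2024-09-18 13:40:53.707] [info] - dna_r9.4.1_e8_hac@v3.3
[2024-09-18 13:40:53.707] [info] - dna_r9.4.1_e8_sup@v3.3
[2024-09-18 13:40:53.707] [info] - dna_r9.4.1_e8_sup@v3.6
"""
models = newest_models(listing, "dna_r9.4.1_e8")
print(models)
```

Newest is not automatically best for a given dataset, so treat the result as a starting point.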
As for the barcoding kit, these are all the barcoding kits that are registered with dorado for this chemistry:
SQK_16S024
SQK_MLK111_96_XL
SQK_NBD111_24
SQK_NBD111_96
SQK_PCB109
SQK_PCB110
SQK_PCB111_24
SQK_RBK001
SQK_RBK004
SQK_RBK110_96
SQK_RBK111_24
SQK_RBK111_96
SQK_RLB001
SQK_LWB001
SQK_PBK004
SQK_RAB201
SQK_RAB204
SQK_RPB004
VSK_PTC001
VSK_VMK001
VSK_VPS001
VSK_VMK004
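Since the open question is rapid versus native barcoding, the list can be narrowed before trying any demultiplexing. The Python sketch below filters the names above by family code. It assumes that `RBK` marks rapid barcoding kits and `NBD` marks native barcoding kits, which is a reading of the kit naming rather than something dorado reports; `kits_in_family` is a hypothetical helper.

```python
# Kit names copied from the list above.
KITS = """
SQK_16S024 SQK_MLK111_96_XL SQK_NBD111_24 SQK_NBD111_96 SQK_PCB109 SQK_PCB110
SQK_PCB111_24 SQK_RBK001 SQK_RBK004 SQK_RBK110_96 SQK_RBK111_24 SQK_RBK111_96
SQK_RLB001 SQK_LWB001 SQK_PBK004 SQK_RAB201 SQK_RAB204 SQK_RPB004
VSK_PTC001 VSK_VMK001 VSK_VPS001 VSK_VMK004
""".split()


def kits_in_family(kits, family_code):
    """Keep kits whose second name segment starts with `family_code` (e.g. 'RBK')."""
    selected = []
    for kit in kits:
        segments = kit.split("_")
        if len(segments) > 1 and segments[1].startswith(family_code):
            selected.append(kit)
    return selected


# Assumption: RBK = rapid barcoding, NBD = native barcoding.
rapid = kits_in_family(KITS, "RBK")
native = kits_in_family(KITS, "NBD")
print("rapid candidates:", rapid)
print("native candidates:", native)
```

This only shortens the list of kits to test; it does not tell you which kit, if any, was actually used.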
I hope this helps.
As for the feature request - Great! I'll discuss this with the team and get back to you.
Best regards, Rich
Hello, this is not an issue, but possibly a feature request if it doesn't already exist. We have ONT legacy data that we would like to analyse, but have no record of whether the reads are barcoded and, if so, which kit was used. Does Dorado provide a tool to quickly find out? Similarly for the model: it would be great if dorado could suggest which chemistry was used and which models are therefore appropriate.
thanks a lot
best Norman