nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Q: [v0.7.2] "alias" ignored when providing a samplesheet? #933

Closed sklages closed 2 months ago

sklages commented 2 months ago

I am obviously doing something wrong, but I don't see the problem :-)

Running a P2 flowcell with a pool "SAMPLE_ID" containing four libs/barcodes (barcode01-04).

Basecalling and demultiplexing are done with the current dorado v0.7.2, mod_bases=5mCG_5hmCG.

Samplesheet I used:

flow_cell_id,kit,sample_id,experiment_id,barcode,alias
PAW49593,SQK-NBD114-96,SAMPLE_ID,240701_PAW49593_SAMPLE_ID_SQK-NBD114,barcode01,ALIAS-LIB1
PAW49593,SQK-NBD114-96,SAMPLE_ID,240701_PAW49593_SAMPLE_ID_SQK-NBD114,barcode02,ALIAS-LIB2
PAW49593,SQK-NBD114-96,SAMPLE_ID,240701_PAW49593_SAMPLE_ID_SQK-NBD114,barcode03,ALIAS-LIB3
PAW49593,SQK-NBD114-96,SAMPLE_ID,240701_PAW49593_SAMPLE_ID_SQK-NBD114,barcode04,ALIAS-LIB4

This results in the following BAM header, where no alias shows up; instead the read groups are named SQK-NBD114-96_barcode01 to SQK-NBD114-96_barcode04:

@HD VN:1.6  SO:unknown
@PG ID:basecaller   PN:dorado   VN:0.7.2+9ac85c65   CL:dorado basecaller sup,5mCG_5hmCG /dev/shm/mxqd/mnt/job/51421617 --device cuda:all --batchsize 0 --trim all --kit-name SQK-NBD114-96 --barcode-both-ends --sample-sheet /path/to/samplesheet.csv  DS:gpu:NVIDIA A100-PCIE-40GB
@PG ID:samtools PN:samtools PP:basecaller   VN:1.19.2   CL:samtools view -H /path/to/SAMPLE_ID.basecalls-sup.5mCG_5hmCG.demux.u.bam
@RG ID:56608be6739cc3ef561ed04cbc494d33014cce08_dna_r10.4.1_e8.2_400bps_sup@v5.0.0  PU:PAW49593 PM:dory DT:2024-07-01T09:09:17.208+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 modbase_models=dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 runid=56608be6739cc3ef561ed04cbc494d33014cce08 LB:SAMPLE_ID    SM:SAMPLE_ID
@RG ID:56608be6739cc3ef561ed04cbc494d33014cce08_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-96_barcode01  PU:PAW49593 PM:dory DT:2024-07-01T09:09:17.208+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 modbase_models=dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 runid=56608be6739cc3ef561ed04cbc494d33014cce08 LB:SAMPLE_ID    SM:SAMPLE_ID    BC:CACAAAGACACCGACAACTTTCTT
@RG ID:56608be6739cc3ef561ed04cbc494d33014cce08_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-96_barcode02  PU:PAW49593 PM:dory DT:2024-07-01T09:09:17.208+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 modbase_models=dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 runid=56608be6739cc3ef561ed04cbc494d33014cce08 LB:SAMPLE_ID    SM:SAMPLE_ID    BC:ACAGACGACTACAAACGGAATCGA
@RG ID:56608be6739cc3ef561ed04cbc494d33014cce08_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-96_barcode03  PU:PAW49593 PM:dory DT:2024-07-01T09:09:17.208+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 modbase_models=dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 runid=56608be6739cc3ef561ed04cbc494d33014cce08 LB:SAMPLE_ID    SM:SAMPLE_ID    BC:CCTGGTAACTGGGACACAAGACTC
@RG ID:56608be6739cc3ef561ed04cbc494d33014cce08_dna_r10.4.1_e8.2_400bps_sup@v5.0.0_SQK-NBD114-96_barcode04  PU:PAW49593 PM:dory DT:2024-07-01T09:09:17.208+00:00    PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_sup@v5.0.0 modbase_models=dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 runid=56608be6739cc3ef561ed04cbc494d33014cce08 LB:SAMPLE_ID    SM:SAMPLE_ID    BC:TAGGGAAACACGATAGAATCCGAA

The final dorado demux call accordingly writes BAM files not using aliases:

SQK-NBD114-96_barcode01.bam
SQK-NBD114-96_barcode02.bam
SQK-NBD114-96_barcode03.bam
SQK-NBD114-96_barcode04.bam

.. instead of:

ALIAS-LIB1.bam
ALIAS-LIB2.bam
ALIAS-LIB3.bam
ALIAS-LIB4.bam

Using an identical sample_id with distinct barcodes/aliases was at least working in v0.7.0 ...

What am I missing here? Where is my mistake?

malton-ont commented 2 months ago

Hi @sklages,

When run via the basecaller, sample sheets only apply aliases to the specific experiment they are set up for. I would guess that this data has a different experiment id to the one in the sample sheet. You can check this value by running:

pod5 inspect debug /dev/shm/mxqd/mnt/job/51421617/*.pod5 | grep experiment_name

sklages commented 2 months ago

Indeed, .. the rundata folder (aka experiment_id) has been renamed after the run has finished .. because of a typo.

Maybe, as a feature request, there should be a warning or even an error about mismatching data (here, the experiment_id from the samplesheet versus the experiment_id actually found in the pod5 data) before basecalling starts?

thanks for the hint though :-)
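The kind of pre-flight check suggested above could be sketched roughly as follows. This is not dorado's own logic, only an illustration: the helper name `mismatched_experiment_ids` is hypothetical, and the experiment name is assumed to have been read by hand from the `pod5 inspect debug ... | grep experiment_name` output mentioned earlier in the thread.

```python
import csv
import io


def mismatched_experiment_ids(sheet_text, experiment_name):
    """Return sample sheet experiment_id values that differ from experiment_name.

    Hypothetical helper: experiment_name is whatever the pod5 data reports.
    """
    reader = csv.DictReader(io.StringIO(sheet_text))
    # Collect every distinct experiment_id found in the sheet.
    found = {(row.get("experiment_id") or "").strip() for row in reader}
    return sorted(found - {experiment_name})


sheet = """flow_cell_id,kit,sample_id,experiment_id,barcode,alias
PAW49593,SQK-NBD114-96,SAMPLE_ID,240701_PAW49593_SAMPLE_ID_SQK-NBD114,barcode01,ALIAS-LIB1
PAW49593,SQK-NBD114-96,SAMPLE_ID,240701_PAW49593_SAMPLE_ID_SQK-NBD114,barcode02,ALIAS-LIB2
"""

# "renamed_run" is a made-up value standing in for the name stored in the pod5.
print(mismatched_experiment_ids(sheet, "renamed_run"))
```

An empty list means every row in the sheet names the same experiment as the data; anything else is worth checking before starting a long basecalling job.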

phpeters commented 2 months ago

Hej,

I have a similar issue. I used to have the aliasID in my bam- and summary files, now I get barcode-IDs such as BC:Z:SQK-NBD114-96_barcode02. I'm using the SQK-NBD114-96 kit, dorado 0.7.2 with the commands mentioned below.

In my samplesheet: experiment_id = 123456

In my pod5-files (pod5 inspect debug pod5/FLOWCELL_pass_barcode02_id_id2_0.pod5 | grep experiment_name): experiment_name: 123456

When applying demultiplexing (no-classify, see below), the files are properly named by the alias.

Is this part of the "updates to barcode classification" mentioned in the changelog? Thanks and all the best! Philipp

# dorado
# basecaller
dorado basecaller -v  dna_r10.4.1_e8.2_400bps_sup@v5.0.0 pod5/  --sample-sheet mySampleSheet.csv --kit-name SQK-NBD114-96  > 123456.bam
# demux
dorado demux  -t 16 --output-dir demux_123456 -v  --no-classify --sample-sheet mySampleSheet.csv  --emit-fastq  123456.bam

malton-ont commented 2 months ago

Hi @phpeters,

No, the updates are regarding the way we choose which classification to select, not in the naming.

Can you also check flowcell_id and sequencer_position in the grep above? These need to match your flow_cell_id and/or position_id columns (only one or the other needs to be present).

(For clarity, these columns need to match when using the sample sheet during basecalling. When used with the demux command much of this data is not available, so in this case we simply require that there is a unique mapping from barcode name to alias name. This probably explains why you get aliases during demux.)
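The demux-side requirement described above (a unique mapping from barcode name to alias name) can be checked on a sample sheet with a few lines of Python. This is a minimal sketch under one reading of that rule, namely that each barcode must resolve to exactly one alias. It is not taken from dorado's source; the helper name `ambiguous_barcodes` and the sheet contents are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def ambiguous_barcodes(sheet_text):
    """Return barcodes that map to more than one alias (hypothetical helper)."""
    aliases = defaultdict(set)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        aliases[row["barcode"]].add(row["alias"])
    return {bc: sorted(names) for bc, names in aliases.items() if len(names) > 1}


# Made-up sheet covering two runs that reuse barcode01 with different aliases.
sheet = """experiment_id,kit,flow_cell_id,barcode,alias
run_a,SQK-NBD114-96,FC1,barcode01,ALIAS-LIB1
run_b,SQK-NBD114-96,FC2,barcode01,OTHER-LIB
run_a,SQK-NBD114-96,FC1,barcode02,ALIAS-LIB2
"""

print(ambiguous_barcodes(sheet))
```

An empty result means every barcode resolves to a single alias, which is the situation in which demux can name its output files by alias without needing the run metadata.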

phpeters commented 2 months ago

Hej @malton-ont ,

Thanks for the clarification! I checked and the sequencer_position in the pod5 is equal to the position_id in the sampleSheet. (flowcell_id is not present)

Best! Philipp

malton-ont commented 2 months ago

@phpeters,

Are you able to share a read that exhibits this, and the sample sheet to match?

phpeters commented 2 months ago

@malton-ont I made a subset, can I upload it to a box somewhere? It is a client's data.

malton-ont commented 2 months ago

@phpeters,

Are you able to open a support ticket? Technical services can then give you a link to upload it. Make sure you ask them to direct it to me!

phpeters commented 2 months ago

@malton-ont I shared the small subset in a support ticket and asked to forward it to you. Do you need/want the ticket ID?

malton-ont commented 2 months ago

@phpeters,

Yes please, then I can chase it up.

phpeters commented 2 months ago

@malton-ont you have mail

malton-ont commented 2 months ago

@phpeters,

Ah-ha! Apologies, the correct grep was flow_cell_id - checking this shows that this value is present in the pod5 but it's blank in the sample sheet. You can either add the correct value to the sample sheet or remove the column entirely (sample sheets need at least one, but not necessarily both, of flow_cell_id and position_id).

Note that the debug info that is printed regarding the Barcode distribution does not apply the aliasing, but this is applied for the BC:Z tag and read groups (and the filename when demuxing).
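The failure mode found here, an identifier column that exists in the sample sheet but is left blank, is easy to miss by eye. A rough sketch of a check for it follows, assuming the column names flow_cell_id and position_id; the function `check_id_columns` and the example values are hypothetical and this is not how dorado itself validates the sheet.

```python
import csv
import io

ID_COLUMNS = ("flow_cell_id", "position_id")


def check_id_columns(sheet_text):
    """Return a list of human-readable problems with the identifier columns."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    header = reader.fieldnames or []
    rows = list(reader)
    present = [col for col in ID_COLUMNS if col in header]
    problems = []
    if not present:
        problems.append("need at least one of flow_cell_id or position_id")
    for col in present:
        # A column that is present but empty will not match the pod5 value.
        if any(not (row[col] or "").strip() for row in rows):
            problems.append(f"column {col} is present but blank in some rows")
    return problems


# Made-up sheet: flow_cell_id column exists but was never filled in.
sheet = """flow_cell_id,position_id,experiment_id,kit,barcode,alias
,MN12345,123456,SQK-NBD114-96,barcode02,LIB2
"""

print(check_id_columns(sheet))
```

Either fix reported in the thread (filling in the value or dropping the column) would make this check return an empty list.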

phpeters commented 2 months ago

ah-haaaa! Indeed, without the column flow_cell_id in the sample sheet it worked out just fine. Thanks a ton!

But this is the original sample sheet I get from our MinION (only extended by the columns barcode,alias). The PromethION puts the flow cell ID properly into the column, the MinION's MinKNOW doesn't do this. The MinKNOW version for the MinION is 23.11.7, on the PromethION it's 24.02.19 - maybe it's that?

Thanks again! Philipp

phpeters commented 2 months ago

Aaaaaaah-haaaaaa! I checked previous runs on the same machine (with the same MinKNOW version) and for them, the column flow_cell_id was filled in in the sample sheet. But those were MIN114 flow cells, this time it was a FLG114. And I just learned from the lab that the flow cell ID is put in manually for Flongle FCs whereas it is automagically detected for MIN FCs. My mystery is solved, sorry for bothering you. Thanks again and have a great weekend! Philipp

malton-ont commented 2 months ago

That's great @phpeters, glad the mystery is solved! And thanks for your help investigating.