Is it possible to set demux to use only barcodes present in sample sheet ?

olawa commented 5 months ago

I am trying demux with a sample sheet and get the following error when only the barcodes in use are included:

what(): Row in sample sheet file samplesheet.txt has incorrect number of entries

It took a while to figure out that all 12 barcodes need to be present in the sheet. Perhaps you could add a few example sheets to the repo.

What I would like to be able to do is:

demux to use the kit specified in sample sheet
only classify reads present in the run (ideally with a tag specifying if it was found on both ends)
write reads with more than one of the specified barcodes to a separate file (unless demux can split it on internal adaptors) to be able to identify ligation chimeras
option to only split on barcode (if more than one flowcell was used)

Is any of this possible with the current demux?

tijyojwad commented 5 months ago

Hi @olawa - what version of dorado are you using?

what(): Row in sample sheet file samplesheet.txt has incorrect number of entries

Hmm, I don't think this is the intended behavior. You should be able to specify only the desired barcodes. We will look into this.

demux to use the kit specified in sample sheet

Yes, we can look into this.

only classify reads present in the run

I'm not sure I understand this... If you're finding read ids that aren't in the input pod5 that's due to read splitting. The parent read ids will be in the pi:Z tag of that read. If you want to filter on double ended barcode hits, you can also run --barcode-both-ends.
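As a rough illustration of the pi:Z lookup mentioned above, here is a minimal sketch that reads one text SAM record and returns the parent read id when the tag is present. The function name and the example record are hypothetical, and it assumes the BAM has already been converted to SAM text (for example with samtools view).

```python
def parent_read_id(sam_line):
    """Return the pi:Z value of a SAM record, or the read's own name if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    read_name = fields[0]
    # The 11 mandatory SAM columns come first; optional tags start at column 12.
    for tag in fields[11:]:
        if tag.startswith("pi:Z:"):
            return tag[len("pi:Z:"):]
    return read_name


# Hypothetical unaligned record for a split read (ids are made up).
split_read = "\t".join(
    ["child-0001", "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "IIII", "pi:Z:parent-0001"]
)
print(parent_read_id(split_read))
```

Reads that were never split carry no pi:Z tag, so the sketch falls back to the read's own name.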

write reads with more than one of the specified barcodes to a separate file

Reads are split within dorado, which should catch most cases. Currently, if two different barcodes are detected at the two ends of a read, we treat the read as unclassified.

option to only split on barcode (if more than one flowcell was used)

dorado demux does exactly this, right? It'll output a BAM file per barcode. You can combine pod5s/BAMs from multiple runs and give them to dorado. From 0.6.0 onwards you can also give dorado demux a folder with multiple BAMs in it.

malton-ont commented 5 months ago

Hi @olawa,

That error indicates that one or more of the rows had a different number of entries from the number of column headings. Sample sheets should absolutely work with only a subset of the barcodes from the kit. Note that a sample sheet should be defined with comma-separated values, and empty columns must still be included. See the documentation here for more information.
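For anyone hitting the same error, a quick pre-check along these lines may help locate the offending row. This is a minimal sketch, not part of dorado: the function name is hypothetical, and the column names and values in the example sheet are placeholders chosen only for illustration.

```python
import csv
import io


def find_ragged_rows(sheet_text):
    """Return (line_number, n_entries, n_expected) for rows whose entry count differs from the header's."""
    reader = csv.reader(io.StringIO(sheet_text))
    header = next(reader)
    problems = []
    for row in reader:
        if not row:  # ignore completely blank lines
            continue
        if len(row) != len(header):
            problems.append((reader.line_num, len(row), len(header)))
    return problems


# Placeholder sheet: the third data row drops a column instead of leaving it empty.
example = (
    "kit,experiment_id,flow_cell_id,alias,barcode\n"
    "KIT-X,exp1,FC1,sample_a,barcode01\n"
    "KIT-X,exp1,,sample_b,barcode02\n"
    "KIT-X,exp1,sample_c,barcode03\n"
)
print(find_ragged_rows(example))
```

A trailing comma in the header counts as one extra (empty) column heading, so every data row without a matching trailing comma would be reported.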

olawa commented 5 months ago

Hi @malton-ont @tijyojwad, thanks for the clarification. I got it to work as intended with dorado 0.6 now; the cause could have been an extra comma in the header from converting tabs.

only classify reads present in the run

I meant barcodes. I am sequencing short reads with PBC96; they are amplified, so they should have barcodes on both ends. --barcode-both-ends gives a much lower classification rate. I am guessing it could be improved if one could exclude only reads where two different barcodes from the list are found, assuming the issue is either poor basecalling at the ends or chimeric ligation products.
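To make the rule I have in mind concrete, here is a small sketch of the decision. It is purely hypothetical and does not describe what dorado does; it assumes per-end barcode calls are available as two values (front and rear), which is my own simplification.

```python
def assign_barcode(front, rear):
    """Pick a label from the barcode found at each end (None means no barcode found at that end)."""
    if front and rear:
        # Both ends called: accept only if they agree, otherwise flag the read.
        return front if front == rear else "mixed"
    # At most one end called: accept a single-end hit, else leave unclassified.
    return front or rear or "unclassified"


print(assign_barcode("barcode01", "barcode01"))
print(assign_barcode("barcode01", None))
print(assign_barcode("barcode01", "barcode07"))
```

Only the "mixed" reads would be set aside as candidate chimeras; reads with a barcode on one end would still be classified.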

Is the dorado split during basecalling supposed to split on internal barcodes/primers, or is it just able to split MinKNOW chimeras?

One example here where a female sample has alignment to chrY from what appears to be a ligation concatemer. If I have to use guppy (or perhaps pychopper) to split on internal primers that is fine, but I can't find any documentation on it.

tijyojwad commented 5 months ago

Is the dorado split during basecalling supposed to split on internal barcodes/primers, or is it just able to split MinKNOW chimeras?

It doesn't split on internal barcodes, just on sequencing adapters. So it'll catch chimeras, but not ligation concatemers.

I believe guppy did have an option to split on internal barcodes. We can look into adding that to dorado in a subsequent release.

nanoporetech / dorado

Is it possible to set demux to use only barcodes present in sample sheet ? #756