Support mixed dual/single indexed files in AssignReadGroupByBarcode

magicDGS commented 5 years ago

In https://github.com/magicDGS/ReadTools/issues/509, @robmaz reported that he has some data barcoded with mixed dual/single indexes. He attached FASTQ files with some reads and the full table with the barcoding information (viola-452.txt).

I found that the mixed files are an interesting CASAVA-like formatted FASTQ: to represent a dual-indexed read, they join the two index sequences with a + sign (ReadTools, by default, uses - as the separator, as recommended by the SAM specs). There are several options for handling files formatted this way with the current implementation:

1. Matching each index in the dual-barcoded samples independently: use the advanced option to set a different delimiter (Java property -Dreadtools.barcode_index_delimiter=+), and assign a single N as the second barcode for the single-indexed samples. The N will count as a mismatch (unless --nNoMismatch is specified) and could therefore cause detection problems (e.g., with 0 mismatches allowed, those samples will never be detected).
2. Matching both indexes in the dual-barcoded samples together: in the barcode file, the dual-indexed samples should contain both indexes separated by + in the barcode header (e.g., ACTG+GGTC) and the single-indexed ones only their unique index (e.g., AGGC). This allows tuning the parameters that cause problems in the previous setup, but the dual-indexed samples might also be problematic to detect (e.g., if the first barcode includes an extra base, ACTGT+GGTC, it will have many mismatches against ACTG+GGTC because the extra T is evaluated against the +).
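The trade-off between the two options can be sketched with a small position-wise mismatch counter. This is only an illustration, not the ReadTools matching code: the class name, the method names, and the rule that each extra trailing character counts as one mismatch are assumptions made for the example.

```java
// Hypothetical sketch; not the ReadTools implementation.
class BarcodeMismatchSketch {

    /**
     * Counts position-wise differences between a read barcode and an expected barcode.
     * Assumption: every character beyond the shorter string counts as one mismatch.
     * An N on either side counts as a mismatch only if nIsMismatch is true.
     */
    static int mismatches(String read, String expected, boolean nIsMismatch) {
        int shared = Math.min(read.length(), expected.length());
        int count = Math.abs(read.length() - expected.length());
        for (int i = 0; i < shared; i++) {
            char r = read.charAt(i);
            char e = expected.charAt(i);
            if (r == 'N' || e == 'N') {
                if (nIsMismatch) {
                    count++;
                }
            } else if (r != e) {
                count++;
            }
        }
        return count;
    }

    /** Option 1 style: compare each index separately and sum the mismatches. */
    static int perIndexMismatches(String[] read, String[] expected, boolean nIsMismatch) {
        int total = 0;
        for (int i = 0; i < expected.length; i++) {
            total += mismatches(read[i], expected[i], nIsMismatch);
        }
        return total;
    }

    public static void main(String[] args) {
        // Option 2: the joined strings are compared, so the extra T shifts everything after it.
        System.out.println(mismatches("ACTGT+GGTC", "ACTG+GGTC", true)); // 5
        // Option 1: the same read compared index by index.
        System.out.println(perIndexMismatches(
                new String[] {"ACTGT", "GGTC"}, new String[] {"ACTG", "GGTC"}, true)); // 1
        // Option 1 with an N placeholder: one mismatch unless N is ignored.
        System.out.println(mismatches("A", "N", true));  // 1
        System.out.println(mismatches("A", "N", false)); // 0
    }
}
```

Under these assumptions the joined comparison reports 5 mismatches for the shifted barcode while the per-index comparison reports 1, and the N placeholder alone already uses up the mismatch budget when 0 mismatches are allowed.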

As both approaches have their drawbacks, it would be nice to have a way to support mixed dual/single indexed files. This could be done as a new feature after refactoring the barcode-detection classes (https://github.com/magicDGS/ReadTools/issues/113). It also shows that we should add an argument to support other delimiters in the barcode and convert them to the standard one (while keeping the advanced Java property to set which one is the standard, as the SAM specs only recommend the hyphen as separator).

robmaz commented 5 years ago

I tried that, but it gives an unexpected error again:

UNEXPECTED ERROR: Dangling meta character '+' near index 0

I am attaching the --verbosity DEBUG output.

demultiplex-394.txt

magicDGS commented 5 years ago

Can you please try with -Dreadtools.barcode_index_delimiter='\\+'? The barcode delimiter is a Java pattern (I should probably change that in the docs, but that's why it is an advanced parameter). If it works, please report here, as I will need to update the docs...
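The "Dangling meta character" message reported above is what java.util.regex produces when a bare + is compiled as a pattern, because + is a quantifier with nothing before it to repeat. The following is a minimal sketch using only the standard library; the class and method names are hypothetical and this is not how ReadTools itself reads the property.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch; not the ReadTools implementation.
class DelimiterPatternSketch {

    /** Returns true if the given string is a syntactically valid Java regular expression. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // For "+" the message starts with: Dangling meta character '+' near index 0
            return false;
        }
    }

    /** Splits a barcode on a delimiter that is treated as literal text, not as a regex. */
    static String[] splitLiteral(String barcode, String delimiter) {
        return barcode.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(compiles("+"));                          // false
        System.out.println(splitLiteral("ACTG+GGTC", "+").length);  // 2
    }
}
```

Pattern.quote is one way for a tool to accept a plain delimiter; when the value is interpreted as a pattern, as here, the user has to escape the + instead.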

robmaz commented 5 years ago

Calling as

java -Xmx8g -Dreadtools.barcode_index_delimiter='\+' -jar ~/.linuxbrew/Cellar/readtools/1.5.0/libexec/ReadTools.jar AssignReadGroupByBarcode --splitSample --barcodeFile Info/barcodes_394.txt --maximumMismatches 1 --output 21NewData/Pool-394 --input 00RawData/Pool_394a_180719_X514_FCHLH2VCCXY_L4_CHKPEI85218060227_1.fq.gz --input2 00RawData/Pool_394a_180719_X514_FCHLH2VCCXY_L4_CHKPEI85218060227_2.fq.gz --forceOverwrite true --verbosity DEBUG

now yields

... 15:38:52.376 INFO AssignReadGroupByBarcode - Barcode sequence (BC) separator: '\+' ... A USER ERROR has occurred: Unknown file is malformed: Barcode dictionary has 2 indexes, but read contains 1 barcodes. Failing read: E00514:354:HLH2VCCXY:4:1101:21968:1784 (CCCCCCCC)

I think that means now it does not recognize the two barcodes. What do you want to see in this INFO message? Before it said

INFO AssignReadGroupByBarcode - Barcode sequence (BC) separator: '+'

and recognized two barcodes. Using a single slash does not work either. Is '+' a special character?

magicDGS commented 5 years ago

The log shows that the regexp is picked up correctly from the CLI. Did you have a look at the failing read ("E00514:354:HLH2VCCXY:4:1101:21968:1784")? What does the header line look like (it wasn't in the small dataset that you provided before)?

magicDGS commented 5 years ago

With the small dataset that you provided previously and \\+ it works on my computer; something is odd with the file that you are using. Maybe the single-indexed reads do not have two sequences separated by +, but only one; in that case, option 2 is the only one that will work until a differing number of indexes is supported.

robmaz commented 5 years ago

It's the first read in the files and looks like

@E00514:354:HLH2VCCXY:4:1101:21968:1784 1:N:0:CCCCCCCC+CCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTCCCCCCCCCCCC
+
-A<F<F<F-AAJ<FA-JJ-<FAF7<F-7--7FJJF--A-AF--7AAA-F-AFJJJAJJJAFFJ-<F--7A<77AJ<<FF<7<J77AJ<F-AFAAFFJFJ<<F<-7FFJFJ<JFFAA<JF<J7AAFFFF<A<-<A7-F-77<A-<J7A<7F

in _1 and

@E00514:354:HLH2VCCXY:4:1101:21968:1784 2:N:0:CCCCCCCC+CCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCACCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCC
+
A-AAFJJJJ<<AJAJ-FJFFJ<FFJJJFJFA<FJFFFFJ--7AFJ-7F-FJ7J7AFJAJJ<<JF<FFJFFJJF--77FJAA-7-A7AFA-AA<7<AAAJ-AAF7AFAA-7<AFJJJ-<F-<-AF-7--A-<FA)77F<--<-AJFFA<J<

in _2


magicDGS commented 5 years ago

Ok, I see the problem now: the Casava 1.8 format only allows ACTG in the barcode (see http://illumina.bioinfo.ucr.edu/ht/documentation/data-analysis-docs/CASAVA-FASTQ.pdf), and my regexp discards everything after [ACTGN]+ (I included the N because it appears in practice). Although there shouldn't be anything after [ACTGN]+, this allowed for extra comments after the name and also for the newer formats that use a number such as 1 in that position instead (see https://en.wikipedia.org/wiki/FASTQ_format).

To allow ReadTools to handle your data, the only way is to add a less restrictive regexp that captures everything in the "barcode" part of the read name until the next whitespace (including tab). This will allow your file to be treated as barcoded with "CCCCCCCC+CCCCCCCC" and, if the Java property is used, as dual-indexed with "CCCCCCCC" + "CCCCCCCC".

I already have a fix, which should be tested before it gets in; it can be included from next week onward (I'll do a point release for that). On the other hand, maybe we should rethink allowing mixed single/dual indexing, because it looks like all the barcodes in your FASTQ will have the bc1+bc2 structure, and then: which one should be matched to the index used in sequencing?
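The difference between the restrictive and the less restrictive capture can be sketched as follows. Both patterns are hypothetical stand-ins written for this example (the actual ReadTools regexps are not shown in this thread); only the [ACTGN]+ fragment and the "everything until the next whitespace" rule come from the description above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the patterns are assumptions, not the ReadTools regexps.
class CasavaBarcodeSketch {

    // Captures only a leading run of A, C, T, G or N in the barcode field.
    static final Pattern RESTRICTIVE =
            Pattern.compile("^\\S+\\s+[12]:[YN]:\\d+:([ACTGN]+)");

    // Captures everything in the barcode field up to the next whitespace.
    static final Pattern PERMISSIVE =
            Pattern.compile("^\\S+\\s+[12]:[YN]:\\d+:(\\S+)");

    /** Returns the captured barcode, or null if the header does not match. */
    static String barcode(Pattern pattern, String header) {
        Matcher m = pattern.matcher(header);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String header = "@E00514:354:HLH2VCCXY:4:1101:21968:1784 1:N:0:CCCCCCCC+CCCCCCCC";
        System.out.println(barcode(RESTRICTIVE, header)); // CCCCCCCC
        System.out.println(barcode(PERMISSIVE, header));  // CCCCCCCC+CCCCCCCC
    }
}
```

With the restrictive pattern the capture stops at the +, so only one index is left to split, which is consistent with the "read contains 1 barcodes" error reported earlier in the thread.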

magicDGS commented 5 years ago

By the way, your read does not pass the quality checks (PF tag). Maybe it would be worthwhile to add a filter to remove those reads in any standardization...

magicDGS commented 5 years ago

Also, the Java property should be \+. I am so used to writing it in Java code (which needs two backslashes inside a double-quoted string) that I forgot that only one is required on the command line. That should be fixed after I get #519 in and do a patch release next Monday.
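The two escaping levels can be checked with plain java.util.regex behaviour. In this sketch (hypothetical class and method names, not ReadTools code) the Java literal "\\+" holds the two characters \+, which is what the single-quoted '\+' delivers from the shell, whereas the single-quoted '\\+' delivers three characters that, as a pattern, mean "one or more backslashes".

```java
// Hypothetical sketch; not the ReadTools implementation.
class DelimiterEscapingSketch {

    /** Number of indexes obtained when the barcode is split on the given regex. */
    static int indexCount(String barcode, String delimiterRegex) {
        return barcode.split(delimiterRegex).length;
    }

    public static void main(String[] args) {
        String barcode = "CCCCCCCC+CCCCCCCC";

        // Java literal "\\+" is the 2-character regex \+ (a literal plus sign).
        String fromShellSingleBackslash = "\\+";
        // Java literal "\\\\+" is the 3-character regex \\+ (one or more backslashes).
        String fromShellDoubleBackslash = "\\\\+";

        System.out.println(indexCount(barcode, fromShellSingleBackslash)); // 2
        System.out.println(indexCount(barcode, fromShellDoubleBackslash)); // 1
    }
}
```

Since the barcode contains no backslash, the three-character pattern never matches and the barcode is returned unsplit as a single index.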

Thanks for your input @robmaz!

magicDGS / ReadTools

Support mixed dual/single indexed files in AssignReadGroupByBarcode #512