Open magicDGS opened 5 years ago
I tried that, but it gives an unexpected error again:
UNEXPECTED ERROR: Dangling meta character '+' near index 0
I am attaching the --verbosity DEBUG output.
Can you please try with -Dreadtools.barcode_index_delimiter='\\+'? The barcode delimiter is a Java pattern (I should probably state that in the docs, but that's why it is an advanced param). If it works, please report here, as I will need to update the docs...
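To illustrate why a bare + fails when the delimiter is interpreted as a Java pattern, here is a minimal sketch. The class and method names are made up for this example and are not part of ReadTools; only java.util.regex from the standard library is used.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class DelimiterPatternDemo {
    /** Splits a barcode string on a delimiter given as a Java regular expression. */
    static String[] splitBarcodes(String barcode, String delimiterRegex) {
        return Pattern.compile(delimiterRegex).split(barcode);
    }

    /** Returns true if the delimiter compiles as a Java regular expression. */
    static boolean isValidPattern(String delimiterRegex) {
        try {
            Pattern.compile(delimiterRegex);
            return true;
        } catch (PatternSyntaxException e) {
            // A bare "+" is a quantifier with nothing to repeat ("dangling meta character").
            return false;
        }
    }

    public static void main(String[] args) {
        String barcode = "CCCCCCCC+CCCCCCCC";
        // A bare plus sign is not a valid pattern.
        System.out.println(isValidPattern("+"));
        // In Java source, "\\+" is the two-character regex \+ (a literal plus sign).
        System.out.println(Arrays.toString(splitBarcodes(barcode, "\\+")));
        // Pattern.quote builds an equivalent literal pattern without manual escaping.
        System.out.println(Arrays.toString(splitBarcodes(barcode, Pattern.quote("+"))));
    }
}
```

Pattern.quote is an alternative when the delimiter is meant to be taken literally; whether ReadTools should quote the property value internally is a design question for the maintainer, not something this sketch settles.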
Calling as
java -Xmx8g -Dreadtools.barcode_index_delimiter='\+' -jar ~/.linuxbrew/Cellar/readtools/1.5.0/libexec/ReadTools.jar AssignReadGroupByBarcode --splitSample --barcodeFile Info/barcodes_394.txt --maximumMismatches 1 --output 21NewData/Pool-394 --input 00RawData/Pool_394a_180719_X514_FCHLH2VCCXY_L4_CHKPEI85218060227_1.fq.gz --input2 00RawData/Pool_394a_180719_X514_FCHLH2VCCXY_L4_CHKPEI85218060227_2.fq.gz --forceOverwrite true --verbosity DEBUG
now yields
... 15:38:52.376 INFO AssignReadGroupByBarcode - Barcode sequence (BC) separator: '\+' ... A USER ERROR has occurred: Unknown file is malformed: Barcode dictionary has 2 indexes, but read contains 1 barcodes. Failing read: E00514:354:HLH2VCCXY:4:1101:21968:1784 (CCCCCCCC)
I think that means now it does not recognize the two barcodes. What do you want to see in this INFO message? Before it said
INFO AssignReadGroupByBarcode - Barcode sequence (BC) separator: '+'
and recognized two barcodes. Using a single backslash does not work either. Is '+' a special character?
The log shows that the regexp is being picked up correctly from the CLI. Did you have a look at the failing read ("E00514:354:HLH2VCCXY:4:1101:21968:1784")? What does its header line look like (it wasn't in the small dataset that you provided before)?
With the small dataset that you provided previously and \\+ it works on my computer; something is odd with the file that you are using. Maybe the single-index reads do not have two sequences separated by +, but only one; in that case, option 2 is the only one that will work until mixed numbers of indexes are supported.
It's the first read in the files and looks like
@E00514:354:HLH2VCCXY:4:1101:21968:1784 1:N:0:CCCCCCCC+CCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTCCCCCCCCCCCC + -A<F<F<F-AAJ<FA-JJ-<FAF7<F-7--7FJJF--A-AF--7AAA-F-AFJJJAJJJAFFJ-<F--7A<77AJ<<FF<7<J77AJ<F-AFAAFFJFJ<<F<-7FFJFJ<JFFAA<JF<J7AAFFFF<A<-<A7-F-77<A-<J7A<7F
in _1 and
@E00514:354:HLH2VCCXY:4:1101:21968:1784 2:N:0:CCCCCCCC+CCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCACCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCC + A-AAFJJJJ<<AJAJ-FJFFJ<FFJJJFJFA<FJFFFFJ--7AFJ-7F-FJ7J7AFJAJJ<<JF<FFJFFJJF--77FJAA-7-A7AFA-AA<7<AAAJ-AAF7AFAA-7<AFJJJ-<F-<-AF-7--A-<FA)77F<--<-AJFFA<J<
in _2
On Fri, 24 Aug 2018 at 16:43, Daniel Gómez-Sánchez <notifications@github.com> wrote:
The log shows that the regexp is being picked up correctly from the CLI. Did you have a look at the failing read ("E00514:354:HLH2VCCXY:4:1101:21968:1784")? What does its header line look like (it wasn't in the small dataset that you provided before)?
OK, I see the problem now: the Casava 1.8 format only allows ACTG in the barcode (see http://illumina.bioinfo.ucr.edu/ht/documentation/data-analysis-docs/CASAVA-FASTQ.pdf), and my regexp discards everything after [ACTGN]+ (I included the N because it appears in practice). Although there shouldn't be anything after [ACTGN]+, discarding the rest allowed extra comments after the name and also the newer formats that use 1 instead (see https://en.wikipedia.org/wiki/FASTQ_format).
To allow ReadTools to handle your data, the only way is to add a less restrictive regexp that captures everything in the "barcode" part of the read name until the next whitespace (including tab). This will allow your file to be treated as barcoded with "CCCCCCCC+CCCCCCCC" and, if the java property is used, as dual-indexed with "CCCCCCCC" + "CCCCCCCC".
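The difference between the two approaches can be sketched as follows. Both patterns below are hypothetical and written only for this illustration; they are not the regexps used in ReadTools, and the class and method names are made up as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CasavaBarcodeDemo {
    // Hypothetical restrictive pattern: the barcode may only contain A, C, T, G or N.
    static final Pattern RESTRICTIVE =
            Pattern.compile("^@?\\S+\\s+[12]:[YN]:\\d+:([ACTGN]+)");
    // Hypothetical permissive pattern: the barcode runs until the next whitespace.
    static final Pattern PERMISSIVE =
            Pattern.compile("^@?\\S+\\s+[12]:[YN]:\\d+:(\\S+)");

    /** Returns the barcode part of a Casava-style header, or null if it does not match. */
    static String extractBarcode(Pattern pattern, String header) {
        Matcher matcher = pattern.matcher(header);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String header = "@E00514:354:HLH2VCCXY:4:1101:21968:1784 1:N:0:CCCCCCCC+CCCCCCCC";
        // The restrictive pattern stops at the '+', so only one index is seen.
        System.out.println(extractBarcode(RESTRICTIVE, header));
        // The permissive pattern keeps the whole barcode, which can then be split on \+.
        String full = extractBarcode(PERMISSIVE, header);
        System.out.println(full);
        System.out.println(full.split("\\+").length);
    }
}
```

With the restrictive pattern the second index never reaches the delimiter-splitting step, which is consistent with the "read contains 1 barcodes" message reported above.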
I already have a fix, which should be tested before it gets merged; it can be included from next week onward (I'll do a point release for that). On the other hand, maybe we should rethink allowing mixed single/dual indexing, because it looks like all the barcodes in your FASTQ will have the bc1+bc2 structure, and then: which one should be matched to the index actually used in sequencing?
By the way, your read does not pass the quality checks (PF-tag). Maybe it would be worthwhile to add a filter to remove those reads for any standardization...
Also, the java property should be \+. I am so used to writing it in Java code (where the backslash has to be doubled inside a double-quoted string) that I forgot that on the command line only one is required. That should be fixed after I get #519 in and do a patch release next Monday.
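A small sketch of the backslash counting, assuming the property value is used verbatim as a Java pattern (as described earlier in the thread) and that the shell passes single-quoted text through unchanged, as POSIX shells do. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class PropertyEscapingDemo {
    /** Counts the barcodes obtained when the raw property value is used as the split regex. */
    static int countBarcodes(String barcode, String rawPropertyValue) {
        return Pattern.compile(rawPropertyValue).split(barcode).length;
    }

    public static void main(String[] args) {
        String barcode = "CCCCCCCC+CCCCCCCC";
        // Raw value \+ (what '\+' on the command line delivers): a literal plus sign.
        String oneBackslash = "\\+";
        // Raw value \\+ (what '\\+' on the command line delivers): one or more backslashes.
        String twoBackslashes = "\\\\+";
        System.out.println(countBarcodes(barcode, oneBackslash));
        System.out.println(countBarcodes(barcode, twoBackslashes));
    }
}
```

The Java string literals above need twice as many backslashes as the raw values they represent, which is exactly the source of the confusion: the doubling belongs to Java source code, not to the command line.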
Thanks for your input @robmaz!
In https://github.com/magicDGS/ReadTools/issues/509, @robmaz reports that he has some data barcoded with mixed dual/single indexes. He attached the FASTQ files with some reads and the full table with the barcoding information (viola-452.txt).

I found that the mixed files are an interesting kind of CASAVA-like formatted FASTQ: to represent a dual-indexed file, they join the sequences with the + sign (ReadTools, by default, uses - as a separator, as recommended by the SAM specs). To handle files in this format, there are several options with the current implementation:

1. Use + as the index delimiter (-Dreadtools.barcode_index_delimiter=+), and assign a single N as the second barcode for the single-indexed files. This N will count as a mismatch (unless --nNoMismatch is specified) and thus could cause problems with detection (e.g., with 0 mismatches allowed, those samples will never be detected).
2. Treat every barcode as a single index, where the dual-indexed samples include the + in the barcode header (e.g., ACTG+GGTC) and the single-indexed ones only their unique barcode (e.g., AGGC). This allows playing with parameters that can cause problems in the previous setup, but the dual-indexed samples might also be problematic to detect (e.g., if the first barcode includes an extra base, ACTGT+GGTC, it will have lots of mismatches against ACTG+GGTC, as the extra T will be evaluated against the +).

As both approaches have their drawbacks, it would be nice to have a way to support mixed dual/single indexed files. This could be done as a new feature after refactoring the barcode-detection classes (https://github.com/magicDGS/ReadTools/issues/113). It also shows that we should add an argument to support other barcode delimiters and convert them to the standard one (while keeping the advanced java property to set which one is the standard, as the SAM specs only recommend the hyphen as separator).
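The drawbacks of both options can be made concrete with a toy mismatch counter. This is only a sketch under stated assumptions: it compares position by position and adds the length difference, which is not necessarily how ReadTools scores barcodes, and the one-base comparison against N is a simplification. The class and method names are hypothetical; only the nNoMismatch idea is taken from the text.

```java
public class BarcodeMismatchDemo {
    /**
     * Counts position-wise mismatches between an expected and an observed barcode.
     * Any length difference is added to the count (an assumption of this sketch).
     * If nNoMismatch is true, a position with an N on either side is not counted.
     */
    static int mismatches(String expected, String observed, boolean nNoMismatch) {
        int overlap = Math.min(expected.length(), observed.length());
        int count = Math.abs(expected.length() - observed.length());
        for (int i = 0; i < overlap; i++) {
            char e = expected.charAt(i);
            char o = observed.charAt(i);
            if (nNoMismatch && (e == 'N' || o == 'N')) {
                continue; // N is treated as compatible with any base
            }
            if (e != o) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Option 1: a placeholder N as second barcode costs a mismatch by default.
        System.out.println(mismatches("N", "C", false));
        System.out.println(mismatches("N", "C", true));
        // Option 2: one extra base shifts the '+' and every base after it.
        System.out.println(mismatches("ACTG+GGTC", "ACTGT+GGTC", false));
    }
}
```

In the second case a single inserted base turns into several mismatches because everything after the insertion is compared out of register, which is the detection problem described for the single-index workaround.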