lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.78k stars 407 forks source link

Error while aligning ONT reads: wrong FASTA/FASTQ record. Continue anyway. #510

Closed zahrahaider closed 4 years ago

zahrahaider commented 4 years ago

Hi,

I have used guppy 2.1.3 to perform basecalling of cDNA sequencing reads generated by nanopore sequencing. Basecalling generates multiple fastq files of the same library so I use cat to merge all of them in a single fastq file. When I align my reads to the reference genome using minimap2 (-ax splice), most of the reads align but I get this error: "wrong FASTA/FASTQ record. Continue anyway." What does this error mean and how can I know which read in my fastq file is giving this error?

tseemann commented 4 years ago

Run seqtk fqchk -q0 reads.fastq and see if any errors appear.

Does the minimap2 error occur if you run it just on one of the smaller original FASTQ files?

zahrahaider commented 4 years ago

Thanks for your response! I have now run seqtk fqchk -q0 reads.fastq on the concatenated fastq file and there are no errors. Also, I just ran minimap2 on a single small fastq file and it didn´t give me any error. Is it possible to run minimap2 on multiple fastq files from the same library and merge the output as a single sam file?

tseemann commented 4 years ago
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

Just give it all the FASTQ files at once? eg. minimap2 database A.fq B.fq.gz C.fq.gz D.fq.gz

zahrahaider commented 4 years ago

Do you think I can use *.fastq files to add multiple files at once? There are ~1200 individual fastq files which makes the argument too long....!
I am also considering running basecalling again but this time with the option of producing one fastq file with all reads.

tseemann commented 4 years ago

If the command line becomes too long, there are various tricks you could use.

find /path/to/fastq/files -name "*.fastq" -print0 | xargs -0 cat | gzip > all_reads.fq.gz

Will put all the fastq into one fastq.

zahrahaider commented 4 years ago

Hi Torsten!

Combining fastq files as you suggested actually worked well for another library. I will probably redo guppy basecalling again on the faulty library and run it with minimap2.

Thanks alot for your help!

lh3 commented 4 years ago

When there is a wrong FASTA/FASTQ record, minimap2 now should report the read name of the last correct record. This will help you to identify the error.

mvheetve commented 4 years ago

Hi same issue here. Couldn't find the read name where it went wrong. Trimming and filtering with fastp, demultiplexing with minibar prior to minimap2. Stdout empty and stderr said the following:

[M::mm_idx_gen::12.278*0.98] collected minimizers [M::mm_idx_gen::20.184*0.99] sorted minimizers [M::main::20.202*0.99] loaded/built the index for 180253 target sequence(s) [M::mm_mapopt_update::20.913*0.99] mid_occ = 109 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 180253 [M::mm_idx_stat::21.425*0.99] distinct minimizers: 27188252 (39.67% are singletons); average occurrences: 3.564; average spacing: 2.964 [WARNING] wrong FASTA/FASTQ record. Continue anyway. [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.491*0.99] mapped 2 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.492*0.99] mapped 1 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.504*0.99] mapped 2 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.580*0.99] mapped 2 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.580*0.99] mapped 1 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.580*0.99] mapped 1 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.581*0.99] mapped 1 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.586*0.99] mapped 1 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::worker_pipeline::21.586*0.99] mapped 1 sequences [WARNING] wrong FASTA/FASTQ record. Continue anyway. [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 -ax splice /data/.../Homo_sapiens.GRCh37.75.cdna.all.fa /data/.../ONT_RUN3_trimmed_EBV.fastq.gz

Any idea why?

katievigil commented 2 months ago

@mvheetve did you figure this out? I aligned my blank reads against my environmental samples to take out any contamination using minimap2 and now Diamond is saying that my .fastq files generated by minimap2 are in the incorrect format.

mvheetve commented 2 months ago

Hi there @lucyintheskyzzz!

To be fair, this issue has been a while... I'm not entirely sure anymore how I fixed this, but I'd be happy to do some troubleshooting with you. Did minimap2 give you the same (or a different) warning or error? Also, what does Diamond say specifically? If you could include some files/snippets that would be great.

If I had to guess, I would say I removed the troublesome reads from the fastqs and reran minimap2.

Regards Mattias

katievigil commented 2 months ago

Hi @mvheetve I was able to align my samples using Minimap2, but all my output .fastq files are in the wrong format now and will not run on Diamond. I basically sequenced some blank samples and aligned them against my environmental samples using minimap2 then my output was supposed to be a new .fastq file without any of the blank contamination reads in my samples, but the .fastq files are now in the incorrect format. minimap2_blankfilter.v3.txt

Attached is my minimap2 code. It looks like minimap chopped away the read "@" name @ad22310b-780e-45ca-ad73-3d8fec2f55c6

See below an example of one of my .fastq files:

fastq file before minimap2 filter: (nanoplot-env) [kvigil@qbd2 sd_seawater]$ head sd6_SQK-NBD114-24_barcode21.fastq @ad22310b-780e-45ca-ad73-3d8fec2f55c6 TGTTACTGGTTCGTTACGTATTGCTAAGGTTAAGAGCCTCTCATTGTCTAGGTACCAGCACCTGTTTCCCAGTCACGATCATTCTGTTTCTATATCAAATACTAACTGCATTAAAATAATTCATCTAAGGTTTCATCAAATTCTATATCTTTATCAGATACTTCAGACAGTCTACCTGTTTCTGAATCATATACAACACTACATGCCATTCCAACATCGCCTGTATATCTCGATTTAAGAACACGTAGTCTAGTTGTCCTTGCTTCTTCAGCATCATCTGATTGTTGGTTTCTTTCTAATGCTATGACACAATCACTAAGTTGTCCGATACTATTAGAACCTCTAAGATGAGATAGAGAAACCTCTATACCATTCTCATGTCCTTTGTTACCATCAACTCTACGTAAGTGTGAAACTAAAATGATACCTGCACCTGTTTCTTCTACTAAACTTTTAAGCCTAGTCATAATAGAATCAATAGCTCTTCTCTCATCTCCTTCATGGACAGCACTAACTAACATATGTAAATGGTCTACTATCACCCACTTGCAGTCGCACCCTATAATCATAAAGCGAAGCTTAGTAAAGATATCATCTATGTCATTGGATCCAAAGATCGTGACTGGGAAACAGGTGCTGTAAAGCAGCAATGAGAGGCTCTTAACCTTAGCAATACG + $%#$$$'(((),''''6?BACEGEHSMLJJLD@@AAAACCB>;;<?74,*'&$$%%%''(*+,/1-'(*&'-001,,,,.43///004--ABCEGKSPMKRSRSPJPNISIGIKJL<<?C..422346=??@?AF=<<<;@==@@GKSPJKSOSLNMMMSKNOOSKSILPSKSMQIGSPISGFGGEEJ::9.((()DFFGLSJKEHKJSNSLPOOOSOLLPJMSOJPS;;<<HKKQHJSILLLMLHJSJSSLKSNJMPJHEEEGIIDAAABFFDEJMILSNSOSOSROMKMSMSRONRSLNKLNSJSNSJSRJJMSSJONNMSSRSSQMSLMSQOPORKOQISCHMHIA@@::9:;LPKNSSQLJJIJKBF9:AABPIMPKPSSSLONNLMECCCDEKLOMLNMMOKOOPSONSHKLLPNNSSSSKMMSKKKSNSSSLRPLKMLMSQNG:,+++,GEDEFDDBCA@C;999:98760///**)())0/00AJLHHQKKHHIKHJHKJJQSSSJMSLMKSSSSKSOSKOJSOPSLKJKRHIJFFHFABDGHFDEEGNIEBCCEEJNMKNMSSOIIJSSS???@AGMIIJIFLIIFEGFC=<>??@PKJHMKIIJFAA?BCDEGDEFE:97678?@@?)&&&.///---&%%%%%%&&'(&'/22('((((+( @7854a4c8-5e7c-48f4-87c6-c1dfde4641f6 TATGTTTTATTTTATTCGTTCAGTTACGTATTGCTAAGGTTAAGAGCCTCTCATTGTCCGTTCTCTACAGCACCTGTTTCCCAGTCACGATCCACGTTGTGGATCTGGGCGAGATTCTGACTTAGAGGCGTTCAGTCATAATCCCTGCGGATCGGTAGCTTCGCACCATTGGCTTATGAGCCAAGCACATAAACCAAATGTCTGAACCTGCGGTTCCTCTCGTACTGAGCAGGATTACCGTTGCAACGACTGCAGTCATCAGTAGGGTAAAACTAACCTGTCTCACGACGGTCTAAACCCAGCTCACGTTCCCTATTACGATCGTGACTGGGAAACAGGTGCTGTAGAGAACGGACAATGAGAGGCTCTTAACCTTAGCGATACG + %$$$###$##$&&%*+++/9:@ABB@A@@=??>>?BCBFHGGHFGHLIPQLJMKJNIRMKLHNLSSIFFIIMKIIQPILMSPIIKIJRMNHLQJSLKKRONKJKONKG?88><9?A?B>>>>??>??JQJMPGHKMKSMKGKDE96'&%%%&)1112:>>?BCFIGGNNLJKIEJLS77778LNRSJIJNHJIHGDILQMJLRNRMLKSSOLSSLSKPLLKHPMSPFNGGFIFDCACBDCEDCEGGGEE@@@@@SGMIOJIGGIHHLIMGQNSSOHKIHLMHKNLSSSLJJLSHPIHHJHISPSPIJGIHJHJNGINIFFIGJGFHGGFIMGGMJLISGFEFJHJJKFEDECDDFEEIJJGFFDDDHFEGEDFGDC877.--++) @4c2947d0-5cec-4fa0-bdc3-3ab10f7522d9 GTGTTATGTCTTAACCTACTCGTTCAGTTACGTATTGCTAAGGTTAAGAGCCTCTCATTGTCCGTTCTCTACAGCACCTGTTTCCCAGTCACGATCAAGATCGCAATTAAGCAGACAAATCACTCCACCAACTAAGAACGGCCATGCACCACCACCCACAAAATCAAGAAAGAGCTCTCAATCTGTCAATCCTACTTGTGTCTGGACCTGGTGAGTTTCCCCGTGTTGGGTCAAATTAAGCCGCGGGCTCCACGCCTGGTGGTGCCCTTCCGTCAATTTCTTTAAGTTTCAGCCTTGCGACCATACTCCCCCCAGAACCCAAAAACTTTGATTTCTCGTAAGGTGCCGATGGGGTCAAAACAATAACACCCACCGATCCCTAGTCGGCATAGTTTATGGTTAAGACTACGACGGTATCTGACCGTCTTCGATTCCCTAACTTTCGTTCACTGATTAATGAAAACATCCTTGGCAAATGCTTTCGCAGTAGTTTGTCTTCAATAAATCAGTGAACGAATCTCGGGGATGATAGGACTGGGAAACAGGTGCTGTAGAGAACGGACAATGAGAGGCTCTTAACCTTAGCAATTGGTA

fastq file after minimap2 filter: (nanoplot-env) [kvigil@qbd2 minimap2_fastq]$ head sd6_SQK-NBD114-24_barcode21_without_blank.fastq TGTTACTGGTTCGTTACGTATTGCTAAGGTTAAGAGCCTCTCATTGTCTAGGTACCAGCACCTGTTTCCCAGTCACGATCATTCTGTTTCTATATCAAATACTAACTGCATTAAAATAATTCATCTAAGGTTTCATCAAATTCTATATCTTTATCAGATACTTCAGACAGTCTACCTGTTTCTGAATCATATACAACACTACATGCCATTCCAACATCGCCTGTATATCTCGATTTAAGAACACGTAGTCTAGTTGTCCTTGCTTCTTCAGCATCATCTGATTGTTGGTTTCTTTCTAATGCTATGACACAATCACTAAGTTGTCCGATACTATTAGAACCTCTAAGATGAGATAGAGAAACCTCTATACCATTCTCATGTCCTTTGTTACCATCAACTCTACGTAAGTGTGAAACTAAAATGATACCTGCACCTGTTTCTTCTACTAAACTTTTAAGCCTAGTCATAATAGAATCAATAGCTCTTCTCTCATCTCCTTCATGGACAGCACTAACTAACATATGTAAATGGTCTACTATCACCCACTTGCAGTCGCACCCTATAATCATAAAGCGAAGCTTAGTAAAGATATCATCTATGTCATTGGATCCAAAGATCGTGACTGGGAAACAGGTGCTGTAAAGCAGCAATGAGAGGCTCTTAACCTTAGCAATACG + $%#$$$'(((),''''6?BACEGEHSMLJJLD@@AAAACCB>;;<?74,*'&$$%%%''(*+,/1-'(*&'-001,,,,.43///004--ABCEGKSPMKRSRSPJPNISIGIKJL<<?C..422346=??@?AF=<<<;@==@@GKSPJKSOSLNMMMSKNOOSKSILPSKSMQIGSPISGFGGEEJ::9.((()DFFGLSJKEHKJSNSLPOOOSOLLPJMSOJPS;;<<HKKQHJSILLLMLHJSJSSLKSNJMPJHEEEGIIDAAABFFDEJMILSNSOSOSROMKMSMSRONRSLNKLNSJSNSJSRJJMSSJONNMSSRSSQMSLMSQOPORKOQISCHMHIA@@::9:;LPKNSSQLJJIJKBF9:AABPIMPKPSSSLONNLMECCCDEKLOMLNMMOKOOPSONSHKLLPNNSSSSKMMSKKKSNSSSLRPLKMLMSQNG:,+++,GEDEFDDBCA@C;999:98760///**)())0/00AJLHHQKKHHIKHJHKJJQSSSJMSLMKSSSSKSOSKOJSOPSLKJKRHIJFFHFABDGHFDEEGNIEBCCEEJNMKNMSSOIIJSSS???@AGMIIJIFLIIFEGFC=<>??@PKJHMKIIJFAA?BCDEGDEFE:97678?@@?)&&&.///---&%%%%%%&&'(&'/22('((((+( TATGTTTTATTTTATTCGTTCAGTTACGTATTGCTAAGGTTAAGAGCCTCTCATTGTCCGTTCTCTACAGCACCTGTTTCCCAGTCACGATCCACGTTGTGGATCTGGGCGAGATTCTGACTTAGAGGCGTTCAGTCATAATCCCTGCGGATCGGTAGCTTCGCACCATTGGCTTATGAGCCAAGCACATAAACCAAATGTCTGAACCTGCGGTTCCTCTCGTACTGAGCAGGATTACCGTTGCAACGACTGCAGTCATCAGTAGGGTAAAACTAACCTGTCTCACGACGGTCTAAACCCAGCTCACGTTCCCTATTACGATCGTGACTGGGAAACAGGTGCTGTAGAGAACGGACAATGAGAGGCTCTTAACCTTAGCGATACG + %$$$###$##$&&%*+++/9:@ABB@A@@=??>>?BCBFHGGHFGHLIPQLJMKJNIRMKLHNLSSIFFIIMKIIQPILMSPIIKIJRMNHLQJSLKKRONKJKONKG?88><9?A?B>>>>??>??JQJMPGHKMKSMKGKDE96'&%%%&)1112:>>?BCFIGGNNLJKIEJLS77778LNRSJIJNHJIHGDILQMJLRNRMLKSSOLSSLSKPLLKHPMSPFNGGFIFDCACBDCEDCEGGGEE@@@@@SGMIOJIGGIHHLIMGQNSSOHKIHLMHKNLSSSLJJLSHPIHHJHISPSPIJGIHJHJNGINIFFIGJGFHGGFIMGGMJLISGFEFJHJJKFEDECDDFEEIJJGFFDDDHFEGEDFGDC877.--++) GTGTTATGTCTTAACCTACTCGTTCAGTTACGTATTGCTAAGGTTAAGAGCCTCTCATTGTCCGTTCTCTACAGCACCTGTTTCCCAGTCACGATCAAGATCGCAATTAAGCAGACAAATCACTCCACCAACTAAGAACGGCCATGCACCACCACCCACAAAATCAAGAAAGAGCTCTCAATCTGTCAATCCTACTTGTGTCTGGACCTGGTGAGTTTCCCCGTGTTGGGTCAAATTAAGCCGCGGGCTCCACGCCTGGTGGTGCCCTTCCGTCAATTTCTTTAAGTTTCAGCCTTGCGACCATACTCCCCCCAGAACCCAAAAACTTTGATTTCTCGTAAGGTGCCGATGGGGTCAAAACAATAACACCCACCGATCCCTAGTCGGCATAGTTTATGGTTAAGACTACGACGGTATCTGACCGTCTTCGATTCCCTAACTTTCGTTCACTGATTAATGAAAACATCCTTGGCAAATGCTTTCGCAGTAGTTTGTCTTCAATAAATCAGTGAACGAATCTCGGGGATGATAGGACTGGGAAACAGGTGCTGTAGAGAACGGACAATGAGAGGCTCTTAACCTTAGCAATTGGTA +

$&&'&%$$$###$$%&*&&'0/3222...//3444A>=>ACIFEHGFHHKIIHEGGHSCB;;8998456<<====:;96/56);B?@ABEHB@@@@BGED000018721125=<<=D9888978SHIHKDECCFHSMNJPSOIJNILIMJSG?IJIGJIGIISLSKNJIRSNQMKHMRKIPSSLJJLLMJEDFELLOSMLLNKPLSSLPSSKJHKSH?SJMO?????65556CEBCE@@IOHJOMSSMSOOOSNLLLJKSRSJSHMLLJHJISKKLKKPOKMIIH@@@ADSIICBCGEE:8888<?>CCEHKILMHMRLLPRSPJILINPKOLNOJRNOKPSGGJGJIMMIHNSPSSSJJMIKKSRLSIISJLSSSLSJSOSLIIJHHISSQMSKLIIIKJLHJHEEFHSNOSHKHKPHHJJH98<<0//=/ADEFRMMRHSJLLSIMSLIIJGLOPKSNIKMOOIOQKIKSSLPSKLQQJHMJJIHKJLIHHJIISILIJ87787>>=.+**&'&&&(,20(('%%%%%'336;FLFIGGJFFHHSFHGFGKLJRJOGGIDGMKJIEIGFFEBGGBA897676'&&&$

GTGTATGTACTGTACTTCGTTCAGTTACGTATTGCTAAGGTTAAGAGCCTCTCATTGTCCGTTCTCTACAGCACCTGTTTCCCAGTCACGATCTTCGCTTTAATACTGTCTTCCTTTACTATTGTTGGGGAACCTACTCTAAAGGAAGACAGTATTAAGAGAAGATCTTGGATATACCCAATAGTTGAAGATGGCTATACGAAAATTCAATATAAAACTCAAGGAATGCTGAAACCAATAGAATAATGAAGCACATAGTTCCATGCCCAAAATGTGAATCTCTTTCTATCGAGGTTGTTTGTATTAAATATGATGAAGAAAATTGTGCTGTAAGAAGGAAATTATGTGCGTGATTGTGAACATAGGTTTTACACGCTGCAATACCCCGAAGCCATTGTATCTAATAGGGAAATTAAATATACAAAAGGTGGAAGAACAGTTATTGTTTTAGATTTATAAAGAAAACGTTAAGCAACTCAGGAGAGTGTAAAATAAAGTTCTGAAACCCTAACATTCATGACTCACGAACACGGCACAATCAAATGCAGAATTATTAGCGACTGTAAGAGAGGTCTCGAAGCTATAATGATCGATCTAGCTGATCGTGACTGGGAAACAGGTGCTGTAGAGAACGGACAATGAGAGGCTCTTAACCTTAGCAATACG

mvheetve commented 2 months ago

Hej @lucyintheskyzzz,

thanks for sharing. Are you sure the problem lies with minimap2? Minimap2 does not output fastqs but sam files, and looking at the jobscript you shared, your script has a step Checkpoint 3where awkand grep are used to filter specific read IDs (all reads with *in the third column of the minimap2 output) from the original fastq that was used as input for minimap2. Judging by the code, grep only removes the read IDs (so the first of the four lines that constitute a read) from your fastq. This would indeed leave you with an output like the one you've shared.

Have a look at the grepmanual at https://man7.org/linux/man-pages/man1/grep.1.html and more specifically the "Context Line Control" options. I think this might fix your problem.

Best Mattias

katievigil commented 2 months ago

Hi @mvheetve thank you for your comment. I am still a novice at coding. I actually have all the .sam files from the minimap2 output. I was trying to figure out the best way to extract the samples that did not align with my blank reads and convert them to .fastq files for downstream annotation analysis. I will check out the grep manual thanks for sharing! Also, if you could think of a better way to do this please share! Thank you for your generosity in looking into my issue deeper!

mvheetve commented 2 months ago

No worries, we are all here to learn and help. I've been on your end of the conversation a million times and probably will be a million more :)

Have a look at this https://www.biostars.org/p/172737/, it discusses exactly what you're already working on. Grep is an efficient way to remove lines from text files (which is all sam and fastq files are). Removing the reads from your sam has the benefit that in a sam file, each read is only one line instead of four like in fastq files. So you could remove the reads from the sam and then use samtools fastq (http://www.htslib.org/doc/samtools-fasta.html) to convert the sam to a fastq.

Regards M