Open teixeiramm opened 3 years ago
This method was modified/fixed for a similar issue in Sept 2020, https://github.com/nextgenusfs/funannotate/commit/2a4674964e207ebc9885abe831153115b4e6766e. Please try the most recent version and let me know if you are getting the same error.
I have exactly the same error, using funannotate v1.8.9. I also used the suggested command fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files. Please let me know how to solve this. Thanks!
If you have the same error as above, raise ValueError("Sequence and quality captions differ."), then hopefully this makes sense: the 3rd line of each record should not have a header, i.e. it should just be +. You could fix the above with something like sed:
sed 's/^+SRR9289321\..*$/+/'
Yes! Indeed the third line is incorrect. thanks! : )
Hi Jon and Jason,
I am still having the same problem and was not able to fix my FASTQ files. I am trying to train funannotate on a Sarocladium genome with mRNA FASTQ reads, and even after downloading the reads using the JS protocol I am still having trouble.
I am using the following command (funannotate 1.8):
funannotate train -i CA16.masked.fasta -o fun --left SRR9289321_1.fastq --right SRR9289321_2.fastq --species "Sarocladium aurantiaca" --isolate CA16 --stranded FR --jaccard_clip --cpus $CPU
My FASTQ reads look like this:
@1/1
NTCAAGAAATGTTGTCAATGCCGATGTCGAAGATAAAAAAGAAAAAACAGCGTTAGCTTGTCGTTTTGCATGTCCGTCAGATCCCAGAGAAAGCGAAGAAAAAGTCTATACAGCCCGAAGATGTAGATTTGTGCCGCTGTGCAAAATAGA
+SRR9289321.1 1 length=150
FFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,:FFFFF:FFFFFFFFFFFF:FFF:FFF:FFF,FFFFF,FFFFF::F::F:FFFFFF:FF:FF,FFF:F::FFFFFFFF:,FFFFFF:FFFFFFF:FFFFFFF,,:,:FFFF
@2/1
NTTCTCGGTATCACAGGCAAGGACTTCACCTTGATCGCCGCATCGAAAGCGGCGATGCGAGGTGCCACTATCCTTAAAGCTTCCGACGACAAGACAAGAGCCCTCAACAAGAACACTCTCCTGGCGTTCTCAGGAGAAGCCGGTGACACA
+SRR9289321.2 2 length=150
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFF
@3/1
CTAACATATTTAAAGTTTGAGAATGGATGAAGGCTAAATAGCGCCCCCGGGTCCCTAAACATTAGCTTTACCTCAAAAAACTGAGTTAAAAAATGCTATACTGAGAGATAGGAAGAGCACACGTCTGAACTCCACTCAAGGTACCATATC
+SRR9289321.3 3 length=150
FF,:FFFF,FFFFF:FF:,F::F,FF:,FFF,F,FFFFFF:::,F::FF,FFF:F,FF,,F,,,,:F:F:F,FFF,FFFFFFF:,FF,F:,F,F,,FF,:,F,,::,FF,,:,FFF,,FFF:,,F,,FF,F,:F,F,F,:FFF,,,F,F:
The error is still the same:
[Dec 08 02:42 PM]: Running 1.8.10
Traceback (most recent call last):
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/bin/funannotate", line 8, in <module>
sys.exit(main())
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/funannotate.py", line 710, in main
mod.main(arguments)
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/train.py", line 980, in main
lib.CheckFASTQandFix(trim_reads[0], trim_reads[1], cpus=args.cpus)
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/library.py", line 593, in CheckFASTQandFix
for read1, read2 in zip(file1, file2):
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/Bio/SeqIO/QualityIO.py", line 950, in FastqGeneralIterator
raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.
I appreciate your help on this
I think it is the 3rd line, i.e. the second header of each record. Change the 3rd line to just a + with no other information.
Jon, I feel ashamed but could you please send me the sed command that would work for this? My skills are very limited. I appreciate it.
Untested, but sed for this dataset might be:
sed 's/^+SRR9289321.*$/+/g' R1 > new_R1.fq
sed 's/^+SRR9289321.*$/+/g' R2 > new_R2.fq
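For anyone less comfortable with sed, here is a rough Python sketch of the same idea that does not depend on the accession number. It assumes strict 4-line FASTQ records (no wrapped sequence or quality lines); the function name strip_plus_captions is made up for this illustration and is not part of funannotate.

```python
import io


def strip_plus_captions(lines):
    """Yield FASTQ lines with the third line of every record reduced to '+'.

    Assumption: strict 4-line records, so the separator is always at
    zero-based line index 2, 6, 10, ...
    """
    for index, line in enumerate(lines):
        if index % 4 == 2:
            if not line.startswith("+"):
                raise ValueError(f"line {index + 1} is not a '+' separator: {line!r}")
            yield "+\n"
        else:
            yield line


# Tiny in-memory example shaped like the records posted in this thread.
example = io.StringIO(
    "@1/1\n"
    "NTCAAG\n"
    "+SRR9289321.1 1 length=150\n"
    "FFFF:F\n"
)
fixed = "".join(strip_plus_captions(example))
print(fixed, end="")
```

Working by line position rather than by pattern means a quality line that happens to begin with + is never rewritten. For real data the same generator could be fed from gzip.open(path, "rt") and its output written to a new file.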
The read files look like this now, but I am still getting the same error. 😶
@1/1
NTCAAGAAATGTTGTCAATGCCGATGTCGAAGATAAAAAAGAAAAAACAGCGTTAGCTTGTCGTTTTGCATGTCCGTCAGATCCCAGAGAAAGCGAAGAAAAAGTCTATACAGCCCGAAGATGTAGATTTGTGCCGCTGTGCAAAATAGA
+
FFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,:FFFFF:FFFFFFFFFFFF:FFF:FFF:FFF,FFFFF,FFFFF::F::F:FFFFFF:FF:FF,FFF:F::FFFFFFFF:,FFFFFF:FFFFFFF:FFFFFFF,,:,:FFFF
@2/1
NTTCTCGGTATCACAGGCAAGGACTTCACCTTGATCGCCGCATCGAAAGCGGCGATGCGAGGTGCCACTATCCTTAAAGCTTCCGACGACAAGACAAGAGCCCTCAACAAGAACACTCTCCTGGCGTTCTCAGGAGAAGCCGGTGACACA
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFF
@3/1
CTAACATATTTAAAGTTTGAGAATGGATGAAGGCTAAATAGCGCCCCCGGGTCCCTAAACATTAGCTTTACCTCAAAAAACTGAGTTAAAAAATGCTATACTGAGAGATAGGAAGAGCACACGTCTGAACTCCACTCAAGGTACCATATC
+
FF,:FFFF,FFFFF:FF:,F::F,FF:,FFF,F,FFFFFF:::,F::FF,FFF:F,FF,,F,,,,:F:F:F,FFF,FFFFFFF:,FF,F:,F,F,,FF,:,F,,::,FF,,:,FFF,,FFF:,,F,,FF,F,:F,F,F,:FFF,,,F,F:
Hey Jon,
I'm having the same problem as this. I downloaded from SRA using the command on the Trinity site that you suggested. I'm using the latest version of Funannotate and I even tried the sed command you suggested above "sed 's/^+SRR9289321.*$/+/g' R1 > new_R1.fq" modified for my dataset of course. My final files look like the above as well, with the third line header a simple + symbol. I also tried the suggestion from Jason to use BBtools but that didn't work either.
I know the last message on here was almost a year ago! So maybe this got resolved and never updated. But hey, I thought I'd ask. Any thoughts on this?
I am still having issues with this.
Can you repost how you are doing this, to make it more reproducible? The SRA file perhaps, and the command? I think this just needs a concerted effort to get these headers fixed in a systematic way. Or is it just the stuff at the top?
SRA download command: fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR13498932
The raw output from that doesn't work. Following which I tried:
BBtools repair.sh
Which resulted in output that looked like this but still didn't work.
@1/1
CAAAAAGCAAGCAAGCAAGCACGGGACGAATAAATGCATGTAAACTACTACGCGGCGGGGGGAGGGGTGAAATCATAAAGTCGTGACGTGCGAGGAGGTATGGCGTGCATGCGGTGCGACAGGACGCGCGTCCAGCCAGAGGCGAGACGA
+1/1
AAFFFJJJJJJJJJJJJ-JJJ-JJ77J7JJJ7JJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJ7JJJJJJJJJ-JJJJ-J7JJJ7JJJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJJJJJJ7JJJJJJ-J7JJ7
@2/1
TAGCTCCATAACTTGTCCAACCCTGTCATTGGCCTCGCGCACTCGCCGCATTTCCTCGTCTAGTGTTCGCACTCTACTGCCCACGCGGACAGCAGTTCCTGCCGTGCTTGAGACCTTGTCTGAGACTAGAAAGGCTTCATGGTGCAAGTC
+2/1
AAFFFJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJ-JJ-JJJJJJJJ7JJ7JJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ7JJJJJ7JJJJJJJJJJJJJJ7JJJ7JJJJJ-JJJJ-JJJJJJJJJJJJJ7JJJJJJ7JJJ
@3/1
Then I tried:
sed 's/^+SRR13498932.*$/+/g' R1 > new_R1.fq
sed 's/^+SRR13498932.*$/+/g' R2 > new_R2.fq
Which resulted in output that looked like this but still didn't work.
head new_R1.fastq
@1/1
CAAAAAGCAAGCAAGCAAGCACGGGACGAATAAATGCATGTAAACTACTACGCGGCGGGGGGAGGGGTGAAATCATAAAGTCGTGACGTGCGAGGAGGTATGGCGTGCATGCGGTGCGACAGGACGCGCGTCCAGCCAGAGGCGAGACGA
+
AAFFFJJJJJJJJJJJJ-JJJ-JJ77J7JJJ7JJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJ7JJJJJJJJJ-JJJJ-J7JJJ7JJJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJJJJJJ7JJJJJJ-J7JJ7
@2/1
TAGCTCCATAACTTGTCCAACCCTGTCATTGGCCTCGCGCACTCGCCGCATTTCCTCGTCTAGTGTTCGCACTCTACTGCCCACGCGGACAGCAGTTCCTGCCGTGCTTGAGACCTTGTCTGAGACTAGAAAGGCTTCATGGTGCAAGTC
+
AAFFFJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJ-JJ-JJJJJJJJ7JJ7JJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ7JJJJJ7JJJJJJJJJJJJJJ7JJJ7JJJJJ-JJJJ-JJJJJJJJJJJJJ7JJJJJJ7JJJ
@3/1
GGGATGCCTGGCATGGACGGCAGGGGTGGCGCCGACGCTAGCCAGTTCATGCGCGACGTCGCCGTGGTTATGCGCCGCCGGCACGACGACGGCGAGCTGACATCACAGCAGATCGAGGAGATCATTGCCGGGCTCCCGCGGCTGTCGGAG
It's probably an SRA issue -- just get the raw data from EBI:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR134/032/SRR13498932/SRR13498932_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR134/032/SRR13498932/SRR13498932_2.fastq.gz
BTW this tool will save you a lot of time... https://sra-explorer.info/#
I suspect the raw data will work right away, unless they have already been modified to some non-standard Illumina header.
That appears to have done the job, thanks man!
----Edit----
I spoke too soon. I'm now getting this from the funannotate-train.log
R1 header: SRR13498932.1 1/1 and R2 header: SRR13498932.1 1/1 are not 1 and 2 as expected
[08/31/22 22:24:39]: ERROR: FASTQ headers are not properly paired, see logfile and reformat your FASTQ headers
It threw that with the raw data from EBI that you suggested. Then I thought I'd try repairing with BBTools repair.sh, and I still get the same error. Here is what the repaired FASTQ files look like:
head SRR13498932_repaired
==> SRR13498932_1.repaired.fastq <==
@SRR13498932.1 1/1
CAAAAAGCAAGCAAGCAAGCACGGGACGAATAAATGCATGTAAACTACTACGCGGCGGGGGGAGGGGTGAAATCATAAAGTCGTGACGTGCGAGGAGGTATGGCGTGCATGCGGTGCGACAGGACGCGCGTCCAGCCAGAGGCGAGACGA
+
AAFFFJJJJJJJJJJJJ-JJJ-JJ77J7JJJ7JJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJ7JJJJJJJJJ-JJJJ-J7JJJ7JJJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJJJJJJ7JJJJJJ-J7JJ7
@SRR13498932.2 2/1
TAGCTCCATAACTTGTCCAACCCTGTCATTGGCCTCGCGCACTCGCCGCATTTCCTCGTCTAGTGTTCGCACTCTACTGCCCACGCGGACAGCAGTTCCTGCCGTGCTTGAGACCTTGTCTGAGACTAGAAAGGCTTCATGGTGCAAGTC
+
AAFFFJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJ-JJ-JJJJJJJJ7JJ7JJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ7JJJJJ7JJJJJJJJJJJJJJ7JJJ7JJJJJ-JJJJ-JJJJJJJJJJJJJ7JJJJJJ7JJJ
@SRR13498932.3 3/1
GGGATGCCTGGCATGGACGGCAGGGGTGGCGCCGACGCTAGCCAGTTCATGCGCGACGTCGCCGTGGTTATGCGCCGCCGGCACGACGACGGCGAGCTGACATCACAGCAGATCGAGGAGATCATTGCCGGGCTCCCGCGGCTGTCGGAG

==> SRR13498932_2.repaired.fastq <==
@SRR13498932.1 1/2
NACCGCCCGCGCCCGTAGCGAGCACGATCGTGCACCTGAGCGTGCGCGCGGGGCCGCCGCCCGCGGAAGACGGGTGGACGAAGACAAGGAGGACGAAGGCCGCGGCGGCGGAGGCGGGCGCGGGCGCGGGGGAGGTGGGCGCCGAGGGCG
+
!AFFFJ-JJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ-J7JJJJ77JJ-JJJJJ-J-J-JJJJ7JJJJJ7JJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJ
@SRR13498932.2 2/2
NTTTACCTCAGATTCTCTCTGCGCTTTCGTCGCTTGAATCAGAAGAGGCCGAGTTATCGGCATCATTGCAGGATCTGCTCGCGTCTCAGGATCCCATCCTTGCGTCTCTTTCTCGGCTCCAGGACCTATCGCCTCGACTAGATGACTTGC
+
!AFFFJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJ
@SRR13498932.3 3/2
How about number of reads in each file? Oh, that's too bad that EBI didn't keep the original Illumina headers.
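One way to check the read counts and the pairing at the same time is sketched below in Python. It assumes strict 4-line records and mates listed in the same order in both files; read_ids and check_pairs are hypothetical helper names, not funannotate functions.

```python
import io
from itertools import zip_longest


def read_ids(lines):
    """Yield each title line without the leading '@' and a trailing /1 or /2."""
    for index, line in enumerate(lines):
        if index % 4 == 0:  # title line of a strict 4-line record
            title = line[1:].strip()
            if title.endswith(("/1", "/2")):
                title = title[:-2]
            yield title


def check_pairs(left_lines, right_lines):
    """Return the number of read pairs; raise ValueError on the first problem."""
    count = 0
    for left, right in zip_longest(read_ids(left_lines), read_ids(right_lines)):
        if left is None or right is None:
            raise ValueError(f"files differ in length after {count} pairs")
        if left != right:
            raise ValueError(f"pair {count + 1} mismatch: {left!r} vs {right!r}")
        count += 1
    return count


# Shortened records using the EBI-style titles from this thread.
left = io.StringIO(
    "@SRR13498932.1 1/1\nCAAA\n+\nAAFF\n"
    "@SRR13498932.2 2/1\nTAGC\n+\nAAFF\n"
)
right = io.StringIO(
    "@SRR13498932.1 1/2\nNACC\n+\n!AFF\n"
    "@SRR13498932.2 2/2\nNTTT\n+\n!AFF\n"
)
pairs = check_pairs(left, right)
print(pairs)
```

This only confirms that the two files are in sync; it says nothing about whether a downstream tool accepts the header format itself.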
They look like they've got the same number of reads in each. Here is the tail:
==> SRR13498932_1.repaired.fastq <==
+
<AA<FF--FFF-7-F77FFF-7F77-7FFFF77FFFFFF-FF-F7F-FFFFFFFF7FFFFF-F7-F7F-FFFF-F7-7-FFFF-FFFF-F7FFFFF77FF77F-FFFFFFFFFF7FFF-FFFFFFFFFFF--FFFFF<<FFFFFF-FFFF
@SRR13498932.31908094 31908094/1
TTGCTATCCAATGAGGCATAACCTCACCTACGCTCAACGACATCCCCGACAGACGTGTTTTGACCGAGAGAACGGTCTTATATCCTCGATCGCCAAAGACTATCGGGAACAGCTCGCTCAAGATGTCTACGACGCTGGATTATGCCTCAT
+
AAAFFFFFFFFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFF-FFFFFFFFFFFFF7FFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR13498932.31908095 31908095/1
CTGAGGTATAGGACAGCCTTCCACTCGATCCAGTTGAGCGGGGTAATCGCGAACAGCGCGGTGAAGACAGGGATATACAAGATGGCAAAGTGAAGACCCATGCTCAGCGCGATCGCAGCGACGAGGAAGGGGTTCTTCCACAGCGGGAGG
+
AAAFFFF7FFFFF7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFFFFFFFF

==> SRR13498932_2.repaired.fastq <==
+
-A-<FJJ-JJ7JJJJJJ7-JJ-JJ7--JJJ7JJJJJ7J7JJ-77J7-J--JJJJJJ-7JJ-7-JJJ-JJ-JJJJJJJJJ7-----7JJJJ-J-7J-JJ-7JJ-JJJJ-7-J7J-JJ--JJJJ7-JJJJ--7J7JJJJJ-JJ7JJ--J-JJ
@SRR13498932.31908094 31908094/2
GCAGGGGCTTTCGTCGAGATGGAGGAAAATCCGGGTGGTGGAGGTAGACTCCGGTACTTGCTATCCAATGAGGCATAATCCAGCGTCGTAGACATCTTGAGCGAGCTGTTCCCGATAGTCTTTGGCGATCGAGGATATAAGACCGTTCTC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJ
@SRR13498932.31908095 31908095/2
CCAACTGACGCACTTCCACCCATGTGCCTCGCAATTCCCCGAGATTGGCTGTGAGATGTTCACGAACGCGATGGCGCAGCGCGCGACGACTGTCAGCCTGTCCATCCTCGTCACCGTCGAAATGTTCAACGCTATGAACTCGCTCTCCGA
+
AAFFFJJ-JJJJJJ7JJJJJ-JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJJJJJJJJJJJJJJJJJJJ7J7JJJJJ-JJJ7JJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJ
Okay, sorry this is not working great. It's a requirement of Trinity and not of funannotate, so I tried to parse the headers early on to ensure it will work. I can of course replicate this locally:
# download reads
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR134/032/SRR13498932/SRR13498932_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR134/032/SRR13498932/SRR13498932_2.fastq.gz
# check that all is well
$ seqkit stats SRR13498932_*
file                    format  type    num_seqs        sum_len  min_len  avg_len  max_len
SRR13498932_1.fastq.gz  FASTQ   DNA   31,908,095  4,786,214,250      150      150      150
SRR13498932_2.fastq.gz  FASTQ   DNA   31,908,095  4,786,214,250      150      150      150
# subsampled to 10k reads and try in funannotate train
$ funannotate train -i test.softmasked.fa --stranded RF -l ../Downloads/SRR13498932x_1.fastq.gz -r ../Downloads/SRR13498932x_2.fastq.gz -o fake_test --cpus 6
[09/01/22 09:08:20]: Quality trimmed reads: ('fake_test/training/trimmomatic/trimmed_left.fastq.gz', 'fake_test/training/trimmomatic/trimmed_right.fastq.gz', None)
[09/01/22 09:08:20]: R1 header: SRR13498932.1 1/1 and R2 header: SRR13498932.1 1/2 are not 1 and 2 as expected
[09/01/22 09:08:20]: ERROR: FASTQ headers are not properly paired, see logfile and reformat your FASTQ headers
So the logic in the CheckFASTQandFix function is:
What is happening here is that the header contains a space and ends with /1 or /2 -- this will apparently fail in Trinity. So we need to make the headers look like one of these two options.
Headers look like this from the EBI:
@SRR13498932.1 1/1
So we can fix these with sed to remove the space and the duplicated read number, i.e. something like:
pigz -dc SRR13498932x_1.fastq.gz | sed 's,[[:space:]][[:digit:]]*,,g' | pigz > left_reads.fastq.gz
pigz -dc SRR13498932x_2.fastq.gz | sed 's,[[:space:]][[:digit:]]*,,g' | pigz > right_reads.fastq.gz
Now the headers look like this:
@SRR13498932.1/1
CAAAAAGCAAGCAAGCAAGCACGGGACGAATAAATGCATGTAAACTACTACGCGGCGGGGGGAGGGGTGAAATCATAAAGTCGTGACGTGCGAGGAGGTATGGCGTGCATGCGGTGCGACAGGACGCGCGTCCAGCCAGAGGCGAGACGA
+
AAFFFJJJJJJJJJJJJ-JJJ-JJ77J7JJJ7JJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJ7JJJJJJJJJ-JJJJ-J7JJJ7JJJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJJJJJJ7JJJJJJ-J7JJ7
@SRR13498932.1/2
NACCGCCCGCGCCCGTAGCGAGCACGATCGTGCACCTGAGCGTGCGCGCGGGGCCGCCGCCCGCGGAAGACGGGTGGACGAAGACAAGGAGGACGAAGGCCGCGGCGGCGGAGGCGGGCGCGGGCGCGGGGGAGGTGGGCGCCGAGGGCG
+
#AFFFJ-JJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ-J7JJJJ77JJ-JJJJJ-J-J-JJJJ7JJJJJ7JJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJ
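The same header clean-up can be sketched in Python, touching only the title line of each record rather than every line as the sed one-liner does. This assumes strict 4-line records and the EBI-style title shown above (accession.N, a space, then N/1 or N/2); normalize_titles and EBI_TITLE are hypothetical names chosen for this example.

```python
import io
import re

# Assumed title shape: "@<accession>.<n> <n>/<mate>", e.g. "@SRR13498932.1 1/1".
EBI_TITLE = re.compile(r"^@(\S+)\s+\d+/([12])$")


def normalize_titles(lines):
    """Yield FASTQ lines with titles rewritten to "@<accession>.<n>/<mate>"."""
    for index, line in enumerate(lines):
        if index % 4 == 0:  # only the title line of each 4-line record
            match = EBI_TITLE.match(line.rstrip())
            if match:
                line = f"@{match.group(1)}/{match.group(2)}\n"
        yield line


example = io.StringIO("@SRR13498932.1 1/1\nCAAA\n+\nAAFF\n")
normalized = "".join(normalize_titles(example))
print(normalized, end="")
```

Titles that do not match the assumed pattern are passed through unchanged, so the sketch is safe to run on files that have already been fixed.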
You also need to remove the temporary FASTQ files in your output directory; if they already exist, they will be re-used.
The fastq-dump method should work -- the error there is actually funannotate's problem, as I'm using Biopython to parse the FASTQ files and it apparently chokes when there is a quality score header. So we should just replace the FASTQ parser here and that will be fixed -- I'll put this on my list for future fixes.
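To illustrate the failure mode, below is a simplified stand-in for a strict FASTQ parser. It is not Biopython's code; it only mirrors the rule implied by the error message, on the assumption that a bare + line is accepted while a + line whose caption differs from the @ title is rejected. The name iter_fastq is hypothetical.

```python
import io


def iter_fastq(handle):
    """Yield (title, sequence, quality) from strict 4-line FASTQ records."""
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # end of file
        if not title_line.startswith("@"):
            raise ValueError("Records in FASTQ files should start with '@'.")
        title = title_line[1:].rstrip()
        sequence = handle.readline().rstrip()
        plus_line = handle.readline()
        if not plus_line.startswith("+"):
            raise ValueError("Expected a '+' separator line.")
        second_title = plus_line[1:].rstrip()
        # An empty caption is fine; a non-empty one must repeat the title exactly.
        if second_title and second_title != title:
            raise ValueError("Sequence and quality captions differ.")
        quality = handle.readline().rstrip()
        if len(quality) != len(sequence):
            raise ValueError("Lengths of sequence and quality values differ.")
        yield title, sequence, quality


good = "@1/1\nNTCA\n+\nFFFF\n"
bad = "@1/1\nNTCA\n+SRR9289321.1 1 length=150\nFFFF\n"

records = list(iter_fastq(io.StringIO(good)))
try:
    list(iter_fastq(io.StringIO(bad)))
    message = None
except ValueError as err:
    message = str(err)
print(records, message)
```

With --defline-seq '@$sn[_$rn]/$ri' the title becomes 1/1 while the + line keeps the original SRA name, which is the mismatch this kind of check would trip on.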
That's done it, it's progressed through to read normalization now. Thank you so much for taking the time to dig into this and figure it out.
Are you using the latest release? funannotate-1.8
Describe the bug
I am trying to train funannotate with a set of mRNA FASTQ data downloaded from SRA with the following command:
fastq-dump.2.10.8 --gzip --defline-seq '@$sn[_$rn]/$ri' --split-files SRR9289321
But I am still getting the error described below. The downloaded reads look like this:
@1/1
NTCAAGAAATGTTGTCAATGCCGATGTCGAAGATAAAAAAGAAAAAACAGCGTTAGCTTGTCGTTTTGCATGTCCGTCAGATCCCAGAGAAAGCGAAGAAAAAGTCTATACAGCCCGAAGATGTAGATTTGTGCCGCTGTGCAAAATAGA
+SRR9289321.1 1 length=150
FFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,:FFFFF:FFFFFFFFFFFF:FFF:FFF:FFF,FFFFF,FFFFF::F::F:FFFFFF:FF:FF,FFF:F::FFFFFFFF:,FFFFFF:FFFFFFF:FFFFFFF,,:,:FFFF
@2/1
NTTCTCGGTATCACAGGCAAGGACTTCACCTTGATCGCCGCATCGAAAGCGGCGATGCGAGGTGCCACTATCCTTAAAGCTTCCGACGACAAGACAAGAGCCCTCAACAAGAACACTCTCCTGGCGTTCTCAGGAGAAGCCGGTGACACA
+SRR9289321.2 2 length=150
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFF
@3/1
CTAACATATTTAAAGTTTGAGAATGGATGAAGGCTAAATAGCGCCCCCGGGTCCCTAAACATTAGCTTTACCTCAAAAAACTGAGTTAAAAAATGCTATACTGAGAGATAGGAAGAGCACACGTCTGAACTCCACTCAAGGTACCATATC
+SRR9289321.3 3 length=150
FF,:FFFF,FFFFF:FF:,F::F,FF:,FFF,F,FFFFFF:::,F::FF,FFF:F,FF,,F,,,,:F:F:F,FFF,FFFFFFF:,FF,F:,F,F,,FF,:,F,,::,FF,,:,FFF,,FFF:,,F,,FF,F,:F,F,F,:FFF,,,F,F:
@4/1
GTAAGAAGCTCTTGTTACTTAATTGAACGTGGGCATTCGAATGTATCAACACTAGTGGGCCATTTTTGGTAAGCAGAACTGGCGATGCGGGATGAACCGAACGCGAGGTTAAGGTGCCAGAGTAGACGCTCATCAGACACCACAAAAGGA
+SRR9289321.4 4 length=150
FFFFFFFFFFF:F:FFF,:FFFFF:F:,F::FFFF:FF,FF::FFFF:,FFFFFF:FF:FFFF:FF:F:FFFF:FF:FFF,F,FFF,FFFFFFFFFF,FFF,FFFFFF:FFFF:F:FFF:FFF,,FFFFF::FFFFFFFFFFFFFFF,,,
What command did you issue?
funannotate train -i CA16.masked.fasta -o fun --left SRR9289321_1.fastq.gz --right SRR9289321_2.fastq.gz --species "Sarocladium aurantiaca" --isolate CA16 --stranded RF --jaccard_clip --cpus $CPU
Logfiles
Loading funannotate/1.8
Loading requirement: miniconda3/4.3.31 jellyfish/2.3.0 lp_solve/5.5 EVM/1.1.1-live genemarkESET/4.62_lic bamtools/2.4.0 hmmer/3 gmap/2019-03-04 bedtools/2.29.2 tbl2asn/25.8 boost/1.68.0 augustus/3.3.3 diamond/2.0.8 exonerate/2.2.0 ncbi-blast/2.2.31+ cdbfasta/06272015 htslib/1.11 samtools/1.11 braker/2.1.4 mummer/4.0 RAxML/8.2.12 mafft/7.427 trimal/1.4.1 java/13.0.1 trimmomatic/0.36 exonerate/2.4.1 blat/35 tRNAscan/1.3.1 proteinortho/6.0.20 kent/318 signalp/5.0b fasta/36.3.8h bowtie2/2.4.1 salmon/0.9.1 java/13 trinity-rnaseq/2.11.0 hisat2/2.2.1 kallisto/0.46.2 gmap/2017-11-15 PASA/2.4.1 minimap2/2.20 htslib/1.12 samtools/1.12 stringtie/2.1.1 CodingQuarry/2.0
[Aug 03 08:02 AM]: OS: CentOS Linux 7, 64 cores, ~ 528 GB RAM. Python: 3.8.5
[Aug 03 08:02 AM]: Running 1.8.8
[Aug 03 08:02 AM]: Adapter and Quality trimming PE reads with Trimmomatic
Traceback (most recent call last):
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/bin/funannotate", line 8, in <module>
sys.exit(main())
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/funannotate.py", line 705, in main
mod.main(arguments)
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/train.py", line 980, in main
lib.CheckFASTQandFix(trim_reads[0], trim_reads[1], cpus=args.cpus)
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/library.py", line 593, in CheckFASTQandFix
for read1, read2 in zip(file1, file2):
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/Bio/SeqIO/QualityIO.py", line 950, in FastqGeneralIterator
raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ
OS/Install Information
funannotate check --show-versions