nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

ValueError: Sequence and quality captions differ. - Train step #634

Open teixeiramm opened 3 years ago

teixeiramm commented 3 years ago

Are you using the latest release? funannotate-1.8

Describe the bug

I am trying to train funannotate with a set of mRNA fastq data downloaded from SRA with the following command:

fastq-dump.2.10.8 --gzip --defline-seq '@$sn[_$rn]/$ri' --split-files SRR9289321

But I am still getting the error described below. The downloaded reads look like this:

@1/1
NTCAAGAAATGTTGTCAATGCCGATGTCGAAGATAAAAAAGAAAAAACAGCGTTAGCTTGTCGTTTTGCATGTCCGTCAGATCCCAGAGAAAGCGAAGAAAAAGTCTATACAGCCCGAAGATGTAGATTTGTGCCGCTGTGCAAAATAGA
+SRR9289321.1 1 length=150
FFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,:FFFFF:FFFFFFFFFFFF:FFF:FFF:FFF,FFFFF,FFFFF::F::F:FFFFFF:FF:FF,FFF:F::FFFFFFFF:,FFFFFF:FFFFFFF:FFFFFFF,,:,:FFFF
@2/1
NTTCTCGGTATCACAGGCAAGGACTTCACCTTGATCGCCGCATCGAAAGCGGCGATGCGAGGTGCCACTATCCTTAAAGCTTCCGACGACAAGACAAGAGCCCTCAACAAGAACACTCTCCTGGCGTTCTCAGGAGAAGCCGGTGACACA
+SRR9289321.2 2 length=150
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFF
@3/1
CTAACATATTTAAAGTTTGAGAATGGATGAAGGCTAAATAGCGCCCCCGGGTCCCTAAACATTAGCTTTACCTCAAAAAACTGAGTTAAAAAATGCTATACTGAGAGATAGGAAGAGCACACGTCTGAACTCCACTCAAGGTACCATATC
+SRR9289321.3 3 length=150
FF,:FFFF,FFFFF:FF:,F::F,FF:,FFF,F,FFFFFF:::,F::FF,FFF:F,FF,,F,,,,:F:F:F,FFF,FFFFFFF:,FF,F:,F,F,,FF,:,F,,::,FF,,:,FFF,,FFF:,,F,,FF,F,:F,F,F,:FFF,,,F,F:
@4/1
GTAAGAAGCTCTTGTTACTTAATTGAACGTGGGCATTCGAATGTATCAACACTAGTGGGCCATTTTTGGTAAGCAGAACTGGCGATGCGGGATGAACCGAACGCGAGGTTAAGGTGCCAGAGTAGACGCTCATCAGACACCACAAAAGGA
+SRR9289321.4 4 length=150
FFFFFFFFFFF:F:FFF,:FFFFF:F:,F::FFFF:FF,FF::FFFF:,FFFFFF:FF:FFFF:FF:F:FFFF:FF:FFF,F,FFF,FFFFFFFFFF,FFF,FFFFFF:FFFF:F:FFF:FFF,,FFFFF::FFFFFFFFFFFFFFF,,,

What command did you issue?

funannotate train -i CA16.masked.fasta -o fun --left SRR9289321_1.fastq.gz --right SRR9289321_2.fastq.gz --species "Sarocladium aurantiaca" --isolate CA16 --stranded RF --jaccard_clip --cpus $CPU

Logfiles

Loading funannotate/1.8
Loading requirement: miniconda3/4.3.31 jellyfish/2.3.0 lp_solve/5.5 EVM/1.1.1-live genemarkESET/4.62_lic bamtools/2.4.0 hmmer/3 gmap/2019-03-04 bedtools/2.29.2 tbl2asn/25.8 boost/1.68.0 augustus/3.3.3 diamond/2.0.8 exonerate/2.2.0 ncbi-blast/2.2.31+ cdbfasta/06272015 htslib/1.11 samtools/1.11 braker/2.1.4 mummer/4.0 RAxML/8.2.12 mafft/7.427 trimal/1.4.1 java/13.0.1 trimmomatic/0.36 exonerate/2.4.1 blat/35 tRNAscan/1.3.1 proteinortho/6.0.20 kent/318 signalp/5.0b fasta/36.3.8h bowtie2/2.4.1 salmon/0.9.1 java/13 trinity-rnaseq/2.11.0 hisat2/2.2.1 kallisto/0.46.2 gmap/2017-11-15 PASA/2.4.1 minimap2/2.20 htslib/1.12 samtools/1.12 stringtie/2.1.1 CodingQuarry/2.0
[Aug 03 08:02 AM]: OS: CentOS Linux 7, 64 cores, ~ 528 GB RAM. Python: 3.8.5
[Aug 03 08:02 AM]: Running 1.8.8
[Aug 03 08:02 AM]: Adapter and Quality trimming PE reads with Trimmomatic

Traceback (most recent call last):
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/bin/funannotate", line 8, in
    sys.exit(main())
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/funannotate.py", line 705, in main
    mod.main(arguments)
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/train.py", line 980, in main
    lib.CheckFASTQandFix(trim_reads[0], trim_reads[1], cpus=args.cpus)
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/library.py", line 593, in CheckFASTQandFix
    for read1, read2 in zip(file1, file2):
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/Bio/SeqIO/QualityIO.py", line 950, in FastqGeneralIterator
    raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ

OS/Install Information

nextgenusfs commented 3 years ago

This method was modified/fixed for a similar issue in Sept 2020 (https://github.com/nextgenusfs/funannotate/commit/2a4674964e207ebc9885abe831153115b4e6766e). Please try the most recent version and let me know if you are getting the same error.

Cindy0805 commented 2 years ago

I have exactly the same error, using funannotate v1.8.9. I also used the suggested command:

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files

Please let me know how to solve this. Thanks!

nextgenusfs commented 2 years ago

If you have the same error as above, raise ValueError("Sequence and quality captions differ."), then the 3rd line of each record should not have a header, i.e. it should just be +. Hopefully that makes sense. You could fix the above with something like sed:

sed 's/+SRR9289321.3.*$/+/g'

Cindy0805 commented 2 years ago

Yes! Indeed the third line is incorrect. thanks! : )

teixeiramm commented 2 years ago

Hi Jon and Jason,

I am still having the same problem and was not able to fix my fastq files. I am trying to train a Sarocladium genome with mRNA fastq reads, and even after downloading the reads using the JS protocol I am still having trouble.

I am using the following command (funannotate 1.8):

funannotate train -i CA16.masked.fasta -o fun --left SRR9289321_1.fastq --right SRR9289321_2.fastq --species "Sarocladium aurantiaca" --isolate CA16 --stranded FR --jaccard_clip --cpus $CPU

My fastq reads look like this:

@1/1
NTCAAGAAATGTTGTCAATGCCGATGTCGAAGATAAAAAAGAAAAAACAGCGTTAGCTTGTCGTTTTGCATGTCCGTCAGATCCCAGAGAAAGCGAAGAAAAAGTCTATACAGCCCGAAGATGTAGATTTGTGCCGCTGTGCAAAATAGA
+SRR9289321.1 1 length=150
FFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,:FFFFF:FFFFFFFFFFFF:FFF:FFF:FFF,FFFFF,FFFFF::F::F:FFFFFF:FF:FF,FFF:F::FFFFFFFF:,FFFFFF:FFFFFFF:FFFFFFF,,:,:FFFF
@2/1
NTTCTCGGTATCACAGGCAAGGACTTCACCTTGATCGCCGCATCGAAAGCGGCGATGCGAGGTGCCACTATCCTTAAAGCTTCCGACGACAAGACAAGAGCCCTCAACAAGAACACTCTCCTGGCGTTCTCAGGAGAAGCCGGTGACACA
+SRR9289321.2 2 length=150
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFF
@3/1
CTAACATATTTAAAGTTTGAGAATGGATGAAGGCTAAATAGCGCCCCCGGGTCCCTAAACATTAGCTTTACCTCAAAAAACTGAGTTAAAAAATGCTATACTGAGAGATAGGAAGAGCACACGTCTGAACTCCACTCAAGGTACCATATC
+SRR9289321.3 3 length=150
FF,:FFFF,FFFFF:FF:,F::F,FF:,FFF,F,FFFFFF:::,F::FF,FFF:F,FF,,F,,,,:F:F:F,FFF,FFFFFFF:,FF,F:,F,F,,FF,:,F,,::,FF,,:,FFF,,FFF:,,F,,FF,F,:F,F,F,:FFF,,,F,F:

The error is still the same:

[Dec 08 02:42 PM]: Running 1.8.10
Traceback (most recent call last):
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/bin/funannotate", line 8, in
    sys.exit(main())
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/funannotate.py", line 710, in main
    mod.main(arguments)
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/train.py", line 980, in main
    lib.CheckFASTQandFix(trim_reads[0], trim_reads[1], cpus=args.cpus)
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/funannotate/library.py", line 593, in CheckFASTQandFix
    for read1, read2 in zip(file1, file2):
  File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/envs/funannotate-1.8/lib/python3.8/site-packages/Bio/SeqIO/QualityIO.py", line 950, in FastqGeneralIterator
    raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

I appreciate your help on this


-- Marcus Teixeira, Ph.D. Visiting Professor at University of Brasília (UnB) Room 40, Núcleo de Medicina Tropical, Campus Universitário Darcy Ribeiro, Asa Norte, Brasília-DF Brazil - ZIP 70910-900 *Affiliated Researcher at Northern Arizona University (NAU) Applied Research & Development Building Bldg 56, 3rd Floor 1298 S Knoles Drive, Flagstaff-AZ, USA - ZIP 86011-4073 email: @., @.***

nextgenusfs commented 2 years ago

I think it is the 3rd line or the second header for each record. So change the 3rd line to just a + with no other information.
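For readers who prefer not to hand-tune a sed pattern per accession, here is a minimal Python sketch of the same idea: reduce the third line of every record to a bare +. It assumes strict 4-line FASTQ records (no wrapped sequences); the function names strip_plus_line_captions and rewrite_fastq are hypothetical and are not part of funannotate. Whether this alone clears the error depends on the rest of the pipeline.

```python
import gzip
from typing import Iterable, Iterator


def strip_plus_line_captions(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with every separator line reduced to a bare '+'.

    Assumes strict 4-line records, so the separator is always the third
    line of each record (0-based index 2 within the record).
    """
    for index, line in enumerate(lines):
        if index % 4 == 2:
            if not line.startswith("+"):
                raise ValueError(f"line {index + 1} is not a '+' separator")
            yield "+\n"
        else:
            yield line


def rewrite_fastq(src: str, dst: str) -> None:
    """Hypothetical helper: copy src (plain or .gz) to dst with bare '+' lines."""
    opener = gzip.open if src.endswith(".gz") else open
    with opener(src, "rt") as fin, open(dst, "w") as fout:
        fout.writelines(strip_plus_line_captions(fin))
```

Working by line position rather than by pattern means a quality line that happens to start with + is left untouched.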

teixeiramm commented 2 years ago

Jon, I feel ashamed but could you please send me the sed command that would work for this? My skills are very limited. I appreciate it.


teixeiramm commented 2 years ago

you might try bbtools repair.sh https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/

Jason E Stajich, PhD Professor, Dept of Microbiology and Plant Pathology Chair, Academic Senate, Riverside Division University of California, Riverside

Fellow, CIFAR Fungal Kingdom: Threats and Opportunities https://www.cifar.ca/research/program/fungal-kingdom email: @.*** twitter: @stajichlab http://twitter.com/stajichlab @hyphaltip http://twitter.com/hyphaltip @zygolife http://twitter.com/zygolife website: http://lab.stajich.org office: +1 951.827.2363 mobile: +1 909.333.6709


nextgenusfs commented 2 years ago

Untested, but sed for this dataset might be:

sed 's/^+SRR9289321.*$/+/g' R1 > new_R1.fq
sed 's/^+SRR9289321.*$/+/g' R2 > new_R2.fq

teixeiramm commented 2 years ago

The read files look like this now. But still having the same error. 😶

@1/1
NTCAAGAAATGTTGTCAATGCCGATGTCGAAGATAAAAAAGAAAAAACAGCGTTAGCTTGTCGTTTTGCATGTCCGTCAGATCCCAGAGAAAGCGAAGAAAAAGTCTATACAGCCCGAAGATGTAGATTTGTGCCGCTGTGCAAAATAGA
+
FFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,:FFFFF:FFFFFFFFFFFF:FFF:FFF:FFF,FFFFF,FFFFF::F::F:FFFFFF:FF:FF,FFF:F::FFFFFFFF:,FFFFFF:FFFFFFF:FFFFFFF,,:,:FFFF
@2/1
NTTCTCGGTATCACAGGCAAGGACTTCACCTTGATCGCCGCATCGAAAGCGGCGATGCGAGGTGCCACTATCCTTAAAGCTTCCGACGACAAGACAAGAGCCCTCAACAAGAACACTCTCCTGGCGTTCTCAGGAGAAGCCGGTGACACA
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFF
@3/1
CTAACATATTTAAAGTTTGAGAATGGATGAAGGCTAAATAGCGCCCCCGGGTCCCTAAACATTAGCTTTACCTCAAAAAACTGAGTTAAAAAATGCTATACTGAGAGATAGGAAGAGCACACGTCTGAACTCCACTCAAGGTACCATATC
+
FF,:FFFF,FFFFF:FF:,F::F,FF:,FFF,F,FFFFFF:::,F::FF,FFF:F,FF,,F,,,,:F:F:F,FFF,FFFFFFF:,FF,F:,F,F,,FF,:,F,,::,FF,,:,FFF,,FFF:,,F,,FF,F,:F,F,F,:FFF,,,F,F:


Dikaryotic commented 2 years ago

Hey Jon,

I'm having the same problem as this. I downloaded from SRA using the command on the Trinity site that you suggested. I'm using the latest version of Funannotate and I even tried the sed command you suggested above "sed 's/^+SRR9289321.*$/+/g' R1 > new_R1.fq" modified for my dataset of course. My final files look like the above as well, with the third line header a simple + symbol. I also tried the suggestion from Jason to use BBtools but that didn't work either.

I know the last message on here was almost a year ago! So maybe this got resolved and never updated. But hey, I thought I'd ask. Any thoughts on this?

teixeiramm commented 2 years ago

I am still having issues with this.


hyphaltip commented 2 years ago

Can you repost how you are doing this, to make it more reproducible? The SRA file perhaps, and the command? I think a concerted effort to get these headers fixed in a systematic way needs a little attention. Or is it just the stuff at the top?

Dikaryotic commented 2 years ago

SRA download command:

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR13498932

The raw output from that doesn't work. Following which I tried BBtools repair.sh, which resulted in output that looked like this but still didn't work:

@1/1
CAAAAAGCAAGCAAGCAAGCACGGGACGAATAAATGCATGTAAACTACTACGCGGCGGGGGGAGGGGTGAAATCATAAAGTCGTGACGTGCGAGGAGGTATGGCGTGCATGCGGTGCGACAGGACGCGCGTCCAGCCAGAGGCGAGACGA
+1/1
AAFFFJJJJJJJJJJJJ-JJJ-JJ77J7JJJ7JJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJ7JJJJJJJJJ-JJJJ-J7JJJ7JJJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJJJJJJ7JJJJJJ-J7JJ7
@2/1
TAGCTCCATAACTTGTCCAACCCTGTCATTGGCCTCGCGCACTCGCCGCATTTCCTCGTCTAGTGTTCGCACTCTACTGCCCACGCGGACAGCAGTTCCTGCCGTGCTTGAGACCTTGTCTGAGACTAGAAAGGCTTCATGGTGCAAGTC
+2/1
AAFFFJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJ-JJ-JJJJJJJJ7JJ7JJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ7JJJJJ7JJJJJJJJJJJJJJ7JJJ7JJJJJ-JJJJ-JJJJJJJJJJJJJ7JJJJJJ7JJJ
@3/1

Then I tried:

sed 's/^+SRR13498932.*$/+/g' R1 > new_R1.fq
sed 's/^+SRR13498932.*$/+/g' R2 > new_R2.fq

Which resulted in output that looked like this but still didn't work.

head new_R1.fastq
@1/1
CAAAAAGCAAGCAAGCAAGCACGGGACGAATAAATGCATGTAAACTACTACGCGGCGGGGGGAGGGGTGAAATCATAAAGTCGTGACGTGCGAGGAGGTATGGCGTGCATGCGGTGCGACAGGACGCGCGTCCAGCCAGAGGCGAGACGA
+
AAFFFJJJJJJJJJJJJ-JJJ-JJ77J7JJJ7JJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJ7JJJJJJJJJ-JJJJ-J7JJJ7JJJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJJJJJJ7JJJJJJ-J7JJ7
@2/1
TAGCTCCATAACTTGTCCAACCCTGTCATTGGCCTCGCGCACTCGCCGCATTTCCTCGTCTAGTGTTCGCACTCTACTGCCCACGCGGACAGCAGTTCCTGCCGTGCTTGAGACCTTGTCTGAGACTAGAAAGGCTTCATGGTGCAAGTC
+
AAFFFJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJ-JJ-JJJJJJJJ7JJ7JJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ7JJJJJ7JJJJJJJJJJJJJJ7JJJ7JJJJJ-JJJJ-JJJJJJJJJJJJJ7JJJJJJ7JJJ
@3/1
GGGATGCCTGGCATGGACGGCAGGGGTGGCGCCGACGCTAGCCAGTTCATGCGCGACGTCGCCGTGGTTATGCGCCGCCGGCACGACGACGGCGAGCTGACATCACAGCAGATCGAGGAGATCATTGCCGGGCTCCCGCGGCTGTCGGAG

nextgenusfs commented 2 years ago

It's probably an SRA issue -- just get the raw data from EBI:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR134/032/SRR13498932/SRR13498932_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR134/032/SRR13498932/SRR13498932_2.fastq.gz

BTW this tool will save you a lot of time... https://sra-explorer.info/#

I suspect the raw data will work right away, unless the headers have already been modified to some non-standard Illumina format.

Dikaryotic commented 2 years ago

That appears to have done the job, thanks man!

----Edit----

I spoke too soon. I'm now getting this from the funannotate-train.log

R1 header: SRR13498932.1 1/1 and R2 header: SRR13498932.1 1/1 are not 1 and 2 as expected
[08/31/22 22:24:39]: ERROR: FASTQ headers are not properly paired, see logfile and reformat your FASTQ headers

It threw that with the raw data from EBI that you suggested. I then tried repairing with BBTools repair.sh and still get the same error. Here is what the repaired fastq files look like:

head SRR13498932_repaired
==> SRR13498932_1.repaired.fastq <==
@SRR13498932.1 1/1
CAAAAAGCAAGCAAGCAAGCACGGGACGAATAAATGCATGTAAACTACTACGCGGCGGGGGGAGGGGTGAAATCATAAAGTCGTGACGTGCGAGGAGGTATGGCGTGCATGCGGTGCGACAGGACGCGCGTCCAGCCAGAGGCGAGACGA
+
AAFFFJJJJJJJJJJJJ-JJJ-JJ77J7JJJ7JJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJ7JJJJJJJJJ-JJJJ-J7JJJ7JJJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJJJJJJ7JJJJJJ-J7JJ7
@SRR13498932.2 2/1
TAGCTCCATAACTTGTCCAACCCTGTCATTGGCCTCGCGCACTCGCCGCATTTCCTCGTCTAGTGTTCGCACTCTACTGCCCACGCGGACAGCAGTTCCTGCCGTGCTTGAGACCTTGTCTGAGACTAGAAAGGCTTCATGGTGCAAGTC
+
AAFFFJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJ-JJ-JJJJJJJJ7JJ7JJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ7JJJJJ7JJJJJJJJJJJJJJ7JJJ7JJJJJ-JJJJ-JJJJJJJJJJJJJ7JJJJJJ7JJJ
@SRR13498932.3 3/1
GGGATGCCTGGCATGGACGGCAGGGGTGGCGCCGACGCTAGCCAGTTCATGCGCGACGTCGCCGTGGTTATGCGCCGCCGGCACGACGACGGCGAGCTGACATCACAGCAGATCGAGGAGATCATTGCCGGGCTCCCGCGGCTGTCGGAG

==> SRR13498932_2.repaired.fastq <==
@SRR13498932.1 1/2
NACCGCCCGCGCCCGTAGCGAGCACGATCGTGCACCTGAGCGTGCGCGCGGGGCCGCCGCCCGCGGAAGACGGGTGGACGAAGACAAGGAGGACGAAGGCCGCGGCGGCGGAGGCGGGCGCGGGCGCGGGGGAGGTGGGCGCCGAGGGCG
+
!AFFFJ-JJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ-J7JJJJ77JJ-JJJJJ-J-J-JJJJ7JJJJJ7JJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJ
@SRR13498932.2 2/2
NTTTACCTCAGATTCTCTCTGCGCTTTCGTCGCTTGAATCAGAAGAGGCCGAGTTATCGGCATCATTGCAGGATCTGCTCGCGTCTCAGGATCCCATCCTTGCGTCTCTTTCTCGGCTCCAGGACCTATCGCCTCGACTAGATGACTTGC
+
!AFFFJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJ
@SRR13498932.3 3/2

nextgenusfs commented 2 years ago

How about number of reads in each file? Oh, that's too bad that EBI didn't keep the original Illumina headers.
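One way to answer the read-count question without extra tools is a short Python check like the sketch below. It assumes strict 4-line FASTQ records and accepts plain or gzip-compressed files; count_fastq_records and same_read_count are hypothetical names used only for illustration.

```python
import gzip


def count_fastq_records(path: str) -> int:
    """Count records in a FASTQ file, assuming strict 4-line records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def same_read_count(left: str, right: str) -> bool:
    """True if the left and right read files hold the same number of records."""
    return count_fastq_records(left) == count_fastq_records(right)
```

Equal counts only show that no reads were dropped from one side; they do not show that the headers are paired in the way the pipeline expects.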

Dikaryotic commented 2 years ago

They look like they've got the same number of reads in each. Here is the tail:

==> SRR13498932_1.repaired.fastq <==
+
<AA<FF--FFF-7-F77FFF-7F77-7FFFF77FFFFFF-FF-F7F-FFFFFFFF7FFFFF-F7-F7F-FFFF-F7-7-FFFF-FFFF-F7FFFFF77FF77F-FFFFFFFFFF7FFF-FFFFFFFFFFF--FFFFF<<FFFFFF-FFFF
@SRR13498932.31908094 31908094/1
TTGCTATCCAATGAGGCATAACCTCACCTACGCTCAACGACATCCCCGACAGACGTGTTTTGACCGAGAGAACGGTCTTATATCCTCGATCGCCAAAGACTATCGGGAACAGCTCGCTCAAGATGTCTACGACGCTGGATTATGCCTCAT
+
AAAFFFFFFFFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFF-FFFFFFFFFFFFF7FFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR13498932.31908095 31908095/1
CTGAGGTATAGGACAGCCTTCCACTCGATCCAGTTGAGCGGGGTAATCGCGAACAGCGCGGTGAAGACAGGGATATACAAGATGGCAAAGTGAAGACCCATGCTCAGCGCGATCGCAGCGACGAGGAAGGGGTTCTTCCACAGCGGGAGG
+
AAAFFFF7FFFFF7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF7FFFFFFFFFFFFFFFFFFFFFF

==> SRR13498932_2.repaired.fastq <==
+
-A-<FJJ-JJ7JJJJJJ7-JJ-JJ7--JJJ7JJJJJ7J7JJ-77J7-J--JJJJJJ-7JJ-7-JJJ-JJ-JJJJJJJJJ7-----7JJJJ-J-7J-JJ-7JJ-JJJJ-7-J7J-JJ--JJJJ7-JJJJ--7J7JJJJJ-JJ7JJ--J-JJ
@SRR13498932.31908094 31908094/2
GCAGGGGCTTTCGTCGAGATGGAGGAAAATCCGGGTGGTGGAGGTAGACTCCGGTACTTGCTATCCAATGAGGCATAATCCAGCGTCGTAGACATCTTGAGCGAGCTGTTCCCGATAGTCTTTGGCGATCGAGGATATAAGACCGTTCTC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJ
@SRR13498932.31908095 31908095/2
CCAACTGACGCACTTCCACCCATGTGCCTCGCAATTCCCCGAGATTGGCTGTGAGATGTTCACGAACGCGATGGCGCAGCGCGCGACGACTGTCAGCCTGTCCATCCTCGTCACCGTCGAAATGTTCAACGCTATGAACTCGCTCTCCGA
+
AAFFFJJ-JJJJJJ7JJJJJ-JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJJJJJJJJJJJJJJJJJJJ7J7JJJJJ-JJJ7JJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJ

nextgenusfs commented 2 years ago

Okay, sorry this is not working great. It's a requirement of Trinity and not of funannotate, so I tried to parse the headers early on to ensure it will work. I can of course replicate this locally:

# download reads
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR134/032/SRR13498932/SRR13498932_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR134/032/SRR13498932/SRR13498932_2.fastq.gz

# check that all is well 
$ seqkit stats SRR13498932_*
file                    format  type    num_seqs        sum_len  min_len  avg_len  max_len
SRR13498932_1.fastq.gz  FASTQ   DNA   31,908,095  4,786,214,250      150      150      150
SRR13498932_2.fastq.gz  FASTQ   DNA   31,908,095  4,786,214,250      150      150      150

# subsampled to 10k reads and try in funannotate train
$ funannotate train -i test.softmasked.fa --stranded RF -l ../Downloads/SRR13498932x_1.fastq.gz -r ../Downloads/SRR13498932x_2.fastq.gz -o fake_test --cpus 6
[09/01/22 09:08:20]: Quality trimmed reads: ('fake_test/training/trimmomatic/trimmed_left.fastq.gz', 'fake_test/training/trimmomatic/trimmed_right.fastq.gz', None)
[09/01/22 09:08:20]: R1 header: SRR13498932.1 1/1 and R2 header: SRR13498932.1 1/2 are not 1 and 2 as expected
[09/01/22 09:08:20]: ERROR: FASTQ headers are not properly paired, see logfile and reformat your FASTQ headers

So the logic in the CheckFASTQandFix function is:

  1. If there is a space in the header, then the next character after the first space has to be 1 (forward reads) or 2 (reverse reads). Note this is standard Illumina raw output.
  2. If there is no space, then the header must end with /1 and /2.

So what is happening here is that there is a space and it ends with /1 or /2 -- this will apparently fail in Trinity.... So we need to make the headers look like one of these two options.
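The two rules can be sketched in Python as follows. This is only an illustration of the logic as described in this comment, not funannotate's actual code; mate_number and properly_paired are hypothetical names.

```python
from typing import Optional


def mate_number(header: str) -> Optional[str]:
    """Return '1' or '2' if the header identifies its mate under the two rules, else None."""
    title = header.rstrip("\n")
    if title.startswith("@"):
        title = title[1:]
    if " " in title:
        # Rule 1: the character right after the first space must be 1 or 2.
        first = title.split(" ", 1)[1][:1]
        return first if first in ("1", "2") else None
    # Rule 2: no space, so the header must end with /1 or /2.
    if title.endswith("/1"):
        return "1"
    if title.endswith("/2"):
        return "2"
    return None


def properly_paired(r1_header: str, r2_header: str) -> bool:
    """True if R1 resolves to mate 1 and R2 resolves to mate 2."""
    return mate_number(r1_header) == "1" and mate_number(r2_header) == "2"
```

Under this sketch, the EBI pair "SRR13498932.1 1/1" and "SRR13498932.1 1/2" fails, because rule 1 reads the character after the space (the read number 1) for both files, which matches the "are not 1 and 2 as expected" message in the log above.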

Headers look like this from the EBI:

@SRR13498932.1 1/1

So we can fix these with sed to remove the space and the duplicated read number, ie something like:

pigz -dc SRR13498932x_1.fastq.gz | sed 's,[[:space:]][[:digit:]]*,,g' | pigz > left_reads.fastq.gz
pigz -dc SRR13498932x_2.fastq.gz | sed 's,[[:space:]][[:digit:]]*,,g' | pigz > right_reads.fastq.gz

Now the headers look like this:

@SRR13498932.1/1
CAAAAAGCAAGCAAGCAAGCACGGGACGAATAAATGCATGTAAACTACTACGCGGCGGGGGGAGGGGTGAAATCATAAAGTCGTGACGTGCGAGGAGGTATGGCGTGCATGCGGTGCGACAGGACGCGCGTCCAGCCAGAGGCGAGACGA
+
AAFFFJJJJJJJJJJJJ-JJJ-JJ77J7JJJ7JJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJJJJJJJ7JJJJJJJ7JJJJJJJJJ-JJJJ-J7JJJ7JJJJJJJJ7JJJJJJJJJJJ-JJJJJJJJJJJJJJJJ7JJJJJJ-J7JJ7

@SRR13498932.1/2
NACCGCCCGCGCCCGTAGCGAGCACGATCGTGCACCTGAGCGTGCGCGCGGGGCCGCCGCCCGCGGAAGACGGGTGGACGAAGACAAGGAGGACGAAGGCCGCGGCGGCGGAGGCGGGCGCGGGCGCGGGGGAGGTGGGCGCCGAGGGCG
+
#AFFFJ-JJJJJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJ-J7JJJJ77JJ-JJJJJ-J-J-JJJJ7JJJJJ7JJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-JJJJJ
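The same header rewrite can be sketched in Python, restricted to header lines only. This assumes strict 4-line records; fix_header and fix_headers are hypothetical names, and the pattern mirrors the sed expression above (a space or tab followed by any run of digits).

```python
import re
from typing import Iterable, Iterator

# A space or tab followed by any run of digits, as in the sed expression.
_SPACE_THEN_DIGITS = re.compile(r"[ \t][0-9]*")


def fix_header(header_line: str) -> str:
    """Turn '@SRR13498932.1 1/1' into '@SRR13498932.1/1'."""
    return _SPACE_THEN_DIGITS.sub("", header_line)


def fix_headers(lines: Iterable[str]) -> Iterator[str]:
    """Rewrite only the first line of each 4-line record; pass the rest through."""
    for index, line in enumerate(lines):
        yield fix_header(line) if index % 4 == 0 else line
```

Limiting the substitution to header lines is a small safety margin over running the pattern on every line, although sequence and quality lines should not contain whitespace in the first place.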

You also need to remove the temp fastq files in your output directory; if they already exist they will be re-used.

nextgenusfs commented 2 years ago

The fastq-dump method should work -- the error there is actually funannotate's problem, as I'm using biopython to parse the fastq files and it apparently chokes when there is a quality score header. So we should just replace the FASTQ parser here and that will be fixed -- I'll put this on my list for future fixes.
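As an illustration of what a more tolerant parser could look like, here is a minimal sketch that ignores whatever follows the + on the separator line. It is not the replacement that ended up in funannotate, the name read_fastq is hypothetical, and it assumes strict 4-line records. As far as I understand, Biopython's FastqGeneralIterator raises "Sequence and quality captions differ." when the text after + is non-empty and differs from the text after @, which is the case for the fastq-dump output shown earlier in this thread.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (title, sequence, quality) tuples, ignoring any caption after '+'."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        try:
            seq = next(it).rstrip("\n")
            plus = next(it)
            qual = next(it).rstrip("\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not plus.startswith("+"):
            raise ValueError("third line of record does not start with '+'")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual
```

The only checks kept are structural ones (the @ and + markers and equal sequence and quality lengths), so records with or without a caption on the separator line are read the same way.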

Dikaryotic commented 2 years ago

That's done it, it's progressed through to read normalization now. Thank you so much for taking the time to dig into this and figure it out.