hsgweon / pipits

Automated pipeline for analyses of fungal ITS from the Illumina
GNU General Public License v3.0

Files not merged after pispino_seqprep #53

Closed bea95dc closed 1 year ago

bea95dc commented 2 years ago

Hi!

First, thanks for developing this tool. I think it will be brilliant if I can get it to work.

I am processing more than 2000 PE FASTQ files and everything was working until the step of merging and producing the final out_seqprep/prepped.fasta file after running pispino_seqprep. This is the error I get:

2022-06-29 17:36:22 ... done
2022-06-29 17:36:22 Joining paired-end reads [VSEARCH]
2022-06-29 17:36:22 Joining with VSEARCH.
vsearch v2.21.1_linux_x86_64, 251.8GB RAM, 64 cores
https://github.com/torognes/vsearch

Merging reads

Fatal error: Invalid line 3 in FASTQ file: '+' line must be empty or identical to header
2022-06-29 17:36:22 Error: None zero returncode: vsearch --fastq_mergepairs ../out-seqprep/tmp/reindex_fastq_F/ERR3280518.fastq --reverse ../out-seqprep/tmp/reindex_fastq_R/ERR3280518.fastq --fastqout ../out-seqprep/tmp/joined/ERR3280518.fastq --threads 1 --fastq_allowmergestagger --fastq_maxdiffs 500 --fastq_minovlen 20 --fastq_minmergelen 100

I've checked the tmp files and realised that, in the process, the program has changed the headers of the FASTQ files, so the "@" and "+" lines no longer match, and vsearch cannot merge the files when that is the case. Do you know what has gone wrong or how to fix it? The original files didn't have this problem.

Hope you can help, thanks very much!
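
For anyone hitting the same vsearch error, the check it describes can be reproduced outside the pipeline. The following is a minimal sketch, not part of PIPITS: the function name `find_bad_plus_lines` is hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence lines). It reports every record whose "+" line is neither empty nor a repeat of the header.

```python
import io
from typing import Iterator, TextIO, Tuple


def find_bad_plus_lines(handle: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, header) for each record whose '+' line is
    neither bare nor identical to the header text.

    Assumes strict 4-line FASTQ records; this is an illustrative helper,
    not something PIPITS or vsearch provides.
    """
    lines = [line.rstrip("\n") for line in handle]
    for start in range(0, len(lines) - 3, 4):
        header, plus = lines[start], lines[start + 2]
        # Valid separators: "+" alone, or "+" followed by the header
        # without its leading "@".
        if plus != "+" and plus != "+" + header[1:]:
            yield start + 3, header  # 1-based line number of the '+' line


# Small in-memory demonstration: the second record's header was renamed
# but its '+' line still carries the old name.
sample = (
    "@read1 extra\nACGT\n+read1 extra\nIIII\n"
    "@read2_renamed\nACGT\n+read2\nIIII\n"
)
for line_number, header in find_bad_plus_lines(io.StringIO(sample)):
    print(f"line {line_number}: '+' line does not match header {header}")
```

Running it on the files under the tmp/reindex_fastq_F and tmp/reindex_fastq_R directories named in the error should show whether every record is affected or only some.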

hsgweon commented 2 years ago

@bea95dc Sorry about the delay in responding. Can you try using a smaller number (a subset) of sequences and see if that works, to locate the sequences that may be causing this? Or why don't you send me the file in a zipped format (provided that it's not huge!) so that I can have a go? My email details should be visible on the main page.
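
One way to build such a subset is to copy the first few records from each file of a read pair. This is a rough sketch under assumptions: `head_fastq` is a hypothetical helper, records are assumed to be exactly four lines each, and forward and reverse files are assumed to list reads in the same order so that taking the same number from each keeps the pairs in sync.

```python
import io
from itertools import islice
from typing import TextIO


def head_fastq(src: TextIO, dst: TextIO, n_records: int) -> int:
    """Copy the first n_records 4-line FASTQ records from src to dst.

    Returns the number of complete records written. Hypothetical helper
    for building a small test set; not part of PIPITS.
    """
    lines = list(islice(src, 4 * n_records))
    # Drop a trailing partial record if the input ended mid-record.
    complete = len(lines) - (len(lines) % 4)
    dst.writelines(lines[:complete])
    return complete // 4


# In-memory demonstration with three records, keeping the first two.
forward = io.StringIO("@a\nAC\n+\nII\n@b\nGT\n+\nII\n@c\nTT\n+\nII\n")
subset = io.StringIO()
written = head_fastq(forward, subset, 2)
print(written, "records kept")
```

With real data the same call would be made with open file handles for the forward and the reverse file, using the same `n_records` for both.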

bea95dc commented 2 years ago

Hi @hsgweon, thanks for your reply! I tried a different, smaller set of FASTQ sequences and the problem resolved itself! I will try to run PIPITS on smaller sample sets to avoid this problem. However, I'm now encountering another problem: I am running pipits_funits with the prepped.fasta and I get no ITS sequences in the resulting FASTA file. I don't get any errors either, so I'm not sure what is going wrong. Should I log another issue for this problem and close this one? Thanks very much!

hsgweon commented 2 years ago

@bea95dc Can you send me some sequences from your prepped.fasta?

bea95dc commented 2 years ago

Hi @hsgweon

Sorry for my late reply. Here you have the output of "head prepped.fasta":

ASITSA10_1
GTAGGTGAACCTGCGGAAGGATCATTACAGTATTCTTTTTGCCAGCGCTTAATTGCGCGGCGAAAAAACCTTACACACAGTGTTTTTTGTTATTACAAGAACTTTTGCTTTGGTCTGGACTAGAAATAGTTTGGGCCAGAGGTTTACTGAACTAAACTTCAATATTTATATTGAATTGTTATTTATTTAATTGTCAATTTGTTGATTAAATTCAAAAAATCTTCAAAACTTTCAACAACG
ASITSA10_5
GTAGGTGAACCTGCGGAAGGATCATTACTAGAGCAAAGGATAGGCAGCGCCCCACCGAAGCTTGCTTCGTGGGGTGTCGAGCCGTCGACCCTCTCGGAGAAGGTCGGTCCTGAACTCCACCCTTGAATAAATTACCTTTGTTGCTTTGTCGGGCCGCCTCGCGCCAGCGGCTTCGGCTGTTGAGTGCCCGCCAGAGGACCACAACTCTTGTTTTTAGTGATGTCTGAGTACTATATAATAGTTAAAACTTTCAACAACG
ASITSA10_7
GTAGGTGAACCTGCGGAAGGATCATTACCTAGAGTTTGTAGACTTCGGTCTGCTACCTCTTACCCATGTCTTTTGAGTACCTTCGTTTCCTCGGCGGGTCCGCCCGCCGATTGGACAACATTCAAACCCTTTGCAGTTGCAATCAGCGTCTGAAAAAACATAATAGTTACAACTTTCAACAACG
ASITSA10_8
GTAGGTGAACCTGCGGAAGGATCATTACCTAGAGTTTGTAGACTTCGGTCTGCTACCTCTTACCCATGTCTTTTGAGTACCTTCGTTTCCTCGGCGGGTCCGCCCGCCGATTGGACAACATTCAAACCCTTTGCAGTTGCAATCATCGTCTGAAAAAACATAATAGTTACAACTTTCAACAACG
ASITSA10_9
GTAGGTGAACCTGCGGAAGGATCATTACCTAGAGTTTGTAGACTTCGGTCTGCTACCTCTTACCCATGTCTTTTGAGTACCTTCGTTTCCTCGGCGGGTCCGCCCGCCGATTGGACAACATTCAAACCCTTTGCAGTTGCAATCAGCGTCTGAAAAAACATAATAGTTACAACTTTCAACAACG

Hope you can help me with it!

I also have a question for you, related to my original problem with pispino_seqprep: does PIPITS work with FASTQ files where the "+" line of the input is not empty? I've realised that your test FASTQ files, and the files in my dataset that have worked, have nothing after the "+" on that line, whereas the files that don't work repeat the "@" header on the "+" line. I know it is no longer customary to repeat the header there, but some of my sequences come from the SRA database. Thanks so much for your help!
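
Until that question is answered, one possible workaround is to blank the "+" lines before running the pipeline, so that later header changes cannot leave them out of sync. The sketch below is an assumption-laden illustration: `strip_plus_comments` is a hypothetical helper, it assumes strict four-line records, and nothing in this thread confirms that it resolves the pispino_seqprep failure.

```python
import io
from typing import TextIO


def strip_plus_comments(src: TextIO, dst: TextIO) -> int:
    """Rewrite the third line of every 4-line FASTQ record as a bare '+'.

    Returns how many '+' lines were changed. Raises ValueError if a
    separator line does not start with '+', which suggests the input is
    not in strict 4-line layout. Illustrative only; not part of PIPITS.
    """
    changed = 0
    for index, line in enumerate(src):
        if index % 4 == 2:
            if not line.startswith("+"):
                raise ValueError(f"line {index + 1} is not a '+' separator")
            if line.rstrip("\n") != "+":
                line = "+\n"
                changed += 1
        dst.write(line)
    return changed


# In-memory demonstration: the first record repeats its header on the
# '+' line, the second already has a bare '+'.
raw = io.StringIO("@r1 len=4\nACGT\n+r1 len=4\nIIII\n@r2\nTTTT\n+\nIIII\n")
cleaned = io.StringIO()
print(strip_plus_comments(raw, cleaned), "separator line(s) rewritten")
```

Sequence and quality lines are copied through untouched, so the reads themselves are unchanged.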

hsgweon commented 2 years ago

Can you actually email me your FASTA file (you can send me a subset) rather than copying and pasting? (What you copied and pasted above is not in FASTA format, BTW.)

bea95dc commented 2 years ago

Sorry, the “>” start disappeared after pasting it in the comment. I have attached here the first 500 sequences of the file (it contains millions, that’s the reason for only picking those).

Thanks for your help!
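
For reference, a subset like this can be cut on record boundaries rather than by line count, which matters if any sequence spans several lines. A minimal sketch, assuming a standard FASTA file in which each record starts with a ">" header; `head_fasta` is a hypothetical name.

```python
import io
from typing import Iterable, Iterator


def head_fasta(lines: Iterable[str], n_records: int) -> Iterator[str]:
    """Yield the lines of the first n_records FASTA records.

    A record starts at a line beginning with '>' and runs until the next
    such line, so multi-line sequences are kept whole. Illustrative
    helper; not part of PIPITS.
    """
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n_records:
                return
        if seen:  # ignore anything before the first header
            yield line


# In-memory demonstration: keep two of three records, the first of which
# has its sequence wrapped over two lines.
fasta = io.StringIO(">a\nAC\nGT\n>b\nTT\n>c\nGG\n")
print("".join(head_fasta(fasta, 2)), end="")
```

With a real file, the same function can be fed an open file handle and its output written to a new file with `writelines`.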


hsgweon commented 2 years ago

Can you please send the file to my email address which you can find in the main page (https://github.com/hsgweon)?