jpuritz / dDocent

a bash pipeline for RAD sequencing
ddocent.com
MIT License

gzip and core dumping #39

Closed pdimens closed 6 years ago

pdimens commented 6 years ago

I needed to run some older Illumina reads through dDocent, and I am getting these two specific errors that are preventing the pipeline from completing:

  1. The first appears when trimming starts:

    Trimming reads and simultaneously assembling reference sequences
    Removing the _1 character and replacing with /1 in the name of every sequence
    gzip: FRS_001.F.fq.gz: unexpected end of file
    gzip: FRS_002.F.fq.gz: unexpected end of file
    gzip: FRS_003.F.fq.gz: unexpected end of file
    gzip: FRS_004.F.fq.gz: unexpected end of file

It's unclear why this is happening, as the files were compressed with gzip using the default settings. The trim logs seem to indicate that trimming proceeds to completion.

2. This may be related to the first issue: after trimming completes and I enter the assembly parameters following the gnuplot prompts, this `rainbow` error appears and shuts the whole process down:

    Now sit back, relax, and wait for your analysis to finish
    /home/saillantslab/miniconda3/envs/ddocent/bin/dDocent: line 841: 1269 Segmentation fault (core dumped) rainbow div -i rcluster -o rbdiv.out -f 0.5 -K 10
    /home/saillantslab/miniconda3/envs/ddocent/bin/dDocent: line 882: / 100 + 1: syntax error: operand expected (error token is "/ 100 + 1")

After the process shuts down, here is a list of what's left in the working directory, with filesizes on the left:
4096 unpaired
0  trim.log
1027337 rcluster.gz
1040 assemble.trim.log
109509 uniq.F.fasta.gz
1104453  totaluniqseq.gz
1112 cdhit.log
1255 FRS_003.trim.log
1256 FRS_001.trim.log
1256 FRS_002.trim.log
1256 FRS_004.trim.log
1264501 uniq.fasta.gz
136 sort.contig.cluster.ids
164 uniqseq.data
215 xxx
24 uniqseq.peri.data
30 rbdiv.out.gz
324 lengths.txt
32 namelist
4373229 contig.cluster.totaluniqseq
43948224 FRS_003.uniq.seqs
4415013 uniq.k.4.c.2.seqs
44420402 FRS_002.uniq.seqs
46093742 FRS_001.uniq.seqs
475 dDocent.runs
50114803  FRS_004.uniq.seqs
50720656  FRS_003.R2.fq.gz
50840802  uniq.seqs.gz
51329776  FRS_003.R.fq.gz
51936612  FRS_002.R2.fq.gz
5234827 uniq.full.fasta
52537196 FRS_002.R.fq.gz
58284180 FRS_001.R2.fq.gz
58939121 FRS_001.R.fq.gz
60670176 FRS_003.R1.fq.gz
61136939 FRS_003.F.fq.gz
62141619 FRS_002.R1.fq.gz
62677396 FRS_002.F.fq.gz
63877600 FRS_004.R2.fq.gz
64686395 FRS_004.R.fq.gz
69506759 FRS_001.R1.fq.gz
70096018 FRS_001.F.fq.gz
76271641 FRS_004.R1.fq.gz
76923721 FRS_004.F.fq.gz
890 xxx.clstr
9863900 uniqCperindv
9939 dDocent_main.LOG

To be honest, this is old data that someone else preprocessed years ago to remove UMI elements, so it's unclear to me whether the issue is with the software or with the inputs I am giving dDocent. For the sake of being thorough, here is what the input FASTQ files look like:

    @cluster_1562 CATCTCCT
    ATGAAGGGAACTACATTTCCCATATTTCATGAAAAGAGTGGGTGAGCATGATGTTTTCACACCAACTTTCAGGTGTCGTTC
    +
    ?BB@?A:3@??A=>@EEB???=>EEBA@BB@DC===AAB;@AB=B:7B@<9@DEEBA636?:A=DEBA:A@9@A=0EB?
    @cluster_1579 GATATGGT
    TTGCGAAGCATCTAGTATTGTCACACTCCGTTACTCAACACTATGTATGATGCGCTTTTCTGTGATATCTCGTGGTACTCTTTTT
    +
    A->:03<,4.5@1692=2/119:=400/4B0/592<4=.26244.55/1)/B74;3;?@4:1/1-94D.<A6/;3244BDD@C



Any insight as to where the issue stems from would be appreciated. Thank you!
jpuritz commented 6 years ago

Hi Pavel,

This is likely two problems with the input. Something is causing gzip to think there is an unexpected end of file. This could be a hidden character or some other error created during the initial file creation. Not sure.
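
One way to narrow down the gzip problem is to test each archive on its own, outside the pipeline. The sketch below relies on `gzip -t`, which decompresses a file without writing output and exits non-zero if the stream is truncated or corrupt. The function name `check_gzip` is a hypothetical helper, not part of dDocent.

```shell
# Report, for each file given, whether it passes gzip's integrity test.
# check_gzip is a hypothetical helper name, not a dDocent function.
check_gzip() {
    status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "BROKEN  $f"
            status=1
        fi
    done
    # Non-zero return means at least one file failed the test.
    return "$status"
}

# Example (file pattern taken from the listing in this issue):
# check_gzip *.F.fq.gz
```

If the original input files are reported as broken here too, the problem was likely introduced when they were created or copied, and re-compressing from the uncompressed source (if it still exists) would be the thing to try.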

Rainbow is giving you an error because it appears that these reads were trimmed and are not of uniform length, which both dDocent and rainbow require for assembly.
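
To confirm whether the reads really differ in length, a quick tally of sequence lengths per file should be enough. This is a minimal sketch that assumes standard four-line FASTQ records (sequence on the second line of every four); `read_lengths` is a hypothetical helper name.

```shell
# Print a "length count" table of read lengths in a gzipped FASTQ file.
# Assumes four-line FASTQ records; read_lengths is a hypothetical helper name.
read_lengths() {
    gzip -cd "$1" |
        awk 'NR % 4 == 2 { n[length($0)]++ }
             END { for (l in n) print l, n[l] }' |
        sort -n
}

# Example (file name taken from the listing in this issue):
# read_lengths FRS_001.F.fq.gz
```

A single output line means every read has the same length. More than one line means the reads are not uniform.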

Hope that helps,

Jon

