nanoporetech / pinfish

Tools to annotate genomes using long read transcriptomics data

Polish Cluster stops midway #27

Closed KaushikPanda1 closed 3 years ago

KaushikPanda1 commented 4 years ago

Hi, the pipeline stops midway at the polish clusters step. It does not error out and is not killed; it simply hangs at a random point after polishing some clusters. If I run just the polish clusters step multiple times, it hangs at different clusters, so there should not be anything inherently wrong with any particular cluster.

A few observations:

1) The pipeline runs successfully for a smaller genome but has issues with a bigger genome. Could this be a memory issue? I did not see an option to provide more memory.
2) I give the program a scratch space of up to 1TB, so I do not think disk space is the problem.
3) The program completes successfully when I set the cluster size threshold to 50 or 100, but with lower threshold values like 15 or 20 it does not run reliably.

Any guidance would be really helpful. Thanks.

number-25 commented 4 years ago

I am having the identical issue, but with cluster_gff -c 5 -t 6. As above, plenty of scratch space is available.

The input GFF file is 6.4G in size.

Best, Dean

charlottewright commented 4 years ago

I am also having the same issue, with cluster_gff -c 5 -t 10. The input genome is 0.334 GB and the GFF file is 0.1 GB, so really not big. The pipeline runs fine when I supply it the reads obtained from a single nanopore run, but when I provide it with a fastq file containing the reads from five runs combined, it gets stuck. The strange thing is that I have tried increasing the threads to 30 and increasing the memory, and the pipeline still gets stuck at the same point (cluster_gff), and the file sizes of clustered_transcripts.gff and cluster_memberships.tsv are the same as when I run with 10 threads. Did anyone find a workaround for this issue?

danledinh commented 3 years ago

Similar issue here too. It starts running fine, then throws an error:

polish_clusters: 13:35:41 Polishing cluster 27cc9a42-6daf-4f59-96d3-794bbc78e668 of size 37
polish_clusters: 13:35:43 Failed running command: racon -t 1 -q -1 \
/tmp/pinfish_27cc9a42-6daf-4f59-96d3-794bbc78e668_091686421/reads.fq \
/tmp/pinfish_27cc9a42-6daf-4f59-96d3-794bbc78e668_091686421/alignments.sam \
/tmp/pinfish_27cc9a42-6daf-4f59-96d3-794bbc78e668_091686421/reference.fq > \
/tmp/pinfish_27cc9a42-6daf-4f59-96d3-794bbc78e668_091686421/consensus.fq \
 - exit status 134

When I look for the racon input files, everything checks out. Running the command that failed from above, I get this error message:

[racon::Polisher::initialize] loaded target sequences 0.000107 s
terminate called after throwing an instance of 'std::invalid_argument'
  what():  [bioparser::FastqParser] error: invalid file format!

Because the FastqParser module is throwing the error, I investigated the */reads.fq input file. I found at least one instance of an empty fastq entry. The read name is present but no sequence data:

#showing 2 reads. First = empty, Second = valid. 
@d40c1ded-3478-429a-a32b-103ded1a14be

+

@2f305979-9ef1-45e8-8afc-e512089ca100
GCACAGAGGGACCCTCTATCATGGCTTCAGGGGGTGCCCAGGTCCTTCGGTAGCTGGTGGGCGTGAGCGCAGGAAGCACATTTGGCTTGGCAGGAGCTGTCAGGAAGGGAAGCGGGTTTCCTGCGATTCTGC
TGTTCTGCCCCTGGGTGAAGCCGCACACCTCCTCCCAGGGGCCCAGGC
+
)+##$)*(+4;::2<9/%57%('-1131$3310:=922(/,-43001%%'--,HH915566-0=@52*)//)40(0;>;8:;471((((,1282249:4>9,:;-..9><24--*-..)3($&,%9;<:8/#
#'$'+-07:;:98-:))-/01-),0&-569;+5@AA6<314<<;(3/'

Looks like one or more of these empty fastq entries might be causing the racon error.
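
For anyone who wants to check their own reads.fq for the same problem, here is a minimal Python sketch that lists records with a read name but no sequence. It assumes strict 4-line FASTQ records (header, sequence, `+`, quality), which may not hold for every file; the function names `read_fastq` and `empty_records` are made up for this example and are not part of pinfish or racon.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality), assuming strict 4-line records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\r\n")
        yield header.rstrip("\r\n")[1:], seq, qual


def empty_records(handle: TextIO) -> List[str]:
    """Return the names of records whose sequence line is empty."""
    return [name for name, seq, _ in read_fastq(handle) if not seq]


# Tiny in-memory example: the first record is empty, the second is valid.
example = io.StringIO("@read_a\n\n+\n\n@read_b\nACGT\n+\nIIII\n")
print(empty_records(example))
```

To run it against a real file, open the path from the failed racon command with `open(path)` and pass the handle to `empty_records`.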

claumer commented 3 years ago

Having exactly the same issue as danledinh

(base) [claumer@BUCEPHALUS DLY001_Ctnc_cDNA_R10]$ ../pinfish/polish_clusters/polish_clusters -a DLY001_cDNA_pychopped_rm_clusters.tsv -o DLY001_cDNA_pychopped_rm_pinfish.fasta -t 1 DLY001_cDNA_pychopped_rm.bam
polish_clusters: 18:29:33 Polishing cluster 4da42048-684c-4493-abd9-450233a7c2e8 of size 3
polish_clusters: 18:29:33 Polishing cluster 23b44382-98a4-4b7c-84a6-3bcd5a6f3f48 of size 3
polish_clusters: 18:29:33 Polishing cluster 5d29cfbd-4d9f-4134-9634-8c571c71ee8a of size 9
polish_clusters: 18:29:33 Polishing cluster cfdba0b7-e7c5-4ee1-9fe6-1992a3183e00 of size 3
polish_clusters: 18:29:33 Polishing cluster b4d05c4e-784a-4b4b-834b-6a27f8284337 of size 6
polish_clusters: 18:29:34 Failed running command: racon -t 1 -q -1 /tmp/pinfish_b4d05c4e-784a-4b4b-834b-6a27f8284337_756416823/reads.fq /tmp/pinfish_b4d05c4e-784a-4b4b-834b-6a27f8284337_756416823/alignments.sam /tmp/pinfish_b4d05c4e-784a-4b4b-834b-6a27f8284337_756416823/reference.fq > /tmp/pinfish_b4d05c4e-784a-4b4b-834b-6a27f8284337_756416823/consensus.fq - exit status 134

And when I less the reads.fq, I see (quoting a section that has the empty record):


@93:1039|f647c837-39a0-4b36-aab3-b48010630871 GACGGCCATAACAATATGTTGATTGCAATCCTCTGTGGTATTCTCCTGGTGTCTTTGGTGGACGGCCAATTGACCAAGAAAAAGAGAAACAATTCATTCTGGATGAAGCCAATCGCAATCGCAATGATGTGGCCAAACAACAAAAGATTCCCAACATGGTTCTTGATGGAATCGACGAATCGTGCCGCTGACAAACTCAAGAGTTCACCAGTCTTTGGGAATCTAAGCTTTTCAGCCATGACGGCAAATTTGGAAGATTTGCAAACGTGTTCGACTTCCATATGGTGTCATTGTTGCAAATCTCATTGAGCGAGTGCGCTGAAAAGCTTGGTGTCGGGCTCCCAATTCCTATGCTTGCAAACAAGTTACCTTTCCTGGTGCCTCTGTGATTGGATGTTGGAAAAACGGAATGCACCCAAAGTCGATTCGAAAGGTTCCTTCTCCTGCAACATTGCTAAGAAGGGGAGGATACACCCAAGGTGCTTCGTTGAATGGAGAACCCCTGCACTCAATGCCCCTCGGGTTACTCCCAATGCATGGATAATTTGTGTGCCAAACCCAGTCAGTGTTCTCGAGGAAATGTGCGATGCACTCCGTCACACCCACACAAACTGCTCTGACAAGTGAAGCAGCGCCCCAACAAGGATAACATCATTCCACGCATTATTCCATTCACATTCAACACGGATGACAGTTGTCAAGACATGTATCCAGATTACTGTGTGGATATGAAACCGGGCAATTGCTCTTTCTGTCCATACATTCGCAGTTTGTACTGTCCAAGAACATGCAACAATTGCTATTGATGAATTACGATAAATGTAATGAAGAATTTACTGCATCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAATAAAAAAAAAAAA + %#%(4/&##%)%&+(0-68&%%-'(7@;)3=?;AC@FADF:?@AAE;ADBAAO=E=B?6>26?:BCD@?878-+-$'1?A='**DB?9;HH@MKH=?@??HOKJ?FHF6==@:=7@AHFKLGGFAAED@=DDIIFCEEPEEBGH@<DA=@?;5?.%)%(?=:4CD>-43&#(76()'&74%6:766;;>&A80642<:==9=:9B@BG9'$%%,+$1.%/2@;6@A@?A>A?;7;>?IGB9>GKJ@==F;7:::=,'('6'54:2?=?<CD9:;<=?8%8-&..,,11D=,05162$$$&%&&8+3057('AHC:9A;3A?C06=<6<7;9:B>435@:BDA@@:ACLBJH<==:=D017;=A<>B?@=9798=/99/-('1052/)-3830./?B:7C+$5313A&$,&77:($&282;0$&08-;>8;=DHD==?;/.4''&%.379885=;>7@'):0&114;69B=<AGCFFEBA9A9:=4;9;8//-)5(&?.@;?-==7>?:>=;=BACCAGEBC@EFHFOJB>3;=;%./IH@B==C82;??9FAA:<@B@MOHDEB==?@=:?8;9);:D::<96FB@DCDDI>D@AA?>A?HG;B:D/:<<?B=$17>9=:><>;85,-=67999'.AC<@@DIHECA;:>=300/2128FB8).>C?=FF@DJD?D&4:::;+98A;.-1(('(/%*46456BFG8..@FB8>8B>>BFFA;;-'D+:@:6:1420;:GI>?E@?CF?@;685=<;%%%&?,1<<=@B?CB?B>FB?I@CA=DH?A8-@?BNKIE:D=>DHONFFFEGKD8;?DA?.+,::=<=;;:9343:9<:<9;::56577;:>==999<<=;:76244545333321-//42313243523366:9;6887642212C?F@@CB@@:64310/// @751:4292|09afd5f0-283b-44a8-87e6-8968f819131e

+

@119:565|51eb6cff-fd3d-4a4d-bdc7-04af6565e96b CTCTGGTGATTTATGTGGAAAAACGGAATGCACCCAAAGTCGATTCAAAGGTTCCTTCTCCTGCAACATTGCTAGGGAGGATACGCCCAAGGTGCTTCGTCGAATGGAGAACCCTGCACTCAATGCCCTCGGGTTACTCCCAACGCATGGATAATTTGTGTGCCAAACCCAGTCAGTGTCTCCAGGAACGTGCGATGCACTGCCGTCACACCCACACAAACTGCTCTGACAAGTGAAGCAGCGCCCAACAAGGATGACATCATTCCACGCATTATTCCATTCACATTCAACACGGATGACAGTTGTCAAGACATGTATCCAATTACTCGTGGATATAAACCGGGCAATTGCTCCTTCTGTCCATACATTCGCAGTTTGTACTGTCCAAGAACATACCAACAATTGCTATTGATGAATTACGATAATGTAATGAAGAATTTACTGCG + 6:33-47:=&&67>>A?3?FE@D8>BEFD?>=::<G?A=40<..):;;;>5.-//-1,'',&$%)$9>6@C?@?<::=@3=982(;4<2./7&%A@472.:?<@=A:?:=<;557=?<A8;=>CDC=<;?8:F(;9;6:620;=@98:29A??AI9EB89>993'.+%%%)9;07/.@:;/39AD?>@?;/&$*0/538370)44A+7-.76445++065//-:57=9::=25C.>E:-36841;;2:>>;3<<?AD==BCIA@BECCB?C@?GAAC?6/)))>DB@DEC?>552A=>975BBB,+996)'#%&3,,++$5>?=5>>=432+**67ACIJC;<A9AABC?<BGC@FHBOIIDEC>>:9@?39<5-%#'4AF@:DA@?CDAEA@99:ED@BDB;B;CB<==HFHCGC<=C>@=9=9-$ @114:846|6d4fbf6b-3651-43fa-aa07-5c3d5770bb4b GCAATCGCAATGATGTGGCCAAACAACAAAAGATTCCCAACATGGTTCTGATGGAATGGGACGATCGCGCCGCCTGACAAAGTTGAACAGTTCCATCGCCAAAGTTTTTTGCCGGATCTAGTTTCAGCCATGACGGCAAATTGGAAGATTTGCAAACGTGTTTCGACTTTCGATATGGTGTCATTATTGCAAATCATTGAGCGAGATGCGCTGAAAGCTTGGTGTCGGGCTCCCAATTCCTAATGCCTTGCAACAAGTTACCTTTCCTGTGCCTCTGCGCATATTGGATGTGGTAAAAAACGGAATGCACCCGAAAGTCGATTCAAAGGTTCCTTCTCCTGCAACATTGCTGAGGGAGGATACGCCCAAAGGTGCTTCGTTGAAATGGAGAACCTCCACTCAATGCCCCTCGGGTTACTCCCAATGCATGGATAATTTGTGTGCCAACCCCAGTCAGTGTTCTCGAGGAAGTGCGATGCACTGCCGTCACACCCACACAAACTGCTCTGACAAGTGAAGCAGCGCCCAACAAGGATGACATCATTCCACGCATTATTCCATTCACATTCAACACGGATGACAGTTGTCAAGACATGTATCAGATCACCGTGGATATGAAACTGGCAATTGCTCCTTCTGTCCATACATTCGCAGTTTCTGTGTCCAAGAACATGCAACAATTGCTATTGATGAATTACTCTATAAAATGTAATGAAGAATTTACTGCATCAG
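
If you want to re-run the failing racon command by hand, one thing to try is dropping the empty records from reads.fq first. The sketch below is only that, a sketch: it assumes strict 4-line FASTQ records, the helper name `drop_empty_records` is made up for this example, and it does not touch polish_clusters itself. Whether racon then accepts the file (the SAM may still reference the dropped read) is untested here.

```python
import io
from typing import TextIO


def drop_empty_records(src: TextIO, dst: TextIO) -> int:
    """Copy 4-line FASTQ records from src to dst, skipping records whose
    sequence line is empty. Returns the number of records dropped."""
    dropped = 0
    while True:
        record = [src.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        if record[1].strip():
            dst.writelines(record)
        else:
            dropped += 1
    return dropped


# Tiny in-memory example: one empty record followed by one valid record.
src = io.StringIO("@read_a\n\n+\n\n@read_b\nACGT\n+\nIIII\n")
dst = io.StringIO()
print(drop_empty_records(src, dst), "record(s) dropped")
print(dst.getvalue(), end="")
```

With real files, replace the `StringIO` objects with `open("reads.fq")` and `open("reads.clean.fq", "w")` (the output name is just a placeholder).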



bsipos commented 3 years ago

This pipeline is no longer recommended for reference-based isoform analysis. Please use the newer pipeline-nanopore-ref-isoforms pipeline instead.