nanoporetech / pinfish

Tools to annotate genomes using long read transcriptomics data

polish_clusters fails with racon error #2

Closed sdwien closed 5 years ago

sdwien commented 5 years ago

Starting from a BAM file of ONT direct RNA reads aligned to the mouse genome with minimap2 (-x splice), I ran the first two pinfish commands successfully:

spliced_bam2gff -s fastq_runid_a7dd2b90b03f7f2be36d2c837fd73e0272542809_sort.bam > raw_transcripts.gff

cluster_gff -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff

However, the 3rd step fails with:

polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 10 fastq_runid_a7dd2b90b03f7f2be36d2c837fd73e0272542809_sort.bam

polish_clusters: 13:32:36 Polishing cluster 807d3c5d-18d5-4be1-a3aa-3d6e97d36d86 of size 165

polish_clusters: 13:32:37 Polishing cluster b799bd86-4133-4c78-8727-4dd097073d53 of size 62

polish_clusters: 13:32:37 Failed running command: racon -t 10 -q -1 /tmp/pinfish_b799bd86-4133-4c78-8727-4dd097073d53_281010008/reads.fq /tmp/pinfish_b799bd86-4133-4c78-8727-4dd097073d53_281010008/alignments.sam /tmp/pinfish_b799bd86-4133-4c78-8727-4dd097073d53_281010008/reference.fq > /tmp/pinfish_b799bd86-4133-4c78-8727-4dd097073d53_281010008/consensus.fq - exit status 1

I also noticed that when I run the same command again, different clusters are processed first. Is that expected?

polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 10 fastq_runid_a7dd2b90b03f7f2be36d2c837fd73e0272542809_sort.bam

polish_clusters: 13:44:01 Polishing cluster 2da54d30-382b-46fe-83c0-ae47c4f34ee9 of size 104

polish_clusters: 13:44:02 Polishing cluster 31b0ce79-2a54-4057-a534-599dbde2d39a of size 52

polish_clusters: 13:44:02 Failed running command: racon -t 10 -q -1 /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/reads.fq /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/alignments.sam /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/reference.fq > /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/consensus.fq - exit status 1

In /tmp/, some files do start to be generated:

`ll /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/

40143 Dec 18 13:44 alignments.sam

0 Dec 18 13:44 consensus.fq

34064 Dec 18 13:44 reads.fq

810 Dec 18 13:44 reference.fq`

To me, it looks like it starts running (first two clusters of size 165 and 62, resp., are processed), but then encounters something it does not like. Any ideas for troubleshooting? The error log is unfortunately not really enlightening.

Many thanks, best, Sophia

bsipos commented 5 years ago

Hi,

Could you please run the failed command (with the files present in the tmp directory) and paste here the racon log? BTW, it is expected that clusters are processed in different order for different runs.
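One way to collect the log requested here is to rerun the failed command while capturing its standard error and exit code separately from its standard output. The following is a minimal Python sketch, assuming racon is on the PATH; the helper name `run_and_capture` is hypothetical and not part of pinfish.

```python
import subprocess


def run_and_capture(cmd):
    """Run a command given as a list of arguments.

    Returns (exit_code, stdout, stderr) with both streams decoded as text.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# Hypothetical usage with the three files left in the pinfish temp directory:
# code, consensus, log = run_and_capture(
#     ["racon", "-t", "10", "-q", "-1",
#      "reads.fq", "alignments.sam", "reference.fq"])
# print(code)
# print(log)
```

Because racon writes the consensus to standard output, keeping the two streams apart means any error message is not lost in the redirected consensus file.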

sdwien commented 5 years ago

Hi Botond, thanks for looking into this. I am not getting any more output than what I already reported. In temp/pinfish_xxx/, I have alignments.sam, reads.fq and reference.fq with some content, and consensus.fq (empty). When I run the suggested racon command myself, I get a FASTQ parser error:

racon -t 40 -q -1 /project/201809_3T3/20180913_1603_3T3L1_Day7/fastq/workspace/temp/pinfish_f79f3b0d-8727-4e35-9b71-0457e2aad164_666575378/reads.fq /project/201809_3T3/20180913_1603_3T3L1_Day7/fastq/workspace/temp/pinfish_f79f3b0d-8727-4e35-9b71-0457e2aad164_666575378/alignments.sam /project/201809_3T3/20180913_1603_3T3L1_Day7/fastq/workspace/temp/pinfish_f79f3b0d-8727-4e35-9b71-0457e2aad164_666575378/reference.fq > /project/201809_3T3/20180913_1603_3T3L1_Day7/fastq/workspace/temp/pinfish_f79f3b0d-8727-4e35-9b71-0457e2aad164_666575378/consensus.fq

[bioparser::FastqParser] error: invalid file format!

The reads.fq file that is generated (by pinfish?) looks like this. Please note that there are some empty entries at the beginning (and also throughout the file):

` @c21ee410-a840-41b7-9bc1-3be1244d3da1

+

@dcbbd9c7-57cf-4865-9d7e-a7d90b276826

+

@b459b24b-419f-4d6a-baff-bbb8ce60b354

CGGTTTTTTTTTGCTTATGGAAATTATTGCATTCTCAACAGAATTATTTTTTTCTTTTCACATCTTGCTCTGCAGCTTCCTCGCCTTTTTGCCCGGATTGAGTCGGGCATTGGCACGGGGCCTCATTCGAGAACTCCTGGCAAAATTCAAGTTCTCTTCTTCTTCTGAATATAACTCTGGCCTTTTCCTTTGAAGACACATTCCGGAATGGGCATCACAGGTCCTGTCAGCTGGGAGCAAATCCTCAGCAGAAAACTGTCTTCCTTCTGTGCAAGGCTTCCTGGGAGATTTTGGATGAGCTTGGCAGACTCCTTCAGGCGCTATACACGTTGGCCTGCAATGACTCCGAAGGACTTGTTCGCCTCGGGTCCCACAGATGCCGATGGTGCGAAGCCTTTCTTGTGGATGCCAGCCACCCTGAAGCTCCTCCAGGATTTGGTGAAGCCCGCCAGCCCGGCGCTTGGGTGTGTGGATCTCACCGTGGGGCACCTCAATTAGATGGGCCTGATGGGGCCGGACGCGCGGGGGCGGTGGGCGATGCGACACGCTTCGCCAGCCGGTGCCTTGCGCGCCTGCGGATCTCGCGCGCCGGCTGGTTGGTGAAACCGTAACCAAGTGTACCACTCGCTGCTGCCAATCCTTGTGGAAGTGGGGCTTCAGTATCGGCTATTCCGGCTGGACGCCATGGCTGCCTATTGGAACGGCCAG

+

$$(+0556820*'&/+/+00-,''%$$')-)(%'$'&&+.-+)+369:72),1.(&&((+/4,+)('+.+/))&%'&+)030%,%&''##$''2/11+)+),,,)1-(&(%&'+('(&'((%()+)(,4-'$%&.-0+.1,)0+/)$$$*(&))((',&+).,'+(.0'%(()+%'(.),-+'+,651/0/04/-+/,()-/,85+'&).+,((&%&)+'+-121/.+-+,)&&&$23.1/312+01/++&++/,,2*-),-++&&),((%'&%&%',-''&%%%$&&(),+*+(%$%&%((+()'','%&%%)+)),,((,0--4.(('%&((''()-71))(&&$'%)+-(((&+-+)),-),-+(,2'(')))'()')('(&((&''%&(&',5-(+/+-63)-&'&')(++-))').)---,/1++(-07--'(&%%&'%))())1)-(''-3901/11))%'%,864+,('1/,))%&&))'()*-01/.&0))%+0+---)(,+&'()-.++&%%%&))((('')'''#$(+%%'&'&-'+(''%)(01(&((++((+)+(','&%%$)&&$&&+,80+++(''%331-0.%+0434330)3.,++('')0-'&(&$%&)'+..(&%#%

@5ea16738-e4c6-4226-924d-bc8422747150 `

The beginning of the alignments.sam file also has these empty entries:

@SQ SN:f79f3b0d-8727-4e35-9b71-0457e2aad164 LN:0

@PG ID:minimap2 PN:minimap2 VN:2.14-r883 CL:minimap2 -ax map-ont -t 40 -k14 /project/201809_3T3/20180913_1603_3T3L1_Day7/fastq/workspace/temp/pinfish_f79f3b0d-8727-4e35-9b71-0457e2aad164_666575378/reference.fq /project/201809_3T3/20180913_1603_3T3L1_Day7/fastq/workspace/temp/pinfish_f79f3b0d-8727-4e35-9b71-0457e2aad164_666575378/reads.fq

4fbacedf-ca53-4228-9c95-a68ea8b1cb50 4 0 0 0 0

1cf559a1-b14d-456e-a9c5-992bccdea9aa 4 0 0 0 0

14faef87-b490-4d42-b313-6b60f5b03d44 4 0 0 0 0

55675a6b-9844-4762-b660-0aafda2427a6 4 0 0 0 0

85c8a95f-de61-4ed4-9fa2-035168e24d19 4 0 0 0 0

fe116dfe-0432-4bd0-8f85-bbcf069cc748 4 0 0 0 0

65739421-4bf3-402b-9571-3f1a49a19eba 4 0 0 0 0

4c2b5139-3759-4c85-8eeb-cf9550e2dc4e 4 0 0 0 0

7a99713e-9863-44e2-b563-0b024a5d0128 4 0 0 0 0

4f19a0c3-c4a1-4095-8cf3-eaff58f6624a 4 0 0 0 0

c21ee410-a840-41b7-9bc1-3be1244d3da1 4 0 0 0 0

dcbbd9c7-57cf-4865-9d7e-a7d90b276826 4 0 0 0 0

b459b24b-419f-4d6a-baff-bbb8ce60b354 4 0 0 0 0 CGGTTTTTTTTTGCTTATGGAAATTATTGCATTCTCAACAGAATTATTTTTTTCTTTTCACATCTTGCTCTGCAGCTTCCTCGCCTTTTTGCCCGGATTGAGTCGGGCATTGGCACGGGGCCTCATTCGAGAACTCCTGGCAAAATTCAAGTTCTCTTCTTCTTCTGAATATAACTCTGGCCTTTTCCTTTGAAGACACATTCCGGAATGGGCATCACAGGTCCTGTCAGCTGGGAGCAAATCCTCAGCAGAAAACTGTCTTCCTTCTGTGCAAGGCTTCCTGGGAGATTTTGGATGAGCTTGGCAGACTCCTTCAGGCGCTATACACGTTGGCCTGCAATGACTCCGAAGGACTTGTTCGCCTCGGGTCCCACAGATGCCGATGGTGCGAAGCCTTTCTTGTGGATGCCAGCCACCCTGAAGCTCCTCCAGGATTTGGTGAAGCCCGCCAGCCCGGCGCTTGGGTGTGTGGATCTCACCGTGGGGCACCTCAATTAGATGGGCCTGATGGGGCCGGACGCGCGGGGGCGGTGGGCGATGCGACACGCTTCGCCAGCCGGTGCCTTGCGCGCCTGCGGATCTCGCGCGCCGGCTGGTTGGTGAAACCGTAACCAAGTGTACCACTCGCTGCTGCCAATCCTTGTGGAAGTGGGGCTTCAGTATCGGCTATTCCGGCTGGACGCCATGGCTGCCTATTGGAACGGCCAG $$(+0556820*'&/+/+00-,''%$$')-)(%'$'&&+.-+)+369:72),1.(&&((+/4,+)('+.+/))&%'&+)030%,%&''##$''2/11+)+),,,)1-(&(%&'+('(&'((%()+)(,4-'$%&.-0+.1,)0+/)$$$*(&))((',&+).,'+(.0'%(()+%'(.),-+'+,651/0/04/-+/,()-/,85+'&).+,((&%&)+'+-121/.+-+,)&&&$23.1/312+01/++&++/,,2*-),-++&&),((%'&%&%',-''&%%%$&&(),++(%$%&%((+()'','%&%%)+)),,((,0--4.(('%&((''()-71))(&&$'%)+-(((&+-+)),-),-+(,2'(')))'()')('(&((&''%&(&',5-(+/+-63)-&'&')(++-))').)---,/1++(-07--'(&%%&'%))())*1)-(''-3901/11))%'%,864+,('1/,))%&&))'()*-01/.&0))%+0+---)(,+&'()-.++&%%%&))((('')'''#$(+%%'&'&-'+(''%)(01(&((++((+)+(','&%%$)&&$&&+,80+++(''%331-0.%+0434330)3.,++('')0-'&(&$%&)'+.*.(&%#%

My input BAM file (from minimap2) looks like this, so there are also entries without sequences:

68825a95-32f0-4e71-8469-dbc44880f323 272 chr1 4687878 0 2S9M1D9M5I5M1D4M1D5M1I13M2D1M1D7M1D25M1D17M3D5M1D12M1I3M1D13M1I13M2I17M2D31M1D4M2I33M3I1M1D42M1I4M5D11M1D12M5D3M1D11M2I3M1I1M2I3M1I25M2D8M2I2M4D9M3I3M2D33M1D8M1D6M10D13M5D22M1D9M3D20M2D6M1I11M2D7M2D6M4D13M9I32M7D7M1I13M3D19M2D8M1D13M1D2M1D20M1D15M5S * 0 0 * * NM:i:178 ms:i:249 AS:i:249 nn:i:0 ts:A:+ tp:A:S cm:i:7 s1:i:102 dv:f:0.1753

b0f56eb9-3a87-4ec6-9653-e7dd8922ea5a 272 chr1 4687879 0 3S18M1I17M2D21M1D6M1D9M1I25M2I10M1D15M1D14M3D26M1I5M2D1M1D16M1D11M1I14M1D4M1I47M1I32M2I13M1I6M2I2M1D6M1I12M2I3M1I16M1I7M2D7M2I9M1I21M1I28M8D2M5D24M1I3M2D3M1D3M1D44M1D16M1I8M1D26M4I32M4D9M2D4M1I1M1I14M1D6M1D6M2D2M1D11M1I13M1D8M2D23M1D9M2I39M4D12M1D6M1D31M2D16M2D5M10D3M1D3M2D1M4D32M2I7M2I6M1D48M1D14M3D5M3D2M1D9M1D15M1I2M1I7M1D3M2D3M1I24M1D37M1D10M1D13M2D2M1D9M1D32M2D9M1I7M2D41M3D5M1D6M1D2M1D6M4D3M1D8M1D16M3D4M3I8M2I20M1D23M2I2M1I5M1I2M2D17M1D11M1D19M1D8M1I9M1I5M7D10M2D18M1D38M7S * 0 0 * * NM:i:291 ms:i:671 AS:i:671 nn:i:0 ts:A:+ tp:A:S cm:i:23 s1:i:254 dv:f:0.1558

819819e6-f330-4d8b-98b8-05b96ec8ff65 272 chr1 4687880 0 2S11M1D10M2D3M1D6M2D52M1I11M1D9M1D15M1D5M2D2M1I16M1D18M1I3M1D31M1D5M2I4M4D2M1D9M2D42M1D4M2D14M1D7M1D13M3D4M1D21M1I23M1D7M3S * 0 0 * * NM:i:62 ms:i:183 AS:i:183 nn:i:0 ts:A:+ tp:A:S cm:i:3 s1:i:43 dv:f:0.1939

6b293f43-0db1-427b-888e-f8dc061fb2c6 16 chr1 4687880 23 2S14M1D7M1D4M1D6M3D16M1I2M2D6M2D6M1I4M3D13M4D4M1I18M1D7M1D4M2I10M3D8M1I9M1D33M1D17M1I13M3I7M1D25M2I1M1I16M1D6M3I6M2D10M1I4M1I2M1I3M1D7M1I15M2I26M2D4M1I5M2D10M1D14M1I14M1D11M2D5M4D1M2D4M6D1M3D25M4D51M5D19M1I7M * 0 0 GGTTTGGTCACAGAACTTATTGAGGGCGGGCTCCATCTGCGCAGCTACAGACAATCCCGGTAGGCCCCGCTCCACCAGCTTCTTTATTTTGCGCTTGAACTTGGCCGTTCTCGGGCTTTCAATCATATTCTTCTGGAAAGCTCTTGATCGGTCTCGGAGAATGAAACCTTCTGGCTTCAAGTCCTGAGTGAGCTTGAACAACTCAGAGCTATAAAGCTGCCATCAATGTCAGGAGCCTGGTACTTTGGAAGCCGTCCCAGCCTTCGGGCTTGTAGTCAGCCTGCCAGTCGACCATAATAACGTGCCGCTGCCCTCTGACGTGCCAATGTTCTGCCAGCCTTCGGGCCACCTGGATCGAATCCCGCAGCCTGAAAGTTCTTGGTGCTGAAAGCCGGGCTGCCCCAGTGCAGCCTTGCACATTACCCGAGCAGCTTTCTCCCGCAGCCATTTGCTGCTCCGTCTTCTTCTCCATCCTCTTCTCAGGACCAGCAGCACCAGTGGATCTTTGTGGTCCCATCACCCAGCCT ""-,(+)')))+(*))..)((++/5/*27*)+)((&)&%'%)*(-,,,--++.*),)%'((&/'*/1)')),,,/4-+()&%'0))*131**++,/,03-36.0+1-0423410-)**,()+++(%*05,31/'+&*(-)+---000)+(&&%&%)-'&()&&)/*-,-+(,0++1+)(%()2'('*'&&&$(&$&'&0+-,,*,%''*)%%'*)).-)-,-,,-+&&&+*)+(',+(*.,++.4.,),'',3*++5--('*'('+2.).*('(*.,&%(&$%(+))))%$+)%$&'(.%('%&1'++&%+))(.&(((,(*,.()'')*'&&.')((+%*',1+**()&#&%$&&(%&,+**))*(,,''+-*1/4**)--*0,*('+.(+.,9:,&'-2.)')(240/.(*.,++-),'&**%%%$&''&/3.*,-+&&',-2*)//)++(*+)/,,+'-(&'&%&,),)('&'+-,++-04-.5/423011&/+(&(&(-++1)(()*&02+('(%$+*'&+(& NM:i:120 ms:i:220 AS:i:220 nn:i:0 ts:A:+ tp:A:P cm:i:5 s1:i:55 s2:i:55 dv:f:0.1435

Maybe this is a problem? I will try and remove these empty entries. Looking forward to hearing your thoughts on this, best, Sophia

bsipos commented 5 years ago

I think the issue is indeed the malformed FASTQ records. Did removing them from the input help?
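To find such malformed records in a reads.fq file, the records with an empty sequence line can be listed before racon is run. This is a minimal Python sketch, assuming a strict four-lines-per-record FASTQ layout (no wrapped sequences); the function name `find_empty_fastq_records` is hypothetical and not part of pinfish or racon.

```python
def find_empty_fastq_records(lines):
    """Return the names of 4-line FASTQ records whose sequence line is empty.

    `lines` is any iterable of text lines, for example an open file handle.
    A trailing incomplete record (fewer than 4 lines) is ignored.
    """
    rows = [line.rstrip("\n") for line in lines]
    empty = []
    for i in range(0, len(rows) - 3, 4):
        header, seq, plus, _qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(
                "record starting at line %d is not 4-line FASTQ" % (i + 1))
        if not seq:
            # The read name is the header without '@', up to the first space.
            empty.append(header[1:].split(" ")[0])
    return empty


# Hypothetical usage on a file from the pinfish temp directory:
# with open("reads.fq") as handle:
#     print(find_empty_fastq_records(handle))
```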

sdwien commented 5 years ago

I used a bam file generated from the minimap2 bam file that contains only alignments flagged with 16 (primary alignments to reverse strand). Now, I get a different error:
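The flag values visible in the BAM excerpt earlier in the thread (272 for the records without sequence, 16 for the record with one) can be decoded with the bit definitions from the SAM specification. The sketch below is illustrative only; the helper names `decode_flag` and `passes_required_bits` are hypothetical.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAGS = [
    (0x1, "paired"), (0x2, "proper_pair"), (0x4, "unmapped"),
    (0x8, "mate_unmapped"), (0x10, "reverse"), (0x20, "mate_reverse"),
    (0x40, "read1"), (0x80, "read2"), (0x100, "secondary"),
    (0x200, "qc_fail"), (0x400, "duplicate"), (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


def passes_required_bits(flag, required):
    """Mimic a 'require these bits' filter: every bit in `required` is set."""
    return (flag & required) == required


print(decode_flag(272))  # 272 = 256 + 16
print(decode_flag(16))
```

Under these definitions, 272 is a secondary alignment on the reverse strand. A filter that only requires bit 16, such as `samtools view -f 16`, also lets flag 272 through, so excluding the secondary bit as well (for example with `-F 256`) may be needed to keep only records exactly like the one flagged 16. Whether that matches a given filtering command should be checked against the actual BAM.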

polish_clusters -a clusters.tsv -c 50 -d ./temp/ -o consensus_transcripts.fas -t 40 fastq_runid_a7dd2b90b03f7f2be36d2c837fd73e0272542809_sort_f16.bam

polish_clusters: 14:18:03 Polishing cluster 75789380-be64-430b-9257-1adfac7e81c2 of size 70

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4ea1b9]

goroutine 1 [running]:

main.getMedian(0xc443e80240, 0x46, 0x46, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)

/home/OXFORDNANOLABS/bsipos/gt/pinfish/polish_clusters/polish.go:119 +0x99

main.CreateReference(0xc423ee8755, 0x24, 0xc443e80240, 0x46, 0x46, 0xc45d0fd980, 0x78, 0x0, 0x2d, 0x4c3b27)

/home/OXFORDNANOLABS/bsipos/gt/pinfish/polish_clusters/polish.go:72 +0x74

main.PolishCluster(0xc423ee8755, 0x24, 0xc443e80240, 0x46, 0x46, 0xc42010e060, 0x7ffd75666522, 0x7, 0x28, 0x0, ...)

/home/OXFORDNANOLABS/bsipos/gt/pinfish/polish_clusters/polish.go:25 +0x2cd

main.main()

/home/OXFORDNANOLABS/bsipos/gt/pinfish/polish_clusters/main.go:46 +0x320

Does that mean anything to you? Thank you, best regards!

bsipos commented 5 years ago

Does clusters.tsv by any chance contain entries that are missing from the BAM file?
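This question can be checked by comparing the read IDs in clusters.tsv with the read names present in the BAM file. The following Python sketch assumes the cluster table is tab-separated with the read ID in the first column and no header row; that layout is an assumption, not something confirmed in this thread, so `read_column` and `skip_header` may need adjusting. The read names could come, for example, from the first column of `samtools view` output.

```python
import csv


def reads_missing_from_bam(cluster_lines, bam_read_names,
                           read_column=0, skip_header=False):
    """Return read IDs listed in the cluster table but absent from the BAM.

    cluster_lines: iterable of tab-separated text lines (e.g. an open file).
    bam_read_names: iterable of read names found in the BAM file.
    The column index and header handling are assumptions about the table.
    """
    known = set(bam_read_names)
    missing = []
    seen = set()
    for index, row in enumerate(csv.reader(cluster_lines, delimiter="\t")):
        if not row or (skip_header and index == 0):
            continue
        read_id = row[read_column]
        if read_id not in known and read_id not in seen:
            seen.add(read_id)
            missing.append(read_id)
    return missing


# Hypothetical usage:
# with open("clusters.tsv") as table:
#     print(reads_missing_from_bam(table, names_from_bam))
```

A non-empty result would suggest that clusters.tsv was produced from a different BAM file than the one passed to polish_clusters.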

sdwien commented 5 years ago

You were right, I had to regenerate the clusters.tsv and the clustered_transcripts.gff files from the cleaned-up bam file. polish_clusters runs without complaints now. Many thanks for your help, best, Sophia

CeciliaDeng commented 3 years ago

Hi @sdwien @bsipos, I have exactly the same error. How did you get it solved? I filtered the minimap bam file with '-f 16', then sorted and indexed the clean bam. spliced_bam2gff and cluster_gff finished successfully, producing raw.gff, clusters.tsv and clusters_transcripts.gff. However, when I ran polish_clusters, it failed with exit status 134 and dumped a large core file. The errors are:

cat .log/polish_cluster/Run9.err

polish_clusters: 09:45:01 Polishing cluster 891b4d1a-4db8-4709-b19a-990ed7fd5dcc of size 20

polish_clusters: 09:45:01 Polishing cluster 7810adbd-00a9-4ef9-bec7-678c2619d065 of size 292

polish_clusters: 09:45:02 Polishing cluster 21648e6b-76c9-40e4-a7cc-96b230dd52f9 of size 51

polish_clusters: 09:45:02 Polishing cluster ece6d41e-df32-454d-aaed-4e1fa1001a70 of size 26

polish_clusters: 09:45:03 Polishing cluster 1d0a32cb-79cb-47f6-ba6b-d3d50358ac7d of size 31

polish_clusters: 09:45:03 No consensus from cluster 1d0a32cb-79cb-47f6-ba6b-d3d50358ac7d, using representative sequence!

polish_clusters: 09:45:03 Polishing cluster 51bc352e-4358-4c16-b090-5119e285faa6 of size 20

polish_clusters: 09:45:03 Polishing cluster 40c2033d-8927-4f4c-bae2-2588fcc90049 of size 104

polish_clusters: 09:45:04 Polishing cluster 46cee59a-195d-4c11-8b9b-5009a2af13d2 of size 144

polish_clusters: 09:45:04 Polishing cluster 2f3f9f7a-f10c-4d4a-afde-e7df01955b86 of size 67

polish_clusters: 09:45:04 Polishing cluster c5867e07-36f8-4098-86e5-20b75572059c of size 20

polish_clusters: 09:45:05 Polishing cluster 014a73a8-f9a5-4fd1-8745-60dbfde97039 of size 45

polish_clusters: 09:45:05 Polishing cluster 728019a8-b2a3-45bb-aa0a-2fccf27d32b1 of size 39

polish_clusters: 09:45:05 Failed running command: racon -t 10 -q -1 --threads 10 /ONT_RNA/Pinfish/tmp/Run9/pinfish_728019a8-b2a3-45bb-aa0a-2fccf27d32b1_923305693/reads.fq /ONT_RNA/Pinfish/tmp/Run9/pinfish_728019a8-b2a3-45bb-aa0a-2fccf27d32b1_923305693/alignments.sam /ONT_RNA/Pinfish/tmp/Run9/pinfish_728019a8-b2a3-45bb-aa0a-2fccf27d32b1_923305693/reference.fq > /ONT_RNA/Pinfish/tmp/Run9/pinfish_728019a8-b2a3-45bb-aa0a-2fccf27d32b1_923305693/consensus.fq - exit status 134

Any suggestions, please?
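As a starting point for troubleshooting, the two exit statuses in this thread (1 and 134) can be told apart using the common shell convention that a status above 128 means the process was terminated by signal number (status - 128). The sketch below assumes the racon command was launched through a shell, which the output redirection in the log suggests but does not prove; the function name `describe_exit_status` is hypothetical.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status.

    By shell convention, values above 128 usually mean the process was
    killed by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            return "terminated by " + signal.Signals(signum).name
        except ValueError:
            return "terminated by signal %d" % signum
    return "exited with code %d" % status


print(describe_exit_status(134))
print(describe_exit_status(1))
```

Under that convention, 134 corresponds to signal 6 (SIGABRT on Linux), which would be consistent with the core dump mentioned above and points to racon aborting rather than returning an ordinary error code.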