tacorna / taco

Multi-sample transcriptome assembly from RNA-Seq
http://tacorna.github.io

Error when processing StringTie GTFs? #21

Open mncfletcher opened 6 years ago

mncfletcher commented 6 years ago

Hello,

I've been running TACO on a set of input StringTie GTFs, as follows:

$ ./taco_run --output-dir 20180426_TACO_test_meta-assembly ./metaassembly_list_filtered6samples.txt
2018-04-26 15:58:55,233 pid=49351 INFO - taco version 0.6.2
2018-04-26 15:58:55,233 pid=49351 INFO - ------------------------------------------------------------------------------
2018-04-26 15:58:55,233 pid=49351 INFO - verbose logging: False
2018-04-26 15:58:55,233 pid=49351 INFO - num processes: 1
2018-04-26 15:58:55,233 pid=49351 INFO - output directory: 20180426_TACO_test_meta-assembly
2018-04-26 15:58:55,233 pid=49351 INFO - filter min length: 200
2018-04-26 15:58:55,233 pid=49351 INFO - filter min expression: 0.5
2018-04-26 15:58:55,233 pid=49351 INFO - filter splice juncs: 0
2018-04-26 15:58:55,234 pid=49351 INFO - additional splice motifs:
2018-04-26 15:58:55,234 pid=49351 INFO - reference genome FASTA file: None
2018-04-26 15:58:55,234 pid=49351 INFO - reference GTF file: None
2018-04-26 15:58:55,234 pid=49351 INFO - guided assembly mode: False
2018-04-26 15:58:55,234 pid=49351 INFO - guided strand mode: False
2018-04-26 15:58:55,234 pid=49351 INFO - guided ends mode: False
2018-04-26 15:58:55,234 pid=49351 INFO - GTF expression attribute: FPKM
2018-04-26 15:58:55,234 pid=49351 INFO - isoform fraction: 0.05
2018-04-26 15:58:55,234 pid=49351 INFO - max_isoforms: 0
2018-04-26 15:58:55,234 pid=49351 INFO - assemble_unstranded: 0
2018-04-26 15:58:55,234 pid=49351 INFO - change point: True
2018-04-26 15:58:55,234 pid=49351 INFO - change point pvalue: 0.01
2018-04-26 15:58:55,235 pid=49351 INFO - change point fold change: 0.85
2018-04-26 15:58:55,235 pid=49351 INFO - change point trim: True
2018-04-26 15:58:55,235 pid=49351 INFO - path frac: 0.0
2018-04-26 15:58:55,235 pid=49351 INFO - max paths: 0
2018-04-26 15:58:55,235 pid=49351 INFO - Samples: 62
2018-04-26 15:58:55,235 pid=49351 INFO - Aggregating GTF files
2018-04-26 15:58:55,235 pid=49351 INFO - Aggregating in parallel using 1 processes
Process Process-1:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range

Here's the head of that first input GTF:

chr1 StringTie transcript 661810 668798 1000 - . gene_id "STRG.46"; transcript_id "STRG.46.1"; cov 2.886979; FPKM 0.107206; TPM 0.267005;
chr1 StringTie exon 661810 665184 1000 - . gene_id "STRG.46"; transcript_id "STRG.46.1"; cov 2.596889; exon_number 1;
chr1 StringTie exon 665278 665335 1000 - . gene_id "STRG.46"; transcript_id "STRG.46.1"; cov 9.672414; exon_number 2;
chr1 StringTie transcript 665669 666607 1000 - . gene_id "STRG.45"; transcript_id "STRG.45.1"; cov 4.79393; FPKM 0.17802; TPM 0.443372;
chr1 StringTie exon 665669 666607 1000 - . gene_id "STRG.45"; transcript_id "STRG.45.1"; cov 4.79393; exon_number 1;
chr1 StringTie exon 667397 667587 1000 - . gene_id "STRG.46"; transcript_id "STRG.46.1"; cov 5.657221; exon_number 3;
chr1 StringTie exon 668687 668798 1000 - . gene_id "STRG.46"; transcript_id "STRG.46.1"; cov 3.390395; exon_number 4;
chr1 StringTie transcript 671471 674675 1000 - . gene_id "STRG.47"; transcript_id "STRG.47.1"; cov 6.502955; FPKM 0.241483; TPM 0.601433;
chr1 StringTie exon 671471 671999 1000 - . gene_id "STRG.47"; transcript_id "STRG.47.1"; cov 3.571001; exon_number 1;
chr1 StringTie exon 672093 672150 1000 - . gene_id "STRG.47"; transcript_id "STRG.47.1"; cov 16.501148; exon_number 2;

But I can then check the tmp files for this first sample, and I can't see that there's anything wrong here...

$ tail *bed
==> transfrags.bed <==
chr1 234859089 234859844 1.886 2.22552961936 + 0 0 0 1 755 0
chr1 235461927 235471713 1.887 6.14092572339 + 0 0 0 1 9786 0
chr1 235510525 235511962 1.888 2.99419129846 - 0 0 0 1 1437 0
chr1 236069333 236071534 1.889 2.77118211257 - 0 0 0 1 2201 0
chr1 236072325 236101360 1.890 13.8912841355 + 0 0 0 2 434,2201 0,26834
chr1 236072325 236136530 1.891 13.8423751983 + 0 0 0 4 434,174,116,10910 0,45847,48885,53295
chr1 236072325 236112891 1.892 11.3264622779 + 0 0 0 2 434,5539 0,35027
chr1 236072325 236136530 1.893 3.28182860696 + 0 0 0 4 434,160,116,10910 0,45847,48885,53295
chr1 236073226 236136530 1.894 3.4367933961 - 0 0 0 3 2542,122,1470 0,47955,61834
chr1 236089103 236093288 1.895 4.40279896114 - 0 0 0 1 4185 0

==> transfrags.filtered.bed <==
chr1 16138815 16138915 1.54 4.70818792192 - 0 0 0 1 100 0
chr1 28833883 28833999 1.120 950.387116491 - 0 0 0 1 116 0
chr1 60461217 60461346 1.252 4.01131124351 + 0 0 0 1 129 0
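
For anyone wanting to go beyond eyeballing the temporary BED files, here is a minimal sketch of a consistency check. It assumes the tmp files follow the usual 12-column BED layout (block count, block sizes and block starts in the last three columns), which is what the lines above look like; that is an assumption, not something confirmed by TACO's documentation in this thread. The function name `check_bed12_line` is hypothetical and not part of TACO.

```python
def check_bed12_line(line):
    """Return a list of problems found in one BED12 line (empty if none).

    Assumes the standard BED12 column order. The score column is not
    parsed, because it holds floats in the files shown in this issue.
    """
    fields = line.split()
    if len(fields) != 12:
        return [f"expected 12 fields, found {len(fields)}"]

    chrom_start, chrom_end = int(fields[1]), int(fields[2])
    block_count = int(fields[9])
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]

    if block_count < 1:
        return ["block count is less than 1"]
    if len(block_sizes) != block_count or len(block_starts) != block_count:
        return [
            f"block count {block_count} does not match "
            f"{len(block_sizes)} sizes / {len(block_starts)} starts"
        ]

    problems = []
    if block_starts[0] != 0:
        problems.append("first block does not start at chromStart")
    if chrom_start + block_starts[-1] + block_sizes[-1] != chrom_end:
        problems.append("last block does not end at chromEnd")
    return problems


# Two lines copied from the transfrags.bed tail above.
EXAMPLE_LINES = [
    "chr1 236072325 236101360 1.890 13.8912841355 + 0 0 0 2 434,2201 0,26834",
    "chr1 236073226 236136530 1.894 3.4367933961 - 0 0 0 3 2542,122,1470 0,47955,61834",
]

for bed_line in EXAMPLE_LINES:
    print(bed_line.split()[3], check_bed12_line(bed_line) or "ok")
```

Running the same function over every line of `transfrags.bed` would show whether the file written before the crash is internally consistent. It cannot show which transcript triggered the IndexError, since that record presumably never reached the file.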

Is there some weird formatting with my input GTFs that I'm missing? Or is the issue somewhere deeper?

I'd love to give TACO a go - having spent too many weeks playing with cuffmerge and stringtie --merge I am extremely happy to try anything that may perform better than them...!

Thanks very much for your help!

yniknafs commented 6 years ago

happy to help. What exactly is the error you are receiving? Does the run finish?

mncfletcher commented 6 years ago

So it hangs on those IndexErrors, but doesn’t crash. If I check the resource usage of the cluster job (as I’m running it in an interactive PBS job currently), CPU usage is 0.

If I run in multi-threaded mode with 8 threads, I get the same errors from each child process in turn, and again everything hangs; no crash back to prompt.

When I kill the proc then I get the following, final traceback:

===
Process Process-8:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range
Process Process-4:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range
Process Process-5:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range
Process Process-7:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range
Process Process-2:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range
Process Process-1:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range
Process Process-6:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range
Process Process-3:
Traceback (most recent call last):
  File "multiprocessing/process.py", line 258, in _bootstrap
  File "multiprocessing/process.py", line 114, in run
  File "taco/lib/aggregate.py", line 168, in aggregate_worker
  File "taco/lib/aggregate.py", line 127, in aggregate_sample
  File "taco/lib/transfrag.py", line 78, in to_bed
IndexError: list index out of range
^CTraceback (most recent call last):
  File "taco/taco_run.py", line 57, in

Failed to execute script taco_run


yniknafs commented 6 years ago

Got it. Can you send some snippets of the GTF files used (or the entire files)? Even a few GTF files would do, so I can try to recreate the error.

mncfletcher commented 6 years ago

Here's the top 10000 lines from the first GTF processed - given the multi-proc run fails in the same way I assume that it should be obvious from this example..!

TACO_crash_IndexError_example.gtf.gz

If you need the full GTF please ask and I'll get that on the way.

yniknafs commented 6 years ago

awesome thanks. give me a bit and I'll debug.