gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
365 stars 76 forks source link

Mangled new lines in coverage gtf file #348

Closed OXB-DC closed 2 years ago

OXB-DC commented 2 years ago

Hi, I'm noting that StringTie 2.1.4 seems to mangle the output of the coverage gtf files (obtained with the -C option), in that some lines do not have a new line character added.

I am expecting 9 fields in the output files but sometime I get more. Easy to verify with: awk -F '\t' '{print NF}' <cov_refs.gtf> | sort -nu | tail -n 1

I have written a script (with some judicious use of tr and sed) to correct this which looks up the names of each chromosome and scaffold from the fasta file and uses them as line delimiters to repair the coverage gtfs (so I have got a workaround). This only works with GENCODE or UCSC named chromosomes though (prefixed with chr).

Is this something anyone else has noticed? I'm only using the coverage gtf to be conservative about which transcripts I report (only those covered end-to-end in reads).

gpertea commented 2 years ago

Does the latest 2.2.0 version exhibit the same issue on your data? I would rather not go back to track this bug in older versions if it has been fixed already.

OXB-DC commented 2 years ago

I will download and install latest version and set up a run to see. Was wondering if anyone else could at least run that awk command to check on any output they already have.

gsatas commented 2 years ago

I can confirm that I am observing this issue with StringTie v2.2.1.

It appears to stem from a multithreading race condition. Lines aren't just missing line breaks, but multiple threads seem to be writing to the same file concurrently. Here is a partial output where this is evident

NC_001134.8 RefSeq  transcript  420201  423041  .   -   .   ID=NM_001178434.1;CDS=420201:423041;CDSphase=0;geneID=YBR086CNC_001134.8    RefSeq  transcript  423765  424829  .   +   .   ID=NM_001178435.3;CDS=423765:424829;CDSphase=0;geneID=YBR087W;gene_name=RFC5;coverage=0.21
;gene_name=IST2;coverage=0.00

which should read

NC_001134.8 RefSeq  transcript  420201  423041  .   -   .ID=NM_001178434.1;CDS=420201:423041;CDSphase=0;geneID=YBR086C;gene_name=IST2;coverage=0.00
NC_001134.8    RefSeq  transcript  423765  424829  .   +   .   ID=NM_001178435.3;CDS=423765:424829;CDSphase=0;geneID=YBR087W;gene_name=RFC5;coverage=0.21

When using a single thread, the error does not occur, which further supports a race condition

gpertea commented 2 years ago

Thank you for confirming this, indeed it seems to be a thread synchronization/resource locking issue. -C was meant as an internal/checking option that we only used with the default single-threading run mode, did not expect to get wider use etc.

It's likely an easy fix but I am very busy at the moment so I cannot dev+test a fix right now. Pull requests for such fixes are always welcome.