Not fully output all reference transcripts with annotation file

charlin90 commented 9 years ago

Hi,I'm using stringtie to quantify reference transcripts expression with annotation file(UCSC GTF).However,only ~26000 reference transcripts were output(total 99900).I guess,it may be the influence of -c parameter.Confusingly,when I reset "-c 0.0",there is an error "Segmentation fault (core dumped)";when I reset "-c 0.1",there is no any error,but output transcripts remain ~26000,nearly euqal compared to "-c 2.5". I want to obtain all reference transcripts expression values(although 0),can you tell me how to do that using stringtie?and,why presented above confusing results when adjusted -c parameter?

Thank you very much Charlin

gpertea commented 9 years ago

Sorry, StringTie was devised to report only transcripts that could be assembled from the given evidence (read alignments) - i.e. "expressed" transcripts only. As opposed to Cufflinks, the -G mode does not skip the transcript assembly step just to perform expression estimation for the given transcripts. StringTie always tries to assemble the given read alignments into transcripts (sometimes with the help/guidance of the reference annotation) and only if the resulting assemblies match the reference transcripts it will make a note of that in the output (_referenceid attribute).

You are correct about the bug we have in the code for "-c 0.0" option - we should've checked that the value given is always >0.0 and prevent the crash. A value of 0 is actually not allowed for this option.

Even so, the -c option is not doing what you wanted and it's only applied as an initial pre-assembly filter during the bundle coverage assessment (i.e. when overlapping reads are grouped together in genomic regions before assembly). StringTie cannot assemble reads (reliably) if the ratio between the (covered) region length and the number of reads is too high (i.e. coverage too low). The default of 2.5 is quite loose, and by lowering that value one would only force StringTie into processing a few more sparse bundles formed by potentially spurious read alignments.. (which may also come from transcripts with expression so low that they could be practically considered "not expressed").

charlin90 commented 9 years ago

Thanks,I see.But,I feel confused about another parameter for "-m" option,given explanation(Sets the minimum length allowed for the predicted transcripts. Default: 200) seems ambiguous.I feel that simply means assembed transcripts are greater than or equal 200 in length.If so,some reference transcripts are seemly ignored and falsly give expression values with 0 in output t_data.ctab file.When I tried to set "-m 1" with annotation file,I found it is not apparently different in numbers for transcripts in output gtf file.Whether that means assembly with annotation file ignored this option compared to de novo assembly or this option don't influence assembly with annotation file and only limited in de novo assembly？However,accoding to your above reply, this option seems have an influence on assembly whatever any condition,yet our trial don't markedly confirm this expectation,why?If so, output t_data.ctab file included all reference transcipts seem falsly gave 0 as expression value for those ignored transcripts?And,why set this option for transcriptome assembly,it is seemly not very meaningful for obtaining all transcripts included those short RNAs? Thanks a lot again

gpertea commented 9 years ago

-m is indeed filtering the output to eliminate short assembled transcripts which might have been assembled from spurious read mappings; stringtie is targeting longer, complex transcripts that are usually hard to assemble from regular RNA-Seq data, instead of short RNAs (e.g. microRNAs) which are sequenced using a different protocol and are usually analyzed using simpler methods.

The user could lower this limit, but the default 200 does serve the purpose of filtering out isolated, likely incorrect read mappings. However this limit is ignored (or should be - we'll have to double check) when the annotation shows that in that bundle there are reference transcripts shorter than that and StringTie could assemble them with the exact same exon structure.

tantrev commented 8 years ago

For samples that are low-coverage (such as single-cells), is it advised to instead use Cufflinks over Stringtie? Luckily, I have some bulk cell population data but unfortunately all of the data I'm working with has short reads (25bp).

mpertea commented 8 years ago

Sorry, but we did not test StringTie nor Cufflinks on single-cell data.

gpertea / stringtie

Not fully output all reference transcripts with annotation file #4