gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

--merge disregards -G option? #350

Closed NoahHenrikKleinschmidt closed 2 years ago

NoahHenrikKleinschmidt commented 2 years ago

Hello there,

I am using STRINGTIE 2.2.0.

I have a set of GTF files generated from BAM files (created by STAR 2.7.9a) that belong to an RNA-Seq experiment with three conditions (Control, Gene-X Knockdown, and Knockout) and came from a stranded library. As I am doing DE analysis downstream, I wish to generate a merged GTF file from the ones I already generated. However, the results produced by STRINGTIE vary markedly depending on how I approach the issue, and none of the results are what I hoped for. Here's the workflow I used:

I set out with a number of GTF files generated for my samples by: stringtie --rf -G $REF_ANNOTATION -o $OUTPUT_LOC -A $OUTPUT_ABUNDANCE_LOC -C $OUTPUT_FULLY_COVERED -m $MIN_LENGTH $sample_bam

Where $REF_ANNOTATION points to a gencode.v38.annotation.gff3 file, and $sample_bam is one of the sorted-by-coordinate BAM files that STAR generates. This step works like a charm, the output is a GTF file containing both referenced and potentially novel transcripts, with about a million lines in each file.

Subsequently, I intended to merge the generated GTF files into one by: stringtie --merge -o "$STRINGTIE_DIR/$OUT_NAME" -G "$REF_ANNOTATION" "$STRINGTIE_DIR/merge.index"

At this step, the merged file shows around two million lines (i.e. twice the number of any of the input GTFs; seems a bit odd to me). Given that the library is stranded I tried to specify --rf in the --merge as well (despite the fact the doc does not mention this as an option), and was promptly rewarded with a corresponding error.
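For reference, the merge step described above can be sketched as follows. The helper names (`build_merge_list`, `run_merge`) and the `merged.gtf` default are hypothetical and only illustrate the idea; the `stringtie --merge -G ... -o ... <list file>` call mirrors the command quoted in the post. No strandedness flag is passed, since the post reports that `--merge` rejects `--rf`.

```shell
# Sketch only: helper names and file layout are assumptions for illustration.

# Write one per-sample GTF path per line into the list file.
# An earlier merged output (default name: merged.gtf) is skipped so that it
# is not fed back into the next merge.
build_merge_list() {
    local gtf_dir="$1" list_file="$2" merged_name="${3:-merged.gtf}"
    find "$gtf_dir" -maxdepth 1 -name '*.gtf' ! -name "$merged_name" | sort > "$list_file"
}

# Run the merge with the reference annotation, as in the command quoted above.
run_merge() {
    local ref="$1" list_file="$2" out="$3"
    stringtie --merge -G "$ref" -o "$out" "$list_file"
}

# Example usage (requires stringtie on PATH):
#   build_merge_list "$STRINGTIE_DIR" "$STRINGTIE_DIR/merge.index"
#   run_merge "$REF_ANNOTATION" "$STRINGTIE_DIR/merge.index" "$STRINGTIE_DIR/merged.gtf"
```

Inspecting the list file before merging (for example with `cat`) is a cheap way to confirm that only the intended per-sample GTFs go in.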

However, the main issue is not that the number of merged lines seems too high, the issue that bugs me the most is that any reference information from the gencode GFF3 file is lost in the merged GTF file.

Since STRINGTIE accepts multiple BAM files, I thought I'd just pass all my BAM files to STRINGTIE directly so it would output a single GTF file (for this approach I omitted any other options): stringtie --rf -o "$STRINGTIE_DIR/$OUT_NAME" -G "$REF_ANNOTATION" $bam_files

This actually did work and retained references nicely, but resulted in a GTF file of only some 800'000 lines. As I wrote this question I realised that I forgot to specify the same -m $MIN_LENGTH in the above command, which might explain the differences. For whatever reason the cluster is suddenly super busy, so all my new Slurm jobs are queued and I could not repeat this to check.

I am not really sure where I go wrong in trying to get a merged GTF file that has references. I think it's odd for the merged GTF to have either twice as many entries as any of the individual ones, or fewer than any single one, depending on how you generate them. But my primary issue is still the missing references. Looking at the documentation and online, I was not able to find a solution for this.
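Raw line counts mix transcript and exon records, so a per-feature count is a more useful comparison between GTF files. The sketch below counts transcript records and, separately, those whose `transcript_id` starts with `MSTRG`, the default prefix that `--merge` assigns unless `-l` is given. It assumes that transcripts matched to the reference keep a non-`MSTRG` ID; the function names are hypothetical.

```shell
# Sketch only: function names are made up for illustration.

# Count records whose feature column (column 3) is "transcript".
count_transcripts() {
    awk -F'\t' '$3 == "transcript" { n++ } END { print n + 0 }' "$1"
}

# Count transcript records whose transcript_id starts with the default
# merge prefix "MSTRG" (assumed here to mark transcripts without a reference ID).
count_novel_transcripts() {
    awk -F'\t' '$3 == "transcript" && $9 ~ /transcript_id "MSTRG\./ { n++ } END { print n + 0 }' "$1"
}

# Example usage:
#   count_transcripts merged.gtf
#   count_novel_transcripts merged.gtf
```

If the second number equals the first, no reference transcript IDs made it into the merged file, which would match the symptom described here.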

Maybe someone can help? I'll gladly provide any additional information needed, thanks in advance :-)

UPDATE: I had a rather weird day it would seem. The GFF3 file that was used was a linked version of the original file. When I pointed to the original location, the --merge seemed to retain the annotations after all (although I do not see any reason why this should have actually made a difference??). Anyway, I am happy to say, that I got a merged GTF file now that works for me :-)

One other issue came up though. As I tried to rerun Stringtie with all the BAM files, I kept getting a Memory Usage Error. This behaviour repeated even when I used the default minimum length. The very odd thing is that this very approach did work once (as outlined above) and failed on every subsequent try (I've tried about 10 times now, and it keeps failing with the same memory error...). Any idea why this might be?

gpertea commented 2 years ago

I would speculate that the issue with the symlinked GFF3 file arose because the symlink was somehow invalid on the compute node, or pointed to something different there. This is quite common in grid environments, where a path that is valid on the login nodes may not exist, or may not resolve to the same file, on the compute nodes.
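One way to catch this kind of problem is to validate the annotation path inside the job script itself, so the check runs on the compute node. The following is a minimal sketch; `check_annotation` is a hypothetical helper, and `readlink -f` is assumed to be available (it is part of GNU coreutils).

```shell
# Sketch only: check_annotation is a made-up helper, not part of StringTie.

# Fail early if the annotation path is missing, dangling, unreadable, or empty;
# otherwise print the fully resolved path so it ends up in the job log.
check_annotation() {
    local path="$1"
    # -e follows symlinks, so it is false for a dangling link.
    if [ ! -e "$path" ]; then
        echo "annotation missing or symlink dangling: $path" >&2
        return 1
    fi
    # -r: readable by this user; -s: file exists and is not empty.
    if [ ! -r "$path" ] || [ ! -s "$path" ]; then
        echo "annotation unreadable or empty: $path" >&2
        return 1
    fi
    readlink -f "$path"
}

# Example usage inside a job script:
#   check_annotation "$REF_ANNOTATION" || exit 1
```

Comparing the printed path between a login node and a compute node shows whether the link resolves to the same file in both places.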

As for the other issue: no, please do not run stringtie with multiple samples (BAM files) at once. Not only can that create memory usage issues like the one you encountered (I suppose your grid job tried to allocate more memory than you requested when you submitted it), but multiplexing samples this way also breaks some important sampling/statistical assumptions that StringTie's network flow algorithm relies on to work correctly.
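In practice this means one stringtie run per BAM file, followed by `--merge` on the resulting GTFs. A minimal sketch of the per-sample loop is shown below; the helper names are hypothetical, the options are limited to those quoted earlier in the thread, and each BAM file is assumed to have a distinct base name.

```shell
# Sketch only: helper names are made up; BAM base names are assumed unique.

# Derive the per-sample output GTF path from a BAM path.
gtf_for_bam() {
    local bam="$1" out_dir="$2"
    echo "$out_dir/$(basename "$bam" .bam).gtf"
}

# Assemble each sample separately: one stringtie invocation per BAM file.
assemble_each() {
    local ref="$1" out_dir="$2" bam
    shift 2
    for bam in "$@"; do
        stringtie --rf -G "$ref" -o "$(gtf_for_bam "$bam" "$out_dir")" "$bam"
    done
}

# Example usage (requires stringtie on PATH):
#   assemble_each "$REF_ANNOTATION" "$STRINGTIE_DIR" sample1.bam sample2.bam
```

On a cluster, each iteration could also be submitted as its own job, which keeps the memory footprint of every job down to that of a single sample.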

NoahHenrikKleinschmidt commented 2 years ago

Thanks a lot for the reply. I didn't know that stringtie is not meant to be run on multiple BAM files at once; I always assumed that since it works, it's fine. Thanks for the info, I'll keep it in mind in the future :-)