Merging bams for including novel splice sites makes a really big bam :(

aleighbrown commented 5 years ago

Hi,

I have about 30 samples which I want to do a differential splicing analysis on.

The instructions for including unannotated splice sites suggest that I first merge all my bams into a single bam and then pass that bam to the index call. However, if I merge all my aligned bams, we're talking about a bam of around 200G, which is just going to be deleted eventually anyhow.

Is there a fast way around this problem you could recommend, e.g. building a separate index for each bam or something hacky like that?

If the bam files are being used just to find unannotated splice junctions, could one not use the SJ.out.tab files output by STAR instead, for example, rather than needing the bam file itself?

Or could one use these SJ.out.tab files to construct a fake gtf including the novel splicing sites?
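For reference, here is a minimal Python sketch of pulling unannotated junctions out of STAR SJ.out.tab files and pooling their read counts across samples. It relies only on STAR's documented nine-column layout; the function names `novel_junctions` and `pool_files` are hypothetical, and the result is a plain table of junctions, not something the Whippet indexer can consume.

```python
from collections import Counter
from typing import Iterable

# STAR SJ.out.tab columns (tab-separated): chrom, intron start, intron end,
# strand (0 = undefined, 1 = +, 2 = -), intron motif, annotated (0 = no, 1 = yes),
# uniquely mapped reads, multi-mapped reads, maximum overhang.
STRAND = {"0": ".", "1": "+", "2": "-"}


def novel_junctions(lines: Iterable[str]) -> Counter:
    """Sum uniquely mapped reads per unannotated junction in one SJ.out.tab."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        chrom, start, end, strand, _motif, annotated, unique = fields[:7]
        if annotated == "0":
            key = (chrom, int(start), int(end), STRAND.get(strand, "."))
            counts[key] += int(unique)
    return counts


def pool_files(paths: Iterable[str]) -> Counter:
    """Pool unannotated-junction read counts over several SJ.out.tab files."""
    pooled: Counter = Counter()
    for path in paths:
        with open(path) as handle:
            pooled.update(novel_junctions(handle))
    return pooled
```

Calling `pool_files` on the 30 per-sample SJ.out.tab paths would give one read count per candidate novel junction without ever building the merged bam.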

timbitz commented 4 years ago

Hi @aleighbrown, yes, this is a caveat of the current system. In theory, Whippet could probably use the SJ.out.tab file in addition to a GTF file (which is required because it has txStart and txEnd positions and also the known full isoforms) to add the novel splice sites, but this would take a bit of implementation...

timbitz commented 4 years ago

Just to quickly add to this-- with such a big bam file you're going to want to increase the --bam-min-reads flag to the indexer to something more appropriate (the default is 1!), otherwise the index created is going to contain one-off cryptic splice sites and perhaps even alignment errors as well.
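One rough way to choose a value for that threshold is to check how many candidate junctions survive at each cutoff. The sketch below assumes a hypothetical mapping `pooled_counts` from junction to total supporting reads (for example, tabulated from the merged data). It illustrates the filtering idea only and is not how Whippet counts reads internally.

```python
from typing import Dict, Hashable, Iterable


def surviving_junctions(
    pooled_counts: Dict[Hashable, int], thresholds: Iterable[int]
) -> Dict[int, int]:
    """For each candidate minimum-read threshold, count the junctions
    supported by at least that many reads."""
    return {
        t: sum(1 for reads in pooled_counts.values() if reads >= t)
        for t in thresholds
    }
```

A sharp drop between a threshold of 1 and 2 would suggest that many of the junctions are one-off events of the kind described above.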

aleighbrown commented 4 years ago

The other possibility here would be using Cufflinks or StringTie to make a merged gtf that includes the novel transcripts found in the bams and then building a Whippet index off of that. Do you have any thoughts on whether that would work?

timbitz commented 4 years ago

I don't know-- That would be crossing into uncharted territory. I suppose you could try and test it by simulating reads from a GTF file that is downsampled by removing exons or alternative splice sites?
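As a starting point for such a test, the following Python sketch removes a random fraction of internal exons from GTF lines. It is only one possible way to downsample an annotation: the function name `drop_internal_exons` and its parameters are hypothetical, first and last exons are kept so transcript boundaries stay the same, any CDS or UTR records overlapping a removed exon are left untouched, and read simulation itself is not covered.

```python
import random
import re
from collections import defaultdict
from typing import Iterable, List

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def drop_internal_exons(
    gtf_lines: Iterable[str], fraction: float, seed: int = 0
) -> List[str]:
    """Return the GTF lines with roughly `fraction` of internal exons removed."""
    lines = list(gtf_lines)
    exons = defaultdict(list)  # transcript_id -> [(exon start, line index), ...]
    for i, line in enumerate(lines):
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), i))

    rng = random.Random(seed)  # seeded so the downsampling is reproducible
    dropped = set()
    for records in exons.values():
        # Sorting by start coordinate lets us skip the first and last exon.
        internal = sorted(records)[1:-1]
        dropped.update(i for _start, i in internal if rng.random() < fraction)
    return [line for i, line in enumerate(lines) if i not in dropped]
```

Reads simulated from the full GTF could then be quantified against an index built from the downsampled annotation, to see whether the removed exons are recovered.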

itszhengan commented 4 years ago

Hi @aleighbrown and @timbitz, I need to find novel junctions across almost 30,000 samples. Do you have any advice?

timbitz commented 4 years ago

@itszhengan, currently Whippet is not really designed to compare de novo splicing quantifications across large cohorts of samples like 30K. And if you were going to do that, I wouldn't build a single comprehensive index of all de novo splicing from a single merged bam file-- I would probably build one for each sample, and then compare the node structures or de novo junctions across the samples somehow (but this is not completely straightforward as-is, with overlapping nodes of CE/RI type, for example).

The only purpose of having a single index (as opposed to many) is to perform quantitative differential splicing analysis between two sets of samples-- but if the goal is only to identify/study de novo splicing, then I don't see how comparing splicing quantifications between two sets of samples, where one has an alternative splicing event and the other does not, really makes sense-- a qualitative comparison seems sufficient. The example in the documentation is more for analyzing various healthy tissues (or poorly annotated species), where the annotation is lacking and a single comprehensive index enables better quantitative comparisons.
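A qualitative comparison of that kind could be as simple as asking in how many samples each de novo junction was seen. The Python sketch below assumes a hypothetical `per_sample` mapping from sample name to the set of junctions detected in that sample. It works on junctions only, so it does not address the overlapping-node problem mentioned above.

```python
from collections import Counter
from typing import Dict, Hashable, Set


def junction_recurrence(per_sample: Dict[str, Set[Hashable]]) -> Counter:
    """Count the number of samples in which each junction was observed."""
    recurrence: Counter = Counter()
    for junctions in per_sample.values():
        recurrence.update(junctions)  # each junction counts once per sample
    return recurrence


def recurrent_junctions(
    per_sample: Dict[str, Set[Hashable]], min_samples: int
) -> Set[Hashable]:
    """Junctions seen in at least `min_samples` samples."""
    counts = junction_recurrence(per_sample)
    return {junction for junction, n in counts.items() if n >= min_samples}
```

Because each sample contributes only a set of junction keys, this scales to large cohorts more gracefully than merging alignments would.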

I am still planning to make additions to Whippet specifically to analyze de novo splicing patterns across large cohorts of samples... but I haven't started yet, and am not sure when this will be available after I do.

itszhengan commented 4 years ago

@timbitz Thank you for your reply.

aleighbrown commented 4 years ago

For what it's worth, I've had some success using MAJIQ on our data set, but that's only 30+ samples, with the caveat that its outputs are more difficult to interpret than Whippet's... you could also try http://yeolab.github.io/outrigger/

itszhengan commented 4 years ago

Thank you for this information!

Zheng An Administrative Assistant China-Japan Union Hospital of Jilin University



itszhengan commented 4 years ago

Hi @aleighbrown, have you tried "Use Cufflinks or Stringtie to make a merged gtf that includes the novel transcripts found in the bams"? I saw a GTF-rebuilding step, similar to what you described, in this paper from 5 years ago (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0168-9), but I still don't know the justification or the potential impact. Several annotation-free tools, such as leafcutter, have been proposed to deal with the lack of a reference annotation file. However, those are not based on AS events or don't consider intron retention, so I have to use the traditional annotation-based approach for AS event analysis. It seems that Whippet is the only recent method that fits my needs. Do you have any suggestions?

aleighbrown commented 4 years ago

Might want to stop clogging this GitHub issue on this point since it's not Whippet's issue per se; feel free to email or Twitter DM me about it
