vastgroup / vast-tools

A toolset for profiling alternative splicing events in RNA-Seq data.
MIT License

A bunch of questions #63

Closed ZheFrench closed 6 years ago

ZheFrench commented 7 years ago

Hey, I'm playing with vast-tools to analyze ribo-zero (rRNA-depleted), stranded, paired-end RNA-seq data, with 2 replicates in each condition.

vast-tools align unT1_REP1_2016_1.fastq.gz unT1_REP1_2016_2.fastq.gz -c 8 -o OUT
vast-tools align unT1_REP2_2016_1.fastq.gz unT1_REP2_2016_2.fastq.gz -c 8 -o OUT

vast-tools align MANT_REP1_2016_R1.fastq.gz MANT_REP1_2016_R2.fastq.gz -c 8 -o OUT
vast-tools align MANT_REP2_2016_R1.fastq.gz MANT_REP2_2016_R2.fastq.gz -c 8 -o OUT

vast-tools combine -o OUT -sp Hsa -a hg38

vast-tools compare OUT/INCLUSION_LEVELS_FULL-Hsa4-hg38.tab --no_plot -a MANT_REP1_2016_R1,MANT_REP2_2016_R1 -b unT1_REP1_2016_R1,unT1_REP2_2016_R1 --paired

At the end of the analysis, when I run compare I get a lot of retained introns (I mean A LOT)... mainly because it's ribo-zero data, but I get only 3 microexons... So I was wondering if I'm doing something wrong? I know for sure there should be more cassette exons (from papers on the same cell line, and I also tested the dataset with other tools: rMATS, SUPPA, Whippet, etc.).

I'm just wondering what I'm missing...

Thanks

mirimia commented 7 years ago

Hey,

What cell line is it? If it's not a neural one, I guess it's expected. How many exons do you get in the third and fourth columns of the compare output summary? (These are the totals of events with good coverage that were compared; the third is the total number and the fourth is the number with 10 < PSI < 90 in any of your samples.)

Have you tried to validate the events that are called by other software for which vast-tools tells you there are no differences? ;-)

M

ZheFrench commented 7 years ago

AS_TYPE      Higher_in_MANT_REP1_2016   Higher_in_unT1_REP1_2016   Total_compared   10<PSI<90
Microexons   2                          1                          305              41
Long_AltEx   0                          0                          0                0
Intron_ret   41                         3344                       92650            22078
Alt_3ss      0                          0                          0                0
Alt_5ss      0                          0                          0                0

It's the MCF10A cell line: epithelial cells transformed into mesenchymal cells after tamoxifen treatment. Yep, I checked some of them, but clearly there are always a lot of false positives... hard to tell the true ones without RT-qPCR validation :-) But 0 Long_AltEx and 22078 retained introns... really weird, so I was wondering where I could have made an error in the process. Is it normal for paired-end data that only the name of the forward read (*_REP1) is kept in the combine and final output?

From the summary, I understand it found 305 microexon events with good coverage in both MANT and unT, 41 of them have a PSI between 10 and 90, and only 3 differ between the two conditions. By the way, my reads are 125 nucleotides long and I didn't change the threshold of 50 used to cut the reads during the alignment process; I don't know if that can play a role somewhere.

Thanks

mirimia commented 7 years ago

Hey,

Clearly, something is wrong. For some reason, it seems the modules that quantify exon skipping have failed. 0,0,0,0 means there is nothing there, which obviously can't happen. Do you have the output summary of vast-tools? I don't even understand how you managed to get an INCLUSION table. (Could you tell me how many lines that table has, by the way?)

RE some specific questions:

ZheFrench commented 7 years ago

Sorry for the late response; I wasn't at work and won't be until Friday. There are 264504 lines in INCLUSION_LEVELS_FULL-Hsa4-hg38.tab. The DiffAS output has 3389 lines: 3 MIC and the rest is IR only :/ One of the .eej2 files inside the to_combine directory is empty. It shouldn't be, right? Raw_incl & Raw_read seem OK. So I can rerun the align step on that condition to be sure...

mirimia commented 7 years ago

Yes, something went wrong in "align". If you have the report, I can take a look. But honestly, I'd remap everything.

ZheFrench commented 7 years ago

Yeah, right. In one condition, one of the _R2 files was missing. But "combine" processes the rest even if there was an error before... thanks for your help :) Last question: I want to classify the events, and I was wondering if you have a snippet to share for filtering/classifying them, because I will clearly parse the output to keep the ones that look most robust across the replicates of each condition.

mirimia commented 7 years ago

I'm not sure what you meant by "classify"... could you elaborate?

ZheFrench commented 7 years ago

Good question. I should have thought about it before asking :)

I'm thinking about a strategy, so it's not clear yet even to me... I'd like to try to remove false positives and keep the events that are probably the most reliable... I preferred to ask before parsing/changing the file.

Each event is classified into a complexity group: S, C1, C2, or C3, based on the scores of each replicate in the control and test conditions. So is C3 better than S because the percentage of complex reads is higher, or is it the opposite?

Each replicate has 5 scores, and that was the point that disturbed me. I wanted to take them into account to refine the results. I'm used to treating rMATS output, where I just use the reads that support inclusion/exclusion.

I was thinking I would remove events where the scores are not the same in one of the replicates, remove VLOW, or remove events where score 4 (the ratio on the upstream/downstream exon junctions) differs between replicates.

Finally, I would compute a mean PSI per condition, filter out events with a mean dPSI <= 0.2, and try a Fisher test on the inclusion/exclusion reads in the @inc,exc section. Then I'd sort by -log10(Fisher p-value) and also monitor the number of reads.
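Concretely, the test I have in mind would look like this for a single event (a sketch with made-up read counts, pooling the two replicates per condition):

# Fisher's exact test on a 2x2 table of inclusion/exclusion reads taken from the @inc,exc field
# (made-up counts: condition A = 25 inc / 3 exc, condition B = 5 inc / 20 exc)
Rscript -e 'fisher.test(matrix(c(25, 3, 5, 20), nrow = 2))$p.value'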

By the way, you are not taking into account the fact that the library can be stranded, so you are counting all the reads?

Thanks a lot for your insights, and sorry to bother you with sometimes naive questions, but this helped me go deeper into the use of vast-tools.

UBrau commented 7 years ago

Hi ZheFrench,

In order to get a more high-confidence set of differential events, I recommend taking a look at the diff module, which parallels compare. It takes the number of reads and the difference between replicates into account (it fits a beta distribution to each condition, based on the individual PSI values and read counts, that represents the likelihood distribution of the real PSI in that condition, then compares the overlap between the distributions) and writes a table to STDOUT with a point estimate of dPSI for each event, plus the estimated minimum dPSI at a given significance level.

What I like to do is require a minimum |dPSI| for the point estimate (column 'E[dPsi]'), plus any minimum change other than 0 for 'MV[abs(dPsi)]_at_0.95'. I also like to increase the -s value for increased robustness, but that affects compute time, which is the weak point of diff. As a by-product, you also get an estimate of the PSI in each condition that is likely a bit more realistic than a simple average, since PSI values are highly non-uniformly distributed.
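Something along these lines (a rough sketch: I'm assuming the input table and the -a/-b replicate arguments mirror compare, and that E[dPsi] and MV[abs(dPsi)]_at_0.95 are the last two columns of the output; check vast-tools diff --help before running):

# run diff on the same groups used for compare (arguments assumed to mirror compare)
vast-tools diff OUT/INCLUSION_LEVELS_FULL-Hsa4-hg38.tab -a MANT_REP1_2016_R1,MANT_REP2_2016_R1 -b unT1_REP1_2016_R1,unT1_REP2_2016_R1 -s 1000 > diff_out.tab

# keep the header plus events with |E[dPsi]| >= 0.15 and a non-zero minimum change at the 0.95 level
# (E[dPsi] and MV[abs(dPsi)]_at_0.95 assumed to be the last two columns)
awk -F'\t' 'NR == 1 || (($(NF-1) >= 0.15 || $(NF-1) <= -0.15) && $NF > 0)' diff_out.tab > diff_filtered.tab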

ZheFrench commented 7 years ago

Thanks @UBrau, I'll give it a try, but the note in the differential analysis part of the documentation scared me a little bit...

"IMPORTANT NOTE: diff is still an experimental part of this package and is currently under development and testing. Please use at your own knowledge and risk."

So I did not go that far into the analysis process... Thanks

mirimia commented 7 years ago

Hey,

So, some answers:

Each event is classified into a complexity group: S, C1, C2, or C3, based on the scores of each replicate in the control and test conditions. So is C3 better than S because the percentage of complex reads is higher, or is it the opposite?

R: Neither is better than the other. They just have different biological meanings. S is "simple" (i.e., the classic single cassette exon); C3 is complex (e.g., a mutually exclusive exon).

Each replicate has 5 scores, and that was the point that disturbed me. I wanted to take them into account to refine the results. I'm used to treating rMATS output, where I just use the reads that support inclusion/exclusion.

R: You don't have to use the scores (in fact, you probably shouldn't). I suggest you use only the first score, and the number of reads after the @.

I was thinking I would remove events where the scores are not the same in one of the replicates, remove VLOW, or remove events where score 4 (the ratio on the upstream/downstream exon junctions) differs between replicates.

R: Strongly not recommended. Remove only those that have N in the first score, or also VLOW if you want to be stringent.
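For example, with a rough awk one-liner (a sketch: the quality-column index and the "score1,...,score5@inc,exc" layout are assumptions, so check the header of your table first):

# keep the header plus events whose first quality score is neither N nor VLOW
# (the quality string of the first sample is assumed here to be in column 8)
awk -F'\t' 'NR == 1 {print; next} {split($8, q, ","); if (q[1] != "N" && q[1] != "VLOW") print}' INCLUSION_LEVELS_FULL-Hsa4-hg38.tab > filtered.tab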

Finally, I would compute a mean PSI per condition, filter out events with a mean dPSI <= 0.2, and try a Fisher test on the inclusion/exclusion reads in the @inc,exc section. Then I'd sort by -log10(Fisher p-value) and also monitor the number of reads.

R: I also don't like that. Sorting by -log10 of the p-value, you will end up with a list of events "nicely" ranked by the expression of the host gene. I personally use the compare module more than anything. It's a very simple test, but so far it works well biologically. But everyone can have their own opinion. You can run compare, and then filter out those that do not have a certain p-value.
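For example, something like this (a sketch: --min_dPSI, --min_range, and --p_IR are assumed to be available in your version of compare; check vast-tools compare --help):

# stricter compare run: require dPSI >= 25, require the PSI ranges of the two groups
# to be separated by at least 5, and filter retained introns by the binomial-test p-value
vast-tools compare OUT/INCLUSION_LEVELS_FULL-Hsa4-hg38.tab -a MANT_REP1_2016_R1,MANT_REP2_2016_R1 -b unT1_REP1_2016_R1,unT1_REP2_2016_R1 --min_dPSI 25 --min_range 5 --p_IR --paired --no_plot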

By the way, you are not taking into account the fact that the library can be stranded, so you are counting all the reads?

R: no, sorry.

ZheFrench commented 7 years ago

Great.

Yeah, Fisher can be biased by gene expression, but then, when doing RT-qPCR, it's easier and you probably have a better chance to validate events when the expression level is good... but I see your point.

To summarize: use the @ inclusion/exclusion reads; keep Fisher not to sort, but to filter out events based on a p-value threshold; and look only at the first score in each replicate, to eventually remove VLOW or N.

Thanks

timbitz commented 7 years ago

@ZheFrenchKitchen, just to clarify that statement:

"IMPORTANT NOTE: diff is still an experimental part of this package and is currently under development and testing. Please use at your own knowledge and risk."

This is experimental because we haven't yet thoroughly examined the sensitivity and specificity of the method, not because it isn't based on sound statistics or isn't something we use. If you want to incorporate the variance introduced by higher or lower read counts, diff is definitely better than an ad hoc solution, or anything frequentist, IMO. So while we call it 'experimental', I think @UBrau has already used this method in one of his publications, for example. It is also quite similar to the implementation I provide as the default in Whippet.jl, which I am not likely to change much, and will benchmark soon.