vastgroup / vast-tools

A toolset for profiling alternative splicing events in RNA-Seq data.
MIT License

How to filter differential AS events between two sample groups output by the diff module #80

Closed. kehuangke closed this issue 5 years ago.

kehuangke commented 5 years ago

Dear Mirimia:

I have used the diff module of VAST-TOOLS, but there are several things in the result file I don't understand. About twenty thousand AS events were output by the diff module, with 80 samples in group one and 89 samples in group two.

  1. In the column named 'MV[dPsi]_at_0.95', most values equal 0. Only thirty AS events are non-zero.
  2. In the column named 'E[dPsi]', more than 80% of values are less than 0.1, or even equal to 0.

[image] https://user-images.githubusercontent.com/45514301/49441767-6ac94b00-f802-11e8-9395-51aea7d8f69c.png

That is a lot of AS events; how should I filter the output? @mirimia

Many thanks in advance,
Hike

mirimia commented 5 years ago

Honestly, with that many samples I'd use a Wilcoxon test plus a minimum absolute delta PSI (often >= 15). For that, you can use tidy or your own script to make sure at least 10 or 20 samples per group have good coverage. If I have time, I will make a variant of tidy that can take the groups as input.
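That combination (a rank-based test plus a minimum |dPSI|) can be sketched in pure Python; the function names below are illustrative and not part of vast-tools, and the p-value uses the common normal approximation for the Mann-Whitney U statistic (the unpaired form of the Wilcoxon test):

```python
import math

def average_ranks(values):
    """1-based ranks of `values`, averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test, normal approximation (no tie correction)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma if sigma else 0.0
    p = math.erfc(abs(z) / math.sqrt(2))
    return u1, p

def is_differential(psi_a, psi_b, min_dpsi=15, alpha=0.05):
    """Call an event differential if p < alpha and the mean |dPSI| is at
    least min_dpsi (PSI on a 0-100 scale, as in vast-tools tables)."""
    _, p = mann_whitney_u(psi_a, psi_b)
    dpsi = sum(psi_b) / len(psi_b) - sum(psi_a) / len(psi_a)
    return p < alpha and abs(dpsi) >= min_dpsi
```

For many samples per group the normal approximation is reasonable; with small groups an exact test (e.g. scipy.stats.mannwhitneyu) would be preferable.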

Re the use of diff, I am not sure it is appropriate here without applying extra filters (at least requiring N samples per group to have sufficient coverage; the default is only 1 sample). Ulrich may be better placed to give advice.

M

UBrau commented 5 years ago

Hi Hike,

The diff output contains all events for which at least --minSamples per group had a coverage of at least --minReads. Out of those, most are expected not to be differentially spliced between the groups.

It seems that in your dataset, only 30 events are actually differentially spliced with 95% likelihood (this is specified by the -r/--prob option). The column 'MV[dPsi]_at_0.95' indicates the expected dPSI at this likelihood. You may choose to just look for events where this is different from 0, and take the point estimate for dPSI from the column 'E[dPsi]'.
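That filter can be applied directly to the tab-separated diff output; a minimal sketch (the column name is taken from this thread, and the exact output layout may vary between vast-tools versions, so check your header line):

```python
import csv

def filter_diff(path, out_path, mv_col="MV[dPsi]_at_0.95"):
    """Keep only events whose expected dPSI at 95% likelihood is non-zero.

    Reads a tab-separated vast-tools diff output table and writes the
    surviving rows, header included, to out_path.
    """
    with open(path) as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            if abs(float(row[mv_col])) > 0:
                writer.writerow(row)
```

The same function could be pointed at a different column (or combined with a threshold such as 0.1) once you decide on the final criterion.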

That being said, we have never used diff with this many samples, and I don't know how it will perform. It is not impossible that it is overly stringent (although in our experience, when used with 3-4 samples per group, it performs well). You may try to lower the --prob a little and test if the additional differential events hold up in validation.

Ulrich


kehuangke commented 5 years ago


Hi, UBrau:

I see. I have tested 3 samples vs 4 samples, and it outputs many differential AS events whose 'MV[dPsi]_at_0.95' is not equal to 0. VAST-TOOLS seems not to perform well when the sample size is big.

After applying abs(E[dPsi]) > 0.02, most of the results are filtered out; only about seven thousand AS events of all kinds are retained. Is this an appropriate way to filter, and is the result filtered this way reasonable?

Also, is setting the option -r 0.9 appropriate? Can you suggest a reasonable way to rescue the results output by the diff module, or is it necessary to switch to different software?

I hope you can reply.

Many thanks in advance,
Hike

kehuangke commented 5 years ago


Hi, Mirimia:

I hadn't filtered the samples by coverage before using the diff module; I think that may be the root of this problem. Can you suggest an appropriate standard for filtering those samples?

I have tested the new option -r 0.9; is it reasonable? Or is it necessary to switch to different software?

Many thanks in advance,
Hike

mirimia commented 5 years ago

Hi Hike,

As with any analysis with any software, the best solution will depend on what your data look like. There is no single standard way; you will have to try several reasonable approaches and see which one gives a satisfactory result.

In your case, I'd try the following:

  • Filter the events based on coverage. For that you can use vast-tools tidy. I have implemented a new function, --groups FILE, which lets you select only those events with a minimum coverage in both groups. You'll need to first update vast-tools (git pull) and then make a FILE that lists each sample and its group, tab-separated: Sample1\tGroupA, Sample2\tGroupA, Sample3\tGroupB, etc.
  • Run vast-tools tidy. I'd start with --min_N 10 --p_IR, i.e. a minimum of 10 samples per group must have sufficient coverage. You can be more or less restrictive here. (I've just implemented the group option but haven't had time to test it properly; PLEASE check a couple of events to make sure everything works according to plan.)
  • With the output table, do a Wilcoxon test between the two groups and calculate the average PSI of each group.
  • Use the events with p < 0.05 (you may decide to Bonferroni-correct or not) and |dPSI| > 15 (or 10, or 25).
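The FILE for `vast-tools tidy --groups` is just a two-column, tab-separated table of sample names and group labels; a minimal sketch of writing one (the sample and group names here are made up for illustration):

```python
# Build a tab-separated groups file for `vast-tools tidy --groups FILE`.
# Sample names and group labels below are hypothetical placeholders;
# use the sample names from your own INCLUSION table.
samples = {
    "Sample1": "GroupA",
    "Sample2": "GroupA",
    "Sample3": "GroupB",
}

with open("groups.txt", "w") as fh:
    for sample, group in samples.items():
        fh.write(f"{sample}\t{group}\n")
```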

Cheers Manu

UBrau commented 5 years ago

Hi Hike,

I would use abs(MV[dPsi]_at_0.95) > 0 rather than something as arbitrary as abs(E[dPsi]) > 0.02. Alternatively, you can use something like 0.1, which would mean obtaining only those events for which there is a >95% likelihood that the change is at least 10 PSI. I may have been a bit unclear in my previous email: when I said to take the point estimate for dPSI from the column 'E[dPsi]', I meant to also use that as a criterion combined with abs(MV[dPsi]_at_0.95) > 0, e.g. |dPSI| > 10. With many samples, you could also decide to lower -r to 0.9.

As Manuel said, you should also filter your data for coverage and balance, which can be done using tidy. My usual suggestion is to run diff, do the filtering as well, and then report only the changing or unchanging events among those that survived filtering.

However, if this is indeed too strict with many samples, I concur with Manuel about instead doing a Mann-Whitney U test (after filtering).
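One way to implement that "report only events among those that survived filtering" step is a simple intersection of event IDs; a sketch, assuming the event ID is the first column of both tables (check this against your actual diff and tidy output layouts):

```python
def events_passing_both(diff_hits_path, tidy_path):
    """Restrict diff hits to events that survived coverage filtering.

    Both inputs are tab-separated tables with a header line; the event ID
    is assumed to be in the first column (adjust for your real layout).
    Returns the set of event IDs present in both files.
    """
    def first_column(path):
        with open(path) as fh:
            next(fh)  # skip header line
            return {line.split("\t", 1)[0] for line in fh if line.strip()}

    return first_column(diff_hits_path) & first_column(tidy_path)
```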

Ulrich

kehuangke commented 5 years ago


Hi Mirimia:

Thanks! It works and performs better than the diff module; even though there are still some small defects, the result is usable.

Thanks again!

Hike