statOmics / satuRn

satuRn is a highly performant and scalable method for performing differential transcript usage analyses.
https://statomics.github.io/satuRn/
20 stars 1 forks source link

Error fitDTU : Error in if (m == 0) { : missing value where TRUE/FALSE needed #29

Closed TdzBAS closed 1 year ago

TdzBAS commented 1 year ago

Hi Jeroen,

thanks for this tool!

I tried to follow the main vignette with my own dataset, but encountered the error mentioned above, after running fitDTU. Can you help me out here? Maybe you can reproduce the error?

my expression matrix: merged_TPM_saturn.txt

my meta_file: meta_saturn.txt

my code: saturn.txt

The data comes from GRCh37.p13, but i fetch for GRCh38. But I gues this shouldnt interfere with the overall functionality.

Looking really forward!

Cheers

jgilis commented 1 year ago

Hi @TdzBAS,

I was able to run your code and reproduce the error. When I run your code, I think you made a small mistake in generating your group variable. It seems that the sample label is still attached to the group label. See screenshot:

Screen Shot 2023-04-19 at 10 06 12

You have 600 observations in the data you gave me. With this code, you will ask satuRn to fit a GLM with as many covariates as observations, which is not possible. I understand that you probably did this because you were following our vignette. There, we indeed made an addtional variable called group, which was a combination of the brain region and cluster (in our case, cell type) labels. However, in your data, "cluster" seems to be a unique identifier of each sample within a region. Does this "cluster" variable have any biological meaning, i.e., are samples of cluster == "01" in regions "a" and "b" related? If not, you simply need to perform an analysis between regions "a" and "b". Hence, I suggest changing your code for the fitDTU function to formula = ~ 0 + region,. When I do this, the analysis runs in 2 seconds.

Best regards,

Jeroen

TdzBAS commented 1 year ago

Big Thanks!

That was actually the reason, and I was able to run it once with the StageR test at the end.

However, I noticed that in the vignettes, the function satuRn::testDTU has the parameter forceEmpirical. The documentation explains that if you have fewer than 500 features, you can calculate the empirical p-values with the TRUE flag. However, it seems that this parameter is no longer available (satuRn 1.7.3).

I actually have less than 500 features in one group. unfortunately, I do not get any empirical p-values anymore, still it is recommended. Plots also do not appear. I would really like to see the plots! Is it because I have less than 500 features?

Otherwise, everything seems to be running very well. So, I simply used the raw p-values for the assignment:

transcript level p-values from satuRn

pvals <- rowData(sumExp)[["fitDTUResult_Contrast1"]]$pval

Can I give you my new data and metafile for a sanity check?

Thanks a lot Jeroen!

Cheers, Tolga

jgilis commented 1 year ago

Hi @TdzBAS,

You should still be able to force getting empirical p-values in satuRn 1.7.3., by setting forceEmpirical = TRUE in the testDTU function.

Yes, if you send me the updated files I will run them.

Best regards,

Jeroen

TdzBAS commented 1 year ago

Hi @jgilis,

I tried it out. Also for the vignettes, but unfortunately I get the Error unused argument, when i want to use forceEmpiricial = TRUE.

Here are all my updated files (with raw p_value).

merged_TPM_saturn.txt meta_saturn.txt trysaturn.txt

Looking forward!

Best, Tolga

jgilis commented 1 year ago

Hi @TdzBAS,

I think that you are probably not on satuRn 1.7.3. then. Can you run packageVersion("satuRn") to see what version you are on?

TdzBAS commented 1 year ago

Hi @jgilis ,

Did so. It says 1.7.3. Have the same Error for 1.6.0 somehow.

Cheers

jgilis commented 1 year ago

Hi @TdzBAS,

I tried out your code and running the empirical correction and generating the diagnostic plots just worked after changing this original code

sumExp <- satuRn::testDTU(
  object = sumExp,
  contrasts = L,
  diagplot1 = TRUE,
  diagplot2 = TRUE,
  sort = FALSE
)

and just adding the last line:

sumExp <- satuRn::testDTU(
  object = sumExp,
  contrasts = L,
  diagplot1 = TRUE,
  diagplot2 = TRUE,
  sort = FALSE,
  forceEmpirical = TRUE
)

The first diagnostic figure looks like this:

Screen Shot 2023-04-21 at 08 42 15

In this case, I would actually advice against using the empirical correction. If you look at the diagnostic plot, it seems to have three very large modes. The one in the middle are the ones with a z-score around zero, so non-differentially expressed features. The one on the left are downregulated (lower usage in b.1 compared to a.1), those on the right are upregulated. But you see that actually a very large percentage of your features seems differential (the modes on the left and right combined are much bigger than that in the middle). This is probably good news for a biological point of view, because it means there is a large difference between the two conditions. But for the empirical correction to work, we need the assumption that the mid 50% of the z-scores are truly from genes that are non-DE. This is not the case. If the empirical correction would have worked, we'd expect the blue dashed curve to nicely follow the middle mode of the test statistics, whereas now it tries to blow up its variance to capture parts of the two other modes as well.

My questions/suggestions to you:

Best regards,

Jeroen

TdzBAS commented 1 year ago

Hi @jgilis,

thanks, very interesting!

This is a simulated dataset, where i have 600 samples in total (300 in each group) and each gene was filtered to two transcripts, and I have 200 genes in total. So I guess I have 400 features right (200 genes * 2 transcripts)?

I added the parameter like above, but again the error:

image

here is the sessioninfo file: session_info.txt

Thanks Jeroen! Best, Tolga

jgilis commented 1 year ago

Could you remove the comma after forceEmpirical = TRUE?

TdzBAS commented 1 year ago

Hi @jgilis, image unfortunately the same error persists. But as you described, the empirical p-value is not useful in my case. Therefore I would move on with the raw ones :) To my understanding, they get adjusted later in the StageR object right?

jgilis commented 1 year ago

Hi @TdzBAS,

It don't see why the argument would not work if you have v1.7.3, which is indeed the case given your session info. Could you try ?satuRn::testDTU in the console and see in the help file that should appear if forceEmpirical is one of the input arguments? If its not, please scroll down in the help file and reconfirm if it's satuRn v1.7.3. there.

The empirical correction and stagewise testing are trying to achieve something different. The empirical correction will, in stead of computing p-values based on the test statistics and their expected theoretical distribution, compute p-values based on the test statistics and an empirically observed distribution.

The stagewise testing will try to gain statistical power by alleviating the strength of the FDR multiple testing correction in a theoretically valid way. In the first testing stage, it will aggregate evidence across transcripts within the same gene. The test will be: is there any or more transcripts within my gene that are changing between conditions of interest. Then, only those genes are retained for the second stage, which will test for each gene which of its transcripts are undergoing a change between conditions. The FDR correction will be less stringent than if we would just do an FDR correction on all the raw p-values, because there are fewer tests (only for the transcripts that passed stage 1).

Now, one last note: If you only have two transcripts per gene, the stagewise testing transcripts p-values of transcripts that pass the first stage will actually all be zero if you use the "dtu" correction method (which is what you should do). If there are only two transcripts in a gene, we know that if we find one of the transcripts to be significantly changing in usage, the other must be, too, so it must also be significant. For instance if you look at the p-values of two transcripts within the same gene, they will be the same. This is because if the usage of tx1 goes up by 20% between condition A and B, the usage of tx2 must go down by 20% by definition. This is discussed in the Methods section of our paper, if you are interested in reading the details. In general, you can consider these features as significant. So if you have data where each of your genes only has 2 genes, which is not really realistic, you could use stagewise testing to see which are significant, but the significant ones all will have p-value = 0. So for prioritization of those significant features you would again have to look at their ranking based on the original p-values.

The empirical correction I would only use if you have more features and only a reasonable % of them (0%-20%-ish) are expected to be DE. The stagewise testing I would only recommend if you have a realistic distribution of the number of transcripts per gene. As scuh, I would for your specific case recommend to just look at the raw p-values that directly come out of satuRn.

Best regards,

Jeroen

TdzBAS commented 1 year ago

Hi @jgilis ,

it is not listed as an argument:

image

I couldnt find a help file to scroll down with the version. But sessioninfo, packageversion and environment says its saturn 1.7.3 I guess there is a mistake somewhere on my side..

when i am looking at my final output (padj), then all genes are included, which should mean that all transcripts passed the stage1 test right? But how can this be, if not alle genes have DTU in my dataset?

I see your point. So are my genes only significant when the p-value is 0? What is with very small p values which are under the usual p value thresholds. Wouldnt they count as significant if they arent exactly 0?

And for my case I am interested in the p-value for the genes that show DTU. And according to the vignette I dont have raw p values for the genes, so I have to use stageR testing like in the last section? Or what do you mean with using the raw p-values directly?

Thanks Jeroen!

Best, Tolga

jgilis commented 1 year ago

HI @TdzBAS,

Best,

Jeroen

TdzBAS commented 1 year ago

Hi @jgilis ,

thanks for alle the nice insights!

just to be clear here: If I want p values for all genes, and if I set the flag onlysignificantgenes to False, would I get then adjusted p values or raw p values for all the genes in the object padj?

and if they are adjusted, are they lenient, or is the FDR corrected on all the raw p values? If it is lenient, how can I do the adjustment on all the raw p values?

thank!

best Tolga

jgilis commented 1 year ago

The getAdjustedPValues function returns FDR adjusted p-values for the screening hypothesis and stage-wise adjusted p-values for the confirmation hypothesis p-values. For features that were not significant in the screening hypothesis, the confirmation stage adjusted p-values are set to NA.

The onlysignificantgenes doesn't change anything to this, it just hides results of the features that did not pas the first screening stage.

Best,

Jeroen

jgilis commented 1 year ago

Hi @TdzBAS ,

Are your issues resolved now? If so, I will close this issue.

Best regards,

Jeroen

TdzBAS commented 1 year ago

Hi @jgilis ,

thanks a lot! the issue can but closed :)