NULL genes when using custom annotations

ghost commented 4 years ago

Dear All,

I am having some troubles in using psichomics. I would like to use a combined reference with events retrieved from VAST database and from SUPPA. When I process the 2 annotations with parseVastToolsAnnotation and parseSuppaAnnotation and prepare the corresponding annotations everything work fine and I get a list of events with the proper IDs, genes and coordinates. However, when I try to combine both references in a single annotation with the command prepareAnnotationFromEvents, the genes of SE, A3SS and A5SS are read as NULL, while the ALE, AFE and MXE are retained correctly. I suppose this comes from some issues in processing and combining VAST events, since the retained events are the ones retrieved only from SUPPA. I noticed that VAST uses the official gene symbol in its templates, while for SUPPA I retrieved the events from an Ensembl GTF file, keeping the Ensembl GeneID as gene name. Therefore I replaced the gene symbol with the Ensembl GeneID in the VAST template files, but I keep having the same issue...SE, A3SS and A5SS are registered with NULL genes. Also, the custom annotation appear too heavy to be loaded in the GUI version, is that normal? I have some concerns also regarding other functions of psichomics. When I run the quantifySplicing command with or without filtering for read counts I don´t see any difference in the numbers of retrieved events. Am I supposed to see an actual reduction in events listed or the ones that don´t pass the threshold are simply marked as NA? I am sorry, I found difficult to assess this second hypothesis in large datasets. Ultimately, there seem to be some issues when more than 2 samples are present, but the dataset includes more condition. In my case I have 3 conditions, say A,B and C, with 3 replicates each (9 in total) and when running quantifySplicing specifying only 2 conditions, e.g AvsB (A 3 replicates and B 3 replicates) the remaining replicates seem to be randomly assigned to the conditions being analyzed so that I get A (4 rep) vs B (5 rep) for instance, although each replicate has specified the condition it belongs to. I apologize if that was already mentioned somewhere in the tutorials, I couldn´t find it.

Thanks in advance!

nuno-agostinho commented 4 years ago

Hello @Lilly88! I am currently on vacation but I will try to see what is going on in the next week. Sorry for the inconvenience!

ghost commented 4 years ago

Hello @Lilly88! I am currently on vacation but I will try to see what is going on in the next week. Sorry for the inconvenience!

Thanks! enjoy your vacation in the meanwhile! :)

nuno-agostinho commented 4 years ago

Hey there! Back from vacation! 😃

First of all, it would be great if you could run sessionInfo() and share the output here.

Therefore I replaced the gene symbol with the Ensembl GeneID in the VAST template files, but I keep having the same issue...SE, A3SS and A5SS are registered with NULL genes.

I am not getting what you describe in psichomics 1.14.4: the genes from both VAST-TOOLS and SUPPA annotation seem to show as expected in any type of alternative splicing event after running prepareAnnotationFromEvents(). If you don't mind, could you please send me the objects returned from parseVastToolsAnnotation() and parseSuppaAnnotation() using the function saveRDS()? I just need a sample to replicate the issue you are having. Thank you.

Also, the custom annotation appear too heavy to be loaded in the GUI version, is that normal?

How big is your custom annotation? How many events? How large is the file size?

When I run the quantifySplicing command with or without filtering for read counts I don´t see any difference in the numbers of retrieved events. Am I supposed to see an actual reduction in events listed or the ones that don´t pass the threshold are simply marked as NA? I am sorry, I found difficult to assess this second hypothesis in large datasets.

Yes, the PSI values of alternative splicing events whose total counts are lower than your threshold are marked as NAs. An alternative splicing event is only discarded if the PSI values of all samples in your dataset for that event are NA. So it is expected to have less number of events when you use a lower number of samples.

Ultimately, there seem to be some issues when more than 2 samples are present, but the dataset includes more condition. In my case I have 3 conditions, say A,B and C, with 3 replicates each (9 in total) and when running quantifySplicing specifying only 2 conditions, e.g AvsB (A 3 replicates and B 3 replicates) the remaining replicates seem to be randomly assigned to the conditions being analyzed so that I get A (4 rep) vs B (5 rep) for instance, although each replicate has specified the condition it belongs to.

I am not understanding what do you mean here. quantifySplicing() does not receive information on conditions, it quantifies splicing for all samples and per alternative splicing event. Do you mean when you are performing differential splicing analysis using diffSplicing()? Could you show me a sample of code with your problem? Thank you.

I am looking forward to your reply! Cheers.

nuno-agostinho commented 4 years ago

Hey @Lilly88, any news? Could you work out your problems or do you still need help? :)

ghost commented 4 years ago

Hi Nuno!

sorry for disappearing and thanks a lot for have checked on my issue with psichomics. it turned out I had a not updated version of R so I was working with an old version of psichomics. With the psichomics 1.14.4 the genes ID are kept when I combine events from VAST-TOOLS and SUPPA.

I am not understanding what do you mean here. quantifySplicing() does not receive information on conditions, it quantifies splicing for all samples and per alternative splicing event. Do you mean when you are performing differential splicing analysis using diffSplicing()?

Yes, sorry for the confusion, I meant the diffSplicing option. It still happens that if I have data from multiple conditions and I want to analyse only 2 of them the program assign the data from the remaining conditions to the ones analyzed. Es: data <- loadLocalFiles("~/Desktop/psichomics/data") names(data) names(data[[1]]) sampleInfo <- data[[1]]$Sample metadata junctionQuant <- data[[1]]$Junction quantification geneExpr <- data[[1]]$Gene expression

annot <-readRDS("~/Desktop/psichomics/annotation.RDS") filter <- filterGeneExpr(geneExpr, minCounts=10, minTotalCounts=15) geneExprNorm <- normaliseGeneExpression(geneExpr, geneFilter=filter, method="TMM", log2transform=TRUE) psi <- quantifySplicing(annot, junctionQuant, minReads=10) types <- createGroupByAttribute("Type", sampleInfo) A <- types$A B <- types$B C <-types$C AvsB <- c("A", "B") diffSplicing <- diffAnalyses(psi, AvsB)

Each condition has 3 replicates, for 9 in total. The data of all 9 samples are in the data folder. If I run diffSplicing as indicated, trying to compare only A and B I see that the sample number is not 3 and 3 but higher, for example 4 and 5 as if the replicate of condition C get assigned to conditions A and B. This does not happen if I run diffSplicing with all 3 conditions:

AvsBvsC <- c("A", "B", "C") diffSplicing <- diffAnalyses(psi, AvsBvsC)

or if I remove the replicates of condition C from the sampleInfo, junctionQuant and geneExpr. It is not a big deal, I can do these modifications and work without problems, it is just that it would be easier for me to simply select the conditions I want to test instead of having to test them all or modify the input data. Let me know if this is possible or if I am doing something wrong.

Thanks again!!! :)

nuno-agostinho commented 4 years ago

Hey, @Lilly88!

The way you are doing (diffSplicing <- diffAnalyses(psi, AvsB)), you are assigning A and B to all odd and even samples, respectively.

The object types is a list containing the sample identifiers for each group. You just have to select from this object the groups you want to compare. For instance:

diffSplicing_allTypes <- diffAnalysis(psi, types)                   # Compare all types (same as A vs B vs C)
diffSplicing_AvsBvsC  <- diffAnalysis(psi, types[c("A", "B", "C")])
diffSplicing_AvsB     <- diffAnalysis(psi, types[c("A", "B")])
diffSplicing_AvsC     <- diffAnalysis(psi, types[c("A", "C")])

Please contact me again if you have any doubt! Cheers! :)

ghost commented 4 years ago

Hi Nuno!

Now I understand! That makes my task way easier, thanks a lot for the explanation and for the great work you have done with psichomics.

Thanks again! :)

nuno-agostinho / psichomics

NULL genes when using custom annotations #416