tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0

sharkfin plot too many data points and difficult to visualize #228

Open Rohit-Satyam opened 5 months ago

Rohit-Satyam commented 5 months ago

Hi, I was trying to make the sharkfin plot as you discussed in another issue. However, the shape of my plot doesn't match the one shown in the paper; it looks like the plot below. There are 35K data points. Would you recommend splitting the result dataframe by transcript ID (i.e. the ref_id column)? The data are from SARS-CoV-2.

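## Packages used below (magrittr provides the %>% pipe)
library(ggplot2)
library(magrittr)

## `file` is assumed to be the Nanocompore results table, already loaded as a data frame
## Because ggplot doesn't like NAs
df <- file[, c("ref_id", "ref_kmer", "GMM_logit_pvalue_context_2", "Logit_LOR")] %>% tidyr::drop_na()

## Use the absolute log odds ratio, then sort by p-value and effect size
df$Logit_LOR <- abs(df$Logit_LOR)
df <- df[order(df$GMM_logit_pvalue_context_2, df$Logit_LOR), ]

## Flag sites passing both thresholds (done before the p-values are transformed)
df$color <- ifelse(df$GMM_logit_pvalue_context_2 < 0.05 & df$Logit_LOR > 0.5, "Significant", "Not-significant")

## Plot -log10(p-value) against the absolute log odds ratio
df$GMM_logit_pvalue_context_2 <- -log10(df$GMM_logit_pvalue_context_2)
ggplot(df, aes(x = Logit_LOR, y = GMM_logit_pvalue_context_2, color = color)) +
  geom_point() + theme_minimal() +
  xlab("Logistic regression log odds ratio (absolute)") +
  ylab("Nanocompore p-value (-log10)")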

[Image: sharkfin plot produced by the code above]
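The per-transcript split raised in the question could be tried roughly as follows. This is only a sketch on made-up toy data: the transcript IDs `tx_A` and `tx_B` and the helper columns `abs_lor` and `neg_log10_p` are hypothetical names introduced for illustration, and the thresholds are the ones used in the code above. The plotting part is guarded so the split itself runs even without ggplot2 installed.

```r
# Toy stand-in for the Nanocompore results table (values are made up).
set.seed(1)
df <- data.frame(
  ref_id = rep(c("tx_A", "tx_B"), each = 50),  # hypothetical transcript IDs
  GMM_logit_pvalue_context_2 = runif(100),
  Logit_LOR = rnorm(100)
)

# Derived columns (hypothetical names), same thresholds as in the post.
df$abs_lor <- abs(df$Logit_LOR)
df$neg_log10_p <- -log10(df$GMM_logit_pvalue_context_2)
df$color <- ifelse(df$GMM_logit_pvalue_context_2 < 0.05 & df$abs_lor > 0.5,
                   "Significant", "Not-significant")

# One data frame per transcript, e.g. for inspecting each separately.
per_tx <- split(df, df$ref_id)
sapply(per_tx, nrow)

# Optional: one panel per transcript instead of one crowded plot.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = abs_lor, y = neg_log10_p, color = color)) +
    geom_point(alpha = 0.5, size = 0.8) +
    facet_wrap(~ ref_id, scales = "free_y") +
    theme_minimal() +
    xlab("Absolute log odds ratio") +
    ylab("Nanocompore p-value (-log10)")
  # Call print(p) to draw the faceted plot.
}
```

With many transcripts, faceting may become crowded in turn; looping over `per_tx` and saving one plot per transcript is an alternative.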

lmulroney commented 5 months ago

It does seem strange that you have a lot of sites that have significant p-values but low absolute log odds ratios. Without knowing anything about your experimental design, it is hard to give good advice on what this might mean. Maybe start by looking through the methods and supplementary information in this paper, where we used Nanocompore on SARS-CoV-2 RNA?

https://doi.org/10.1016/j.omtn.2023.102052
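To see how widespread the pattern of significant p-values with low absolute log odds ratios is, one option is to count such sites per transcript. The sketch below uses made-up toy data and assumes the same cutoffs as the plotting code in the question (p < 0.05, |LOR| > 0.5); the class labels and transcript IDs are hypothetical names chosen for illustration.

```r
# Toy stand-in for the Nanocompore results table (values are made up).
set.seed(42)
res <- data.frame(
  ref_id = rep(c("tx_A", "tx_B"), each = 100),  # hypothetical transcript IDs
  GMM_logit_pvalue_context_2 = c(runif(100, 0, 0.1), runif(100)),
  Logit_LOR = rnorm(200, sd = 0.6)
)

# Cutoffs taken from the plotting code in the question.
p_cut <- 0.05
lor_cut <- 0.5

sig <- res$GMM_logit_pvalue_context_2 < p_cut
low_effect <- abs(res$Logit_LOR) <= lor_cut

# Classify each site (hypothetical labels).
res$class <- ifelse(sig & low_effect, "sig_low_LOR",
                    ifelse(sig, "sig_high_LOR", "not_sig"))

# Sites per transcript and class.
counts <- table(res$ref_id, res$class)
print(counts)
```

If most of the `sig_low_LOR` sites come from a few transcripts, that would be a reason to look at those transcripts separately.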