lpantano / isomiRs

analyze isomiRs from seqbuster tool
http://lpantano.github.io/isomiRs/
MIT License
8 stars 3 forks source link

input only 2~3 isomiRs / missing mismatch for 3'trim variation #17

Open kicheolkim opened 4 years ago

kicheolkim commented 4 years ago

Hi, I found 2 possible error/bug while using the package. I think first one is isomir package issue, but the second one may be related to miRTop.

input only 2~3 isomiRs

de <- data.frame(row.names=c("S400001603","S400001616","S400001617"), condition = c("cc","bb","aa"))
mirData <- IsomirDataSeqFromRawData(read_tsv("mirtop_rawData.tsv"), de)

I saw many warnings during input, is it possible that is related to this warning?

Warning messages:
1: In prop.test(value, total, pctco, alternative = "greater") :
  Chi-squared approximation may be incorrect
2: In prop.test(value, total, pctco, alternative = "greater") :
  Chi-squared approximation may be incorrect

missing mismatch for 3'trim variation

seq mir mism add t5 t3 S400001603 S400001616 S400001617
TCAGGTAGTAGGTTGTAT hsa-let-7a-5p 0 0 0 agtt 2 1 1
TGAGGTAGTAGGTTGTAA hsa-let-7a-5p 0 0 0 agtt 5 6 11
TGAGGTAGTAGGTTGTAT hsa-let-7a-5p 0 0 0 agtt 197 166 351

I used the development version for both miRTop and isomiR package which I installed last week.

Thanks!! Kicheol

svattathil commented 4 years ago

I have noticed something similar to the first issue described by @kicheolkim. I am using isoMirs version 1.16.2 and output from miraligner version 3.5, and observe that the counts I get from an IsomirDataSeq object created using IsomirDataSeqFromFiles differs from the counts I calculate using base R code. I also observe that the results differ using IsomiRs functions when I load different files at a time. I have attached some test files and some test code. Is there perhaps some filtering happening, or lines dropped during merging?

Thanks for making this package! Happy to provide more information if it would help.

rm(list=ls())
library("isomiRs")

## create file list
FILELIST <- as.list(paste0("test", 1:5, ".mirna"))

## convenience variables
own <- "own"
isom5 <- "isom5"
isom2 <- "isom2"
isom1 <- "isom1"

## use base R functions
origdats <- lapply(FILELIST, read.table, sep="\t", header=TRUE)

## use IsoMiRs package functions
covars <- data.frame(filename = as.character(FILELIST))
rownames(covars) <- paste0("test", 1:5)
obj5 <- IsomirDataSeqFromFiles(FILELIST, covars, uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)
obj2 <- IsomirDataSeqFromFiles(FILELIST[1:2], head(covars,2), uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)
obj1 <- IsomirDataSeqFromFiles(FILELIST[[1]], head(covars,1), uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)

## compare
stats <- vector(mode="list")
stats[[own]] <- data.frame(mappedtotal = sapply(origdats, function(x) { sum(x$freq) }))
stats[[isom5]]  <- data.frame(mappedtotal = apply(counts(isoCounts(obj5)), 2, sum))
stats[[isom2]] <- data.frame(mappedtotal = apply(counts(isoCounts(obj2)), 2, sum))
stats[[isom1]] <- data.frame(mappedtotal = sum(metadata(obj1)[["rawData"]]$test1)) # error when applying isoCounts --  doesn't work with just one sample?
stats

testfiles.zip

svattathil commented 4 years ago

After looking at the code I bit, I see that there is some internal filtering happening (e.g. clean_noise and remove_gt_n_changes applied to raw data), which explains the discrepancy. Would be helpful to have these filters noted in the documentation, and perhaps have an option to output the raw data without any filtering. But it is also useful to have the filters!

lpantano commented 3 years ago

Hi,

I totally missed this, it was when somehow I got turned off notification, so I apologize. It was not intentional.

I need to look a the code, it is true is due to some filtering: see this for more explanation https://github.com/lpantano/isomiRs/issues/19#issuecomment-800281458, the parameters that affect the filtering are those.

The mirtop issue is more concerning. Did you use mirtop to convert from other tool? if you could share the command you use, I can do more, or the inputs. Sorry I missed this, happy to help if still needed.

cheers

kicheolkim commented 3 years ago

Hi, I'm glad you come back. I'm using sRNAtoolbox. So, I converted mirtop from sRNAtoolbox output. Here are the command lines that I used.

mirtop gff --sps hsa --hairpin hairpin.fa --gtf hsa.gff3 --format srnabench -d --out-format gff -o ./ ./sample1 ./sample2 ./sample3

mirtop export -o ./ --sps hsa --hairpin hairpin.fa --gtf hsa.gff3 mirtop.gff

mirtop counts --gff mirtop.gff --hairpin hairpin.fa --gtf hsa.gff3 -o ./
lpantano commented 3 years ago

Thank you. I am aware that data coming from that tool can generate some issues. Can you check those sequences in the output of sRNAbench and see how they are defined? Normally Mir top will take the definition from the tool, so I am wondering how are defined from the very beginning.

Let me know, I am very intrigued. You can share the output file as well from sRNAbench if you would like.

kicheolkim commented 3 years ago

Sure, here is the information of 3 isomiRs. You can also find sRNAbench (sRNAtoolbox) output from here. Hopefully, this would be helpful to solve the problem.

sequence matureName hairpinName isoLabel sequenceVariant readCount RPMlib RPMtotal TCAGGTAGTAGGTTGTAT hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 2 1.1136043475113726 0.2590028081084455 TGAGGTAGTAGGTTGTAA hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 5 2.7840108687784317 0.6475070202711137 TGAGGTAGTAGGTTGTAT hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 197 109.69002822987021 25.511776598681884

Thank you!!

lpantano commented 3 years ago

Hi, thank you for this. Actually, I can see how these three sequences are annotated only as 3p variants with -4 as the nucleotide variation (lv3p|lv3pT|lv3p#-4). Because, mirtop, only translates outputs, it doesn't check 5p or mismatches, because are not described in the original file. We can add some warnings, but it becomes complex to validate other tools.

Let me know your thoughts. Thanks!