input only 2~3 isomiRs / missing mismatch for 3'trim variation

kicheolkim commented 4 years ago

Hi, I found 2 possible error/bug while using the package. I think first one is isomir package issue, but the second one may be related to miRTop.

input only 2~3 isomiRs

I was able to input my data successfully. But I realized the count was not matched with original input. For example in one miRNA, original input (rawData from miRTop) has over 100 different types of isomiRs. But I found only 3 of them in the input data. It's not cutoff since I see very low counts, but it seems that only the top 2 or 3 expressed isomirs were loaded into R dataset. Could you check this error? Let me know if you need any information or my data. Data I tested for input is here, and I used this for input dataset:

de <- data.frame(row.names=c("S400001603","S400001616","S400001617"), condition = c("cc","bb","aa"))
mirData <- IsomirDataSeqFromRawData(read_tsv("mirtop_rawData.tsv"), de)

I saw many warnings during input, is it possible that is related to this warning?

Warning messages:
1: In prop.test(value, total, pctco, alternative = "greater") :
  Chi-squared approximation may be incorrect
2: In prop.test(value, total, pctco, alternative = "greater") :
  Chi-squared approximation may be incorrect

missing mismatch for 3'trim variation

I also found an error in the rawData file exported from miRTop. If the sequence has 5' or 3' variation, mismatch (nucleotide variation) was not recorded together. Could you also check about it? Here is an example... Those sequences have different mismatches and 3' trimming together, but mism is 0 in the table.

seq	mir	t3	S400001603	S400001616	S400001617
TCAGGTAGTAGGTTGTAT	hsa-let-7a-5p	agtt	2	1	1
TGAGGTAGTAGGTTGTAA	hsa-let-7a-5p	agtt	5	6	11
TGAGGTAGTAGGTTGTAT	hsa-let-7a-5p	agtt	197	166	351

I used the development version for both miRTop and isomiR package which I installed last week.

Thanks!! Kicheol

svattathil commented 4 years ago

I have noticed something similar to the first issue described by @kicheolkim. I am using isoMirs version 1.16.2 and output from miraligner version 3.5, and observe that the counts I get from an IsomirDataSeq object created using IsomirDataSeqFromFiles differs from the counts I calculate using base R code. I also observe that the results differ using IsomiRs functions when I load different files at a time. I have attached some test files and some test code. Is there perhaps some filtering happening, or lines dropped during merging?

Thanks for making this package! Happy to provide more information if it would help.

rm(list=ls())
library("isomiRs")

## create file list
FILELIST <- as.list(paste0("test", 1:5, ".mirna"))

## convenience variables
own <- "own"
isom5 <- "isom5"
isom2 <- "isom2"
isom1 <- "isom1"

## use base R functions
origdats <- lapply(FILELIST, read.table, sep="\t", header=TRUE)

## use IsoMiRs package functions
covars <- data.frame(filename = as.character(FILELIST))
rownames(covars) <- paste0("test", 1:5)
obj5 <- IsomirDataSeqFromFiles(FILELIST, covars, uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)
obj2 <- IsomirDataSeqFromFiles(FILELIST[1:2], head(covars,2), uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)
obj1 <- IsomirDataSeqFromFiles(FILELIST[[1]], head(covars,1), uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)

## compare
stats <- vector(mode="list")
stats[[own]] <- data.frame(mappedtotal = sapply(origdats, function(x) { sum(x$freq) }))
stats[[isom5]]  <- data.frame(mappedtotal = apply(counts(isoCounts(obj5)), 2, sum))
stats[[isom2]] <- data.frame(mappedtotal = apply(counts(isoCounts(obj2)), 2, sum))
stats[[isom1]] <- data.frame(mappedtotal = sum(metadata(obj1)[["rawData"]]$test1)) # error when applying isoCounts --  doesn't work with just one sample?
stats

testfiles.zip

svattathil commented 4 years ago

After looking at the code I bit, I see that there is some internal filtering happening (e.g. clean_noise and remove_gt_n_changes applied to raw data), which explains the discrepancy. Would be helpful to have these filters noted in the documentation, and perhaps have an option to output the raw data without any filtering. But it is also useful to have the filters!

lpantano commented 3 years ago

Hi,

I totally missed this, it was when somehow I got turned off notification, so I apologize. It was not intentional.

I need to look a the code, it is true is due to some filtering: see this for more explanation https://github.com/lpantano/isomiRs/issues/19#issuecomment-800281458, the parameters that affect the filtering are those.

The mirtop issue is more concerning. Did you use mirtop to convert from other tool? if you could share the command you use, I can do more, or the inputs. Sorry I missed this, happy to help if still needed.

cheers

kicheolkim commented 3 years ago

Hi, I'm glad you come back. I'm using sRNAtoolbox. So, I converted mirtop from sRNAtoolbox output. Here are the command lines that I used.

mirtop gff --sps hsa --hairpin hairpin.fa --gtf hsa.gff3 --format srnabench -d --out-format gff -o ./ ./sample1 ./sample2 ./sample3

mirtop export -o ./ --sps hsa --hairpin hairpin.fa --gtf hsa.gff3 mirtop.gff

mirtop counts --gff mirtop.gff --hairpin hairpin.fa --gtf hsa.gff3 -o ./

lpantano commented 3 years ago

Thank you. I am aware that data coming from that tool can generate some issues. Can you check those sequences in the output of sRNAbench and see how they are defined? Normally Mir top will take the definition from the tool, so I am wondering how are defined from the very beginning.

Let me know, I am very intrigued. You can share the output file as well from sRNAbench if you would like.

kicheolkim commented 3 years ago

Sure, here is the information of 3 isomiRs. You can also find sRNAbench (sRNAtoolbox) output from here. Hopefully, this would be helpful to solve the problem.

sequence matureName hairpinName isoLabel sequenceVariant readCount RPMlib RPMtotal TCAGGTAGTAGGTTGTAT hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 2 1.1136043475113726 0.2590028081084455 TGAGGTAGTAGGTTGTAA hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 5 2.7840108687784317 0.6475070202711137 TGAGGTAGTAGGTTGTAT hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 197 109.69002822987021 25.511776598681884

Thank you!!

lpantano commented 3 years ago

Hi, thank you for this. Actually, I can see how these three sequences are annotated only as 3p variants with -4 as the nucleotide variation (lv3p|lv3pT|lv3p#-4). Because, mirtop, only translates outputs, it doesn't check 5p or mismatches, because are not described in the original file. We can add some warnings, but it becomes complex to validate other tools.

Let me know your thoughts. Thanks!

lpantano / isomiRs

input only 2~3 isomiRs / missing mismatch for 3'trim variation #17

input only 2~3 isomiRs

missing mismatch for 3'trim variation