nf-core / smrnaseq

A small-RNA sequencing analysis pipeline
https://nf-co.re/smrnaseq
MIT License
74 stars 125 forks source link

Remove Duplicates in mirtop.tsv removes more lines than expected #174

Open DwinGrashof opened 2 years ago

DwinGrashof commented 2 years ago

Description of the bug

There is a difference between the mirtop.tsv and mirna.tsv count files. The difference between these two files comes from the R script collapse_mirtop.r where the counts for isomirs per miRNA are summed after removing UID duplicates using the code:

counts = as.data.table(df[!duplicated(df[["UID"]]),c(3, 13:ncol(df))])
mirna = counts[, lapply(.SD, sum), by = miRNA]

However, some of the duplicated UIDs are removed but I don't know why you would choose to do so. For example, in miRNA hsa-miR-7-5p I have a duplicated UID with the following stats and counts:

UID                 read                        miRNA           Variant     S1     S2     S3    S4     S5     S6
iso-24-9BOX7EZWKE   TGGAAGACTAGTGATTTTGTTGTT    hsa-miR-7-5p    NA          35385  0      0     0      0      0
iso-24-9BOX7EZWKE   TGGAAGACTAGTGATTTTGTTGTT    hsa-miR-7-5p    NA          0      22370  42567 24653  0      57649
iso-24-9BOX7EZWKE   TGGAAGACTAGTGATTTTGTTGTT    hsa-miR-7-5p    NA          0      0      0     0      62297  0

There are more examples like this for hsa-miR-7-5p, where duplicated UIDs with identical variants are removed, although they have quite some counts. This reduces the counts from 150 000 to 70 000 in mirtop and mirna files respectively. Is the removal of these duplicated UIDs on purpose, or maybe a bug where the above UID should be:

UID                 read                        miRNA           Variant   S1    S2      S3      S4      S5      S6
iso-24-9BOX7EZWKE   TGGAAGACTAGTGATTTTGTTGTT    hsa-miR-7-5p    NA        35385 22370   42567   24653   62297   57649

Command used and terminal output

N/A

Relevant files

Added are two files: mirtop.tsv <- the results from mirtop mirna.tsv <- the deduplicates results after collapse_mirtop.r mirna_mirtop_files.zip

System information

No response

lpantano commented 2 years ago

Hi @DwinGrashof,

can you check on this code and look at the table this generates to see if it seems better:

library(data.table)
input="~/Downloads/mirna_mirtop_files/mirtop.tsv"
df = read.delim(input[1], sep = "\t")

dt=as.data.table(df)
# summarize by UID and miRNA name
tmp=dt[, lapply(.SD, sum, na.rm=TRUE), by=.(UID,miRNA),.SDcols=colnames(dt)[13:ncol(dt)]]
# remove duplicates
tmp=as.data.frame(tmp)
counts = tmp[!duplicated(tmp[["UID"]]),c(2:ncol(tmp))]
counts=as.data.table(counts)
mirna = counts[, lapply(.SD, sum), by = miRNA]

I can push this into dev if this seems to address the issue and not mess up with other things :)

DwinGrashof commented 2 years ago

Hi @lpantano ,

Sorry for my late response, I've been busy during these holiday times.

I've looked through the code and it does seem to work better in most cases than the current mirna code. I do notice though, that some UIDs (iso-22-U4K7BB8Z & iso-22-FPJYUP6XO for example in my dataset).

hsa-miR-365a-3p and hsa-miR-365b-3p have the same counts with all the same UIDs in my dataset. Because they only have common UIDs all duplicated UIDs are removed, resulting in 0 counts for hsa-miR-365a-3p and only counts for hsa-miR-365b-3p.

For UID iso-22-FPJYUP6XO counts are found for both hsa-miR-107 and hsa-miR-103a-3p. Since these miRNA have both overlapping and unique UIDs, both are reported, but counts for hsa-miR-107 are much lower compared to when all its UIDs are summed.

These "problems" are however very much debatable. Retaining counts from UIDs which are in essence ambiguous, where we do not know which miRNA is actually expressed, if not both. There are of course multiple methods to handle these kinds of problems. In the new R code, the first instance of a duplicated UID is retained and the other duplicated instances are removed. This is a rather arbitrary way of handling these ambiguities. Other options could be removing all not-unique UIDs after the tmp=dt[, lapply(.SD, sum, na.rm=TRUE), by=.(UID,miRNA),.SDcols=colnames(dt)[13:ncol(dt)], line. Since this line does nicely get rid of a previous issue stated in my initial post.

This is just something to think about and I completely understand if you would rather keep it as coded in your previous post :)

Best, Dwin

atrigila commented 2 months ago

Hi @DwinGrashof and @lpantano, I think the key issue here is the following: "In the new R code, the first instance of a duplicated UID is retained and the other duplicated instances are removed. This is a rather arbitrary way of handling these ambiguities."

This is due to how the function duplicated() works in R in vectors (in our case, the UIDs). The first occurrence of each element is marked as FALSE, therefore, it is the only one that is returned.

miRNA_names <- c("hsa-let-7a", "hsa-miR-21", "hsa-miR-155", 
                 "hsa-let-7a", "hsa-miR-21", "hsa-miR-223", "hsa-miR-155")
dup_flags <- duplicated(miRNA_names)
dup_flags 

[1] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE

From the updated code from Lorena and the suggestions from Dwin, I propose to replace the line to remove duplicates for this one: filtered_tmp <- tmp[!duplicated(UID) & !duplicated(UID, fromLast = TRUE)], which will retain only the UIDs that are unique (and not the first occurence of those UIDs), which would make the final code look like this:

df <- read.table("~/R/mirtop.tsv", header = TRUE)
dt <- as.data.table(df)
# Summarize the data by UID and miRNA
tmp <- dt[, lapply(.SD, sum, na.rm = TRUE), by = .(UID, miRNA), .SDcols = colnames(dt)[13:ncol(dt)]]
# Identify and filter out non-unique UIDs
filtered_tmp <- tmp[!duplicated(UID) & !duplicated(UID, fromLast = TRUE)]
# Summarize the unique data by miRNA 
mirna <- filtered_tmp[, lapply(.SD, sum, na.rm = TRUE), by = miRNA, .SDcols = 3:ncol(filtered_tmp)]
# Write mirna table
write.table(mirna, file.path(dirname(input[1]), "mirna.tsv"), quote = FALSE, sep = "\t", row.names = FALSE)

Let me know what you think.

lpantano commented 2 months ago

We are going to remove a lot of good data if we do that. There are many sequences that mapped to the same mature but coming from different precursors. We shouldn't remove those lines.

The case of 365 family, I wouldn't be worried, since the sequence it is exactly the same, it would be nice to say that probably that miRNA, could be one or another.

For the second case, I totally understand.

It is a good point, I don't think we keep only uniques because it would remove a lot of data that will mislead the results. We need to think a better way to solve this without removing duplicates but trying to be fair to the expression of different possibles miRNA.

I would suggest to leave this for Barcelona Summit. I can be in charge to solve this issue, I have done some other strategies that are better than this and I could adapt it to the pipeline.