Duplicates handling in miaTime

ChouaibB commented 1 year ago

Some datasets have replicates for the same subject/group at the same time point, as the "SilvermanAGutData" dataset does, for instance. Should current and upcoming miaTime methods detect such cases and handle them accordingly (e.g. by discarding or averaging)?

antagomir commented 1 year ago

Instead of dealing with this separately in every function, I would create a utility function that can be used to remove such duplicates, if needed. But I am not sure this would make sense in general, because the details of duplicate removal may depend heavily on the particular study.

As a first pass, perhaps an example in an appropriate place in the vignette (a subsection, perhaps?) is enough: mention that this can be a problem, then show how to deal with such situations and filter out duplicates.

Daenarys8 commented 3 weeks ago

The task here is to add an example showing how to handle duplicate entries in colData()?

data(SilvermanAGutData)
duplicated_rows <- colData(SilvermanAGutData)[duplicated(colData(SilvermanAGutData)), ]

unique_entries <- !duplicated_rows

tse <- SilvermanAGutData[, unique_indices]

antagomir commented 3 weeks ago

Did you try to run that code? It doesn't seem correct to me: duplicated_rows is a DataFrame and not a logical vector, yet you are taking its negation ("!"). And "unique_indices", used in the last row, is not defined anywhere in the code. Make sure the code examples work before pasting.

Anyways, the most relevant case could be to identify cases where a time point is duplicated for a given grouping variable (e.g. subject), such as subject A having two measurements on day 2 (or time point 42.1). These are potentially problematic cases and would sometimes need to be flagged and/or removed.

We could have a simple flagging function to detect such cases, but I am not sure this is worth a wrapper.

-> Prepare a minimal reproducible example showing how to deal with such a case, including flagging and then removal?

Daenarys8 commented 3 weeks ago

You're right, there was an oversight in the example. I was unable to find a dataset with duplicate entries for subjects and time points, so I am going to create one from an existing dataset.

# Load packages and data (hitchip1006 ships with miaTime)
library(mia)
library(miaTime)
data("hitchip1006")
tse <- hitchip1006

# Duplicate the data for the example
tse2 <- mergeSEs(tse, tse)

# Now there should be multiple entries in tse2 for "subject" and "time"
# Retain unique entries
tse <- tse2[, !duplicated(colData(tse2)[, c("subject", "time")])]

We can use duplicated() on colData to catch multiple entries across subjects and time points.

# use duplicated to identify the duplicates.

duplicated_entries <- duplicated(colData(tse2)[, c("subject", "time")])
flagged_duplicates <- tse2[, duplicated_entries]

antagomir commented 3 weeks ago

Great - though I think we are more interested in picking the non-duplicated set.

Perhaps the example could be simplified a bit into just:

# Flag duplicates (subject has multiple entries at the same time point)
flagged.duplicates <- duplicated(colData(tse2)[, c("subject", "time")])
# Pick the non-duplicated entries 
tse2 <- tse2[, !flagged.duplicates]

Hmm, sometimes other variables might also need to be considered in addition to subject and time, for instance subject + time + bodysite.
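
A minimal sketch of that, assuming a plain data.frame as a stand-in for colData (the toy values and the bodysite column are made up for illustration). Note that duplicated() marks only the second and later occurrences, so adding fromLast = TRUE is one way to flag every member of a duplicated group:

```r
# Toy sample metadata standing in for colData(tse); values are made up
meta <- data.frame(
    subject  = c("A", "A", "A", "B", "B"),
    time     = c(1, 1, 1, 1, 2),
    bodysite = c("gut", "gut", "skin", "gut", "gut")
)
keys <- meta[, c("subject", "time", "bodysite")]

# Second and later occurrences only (the default behaviour of duplicated())
later_copies <- duplicated(keys)

# Every member of a duplicated group, including the first occurrence
all_copies <- later_copies | duplicated(keys, fromLast = TRUE)
```

Here the skin sample of subject A at time 1 is not flagged, although it would be if only subject + time were checked.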

I am not sure it is easy to make a sensible wrapper that adds value for flagging or removing duplicates, so let's keep it like this.

It is said that "SilvermanAGutData" has duplicates on time + subject. If this holds, check whether there are examples with this data set in OMA and see whether it is sensible (or not) to remove duplicates as a processing step.

TuomasBorman commented 3 weeks ago

I think we should also take this into account in our methods. Requiring the user to remove data is suboptimal. At least this could be done in the divergence functions.

library(mia)

# Create dummy data
tse <- makeTSE()
assay(tse, "counts", withDimnames = FALSE) <- matrix(rnorm(nrow(tse)*ncol(tse), 0, 1), nrow = nrow(tse), ncol = ncol(tse))
colData(tse)[["group"]] <- c("A", "A", "A", "A")
colData(tse)[["timepoint"]] <- c(0, 0, 1, 1)

Give a warning if there are duplicated time points.


# Give warning
if( anyDuplicated(colData(tse)[, "timepoint"]) ){
    warning("Duplicated")
}

When reference samples are assigned, we can assign all samples from the previous time point. This means that the reference vector is longer than the number of samples (some samples have multiple reference samples), so we extend the TreeSE accordingly.

# Add reference samples
reference <- c(NA, NA, "sample1", "sample2", "sample1", "sample2")
names(reference) <- c("sample1", "sample2", "sample3", "sample3", "sample4", "sample4")
tse_mod <- tse[, names(reference)]
tse_mod[["reference"]] <- reference

We can calculate divergence just like before.

# Calculate divergence
colnames(tse_mod) <- make.unique(colnames(tse_mod))
res <- getDivergence(tse_mod, reference = "reference", method = "euclidean")
names(res) <- names(reference)

After calculating the divergences, we can summarize them by taking the mean, median, etc.


# Calculate mean divergence
res <- sapply(unique(names(reference)), function(x){
    mean(res[ names(res) == x ], na.rm = TRUE)
})
colData(tse)[["divergence"]] <- res

Daenarys8 commented 2 weeks ago

Hmm, there is no example in OMA using SilvermanAGutData. Also, the colData is not straightforward. Here are the colnames:

 [1] "SampleID"             "BarcodeSequence"      "LinkerPrimerSequence" "PrimerID"             "Project"
 [6] "DAY_ORDER"            "Vessel"               "SampleType"           "Pre_Post_Challenge"   "Normal_Noise_Sample"
[11] "Description"

The only time variable here is DAY_ORDER, which is therefore the time point. Which variable identifies the subject is unclear. Any ideas?

antagomir commented 2 weeks ago

Regarding @TuomasBorman's comments:

1) Duplicates are not necessarily about duplicated time but about duplicated combinations of group(s) and time. Time points may be the same for many subjects, so time can easily be duplicated, but this is only an issue if it is duplicated within the same subject. Therefore duplicates should be checked for colData(tse)[, c("group", "time")] rather than just colData(tse)[, c("time")].

2) Averaging over multiple baseline measurements could be useful sometimes, but I am not sure it is helpful to implement before someone has a real use case.

On @Daenarys8's comment:

3) The time field name can be arbitrary; I think this should be defined by the user.
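
To make points 1) and 3) concrete, here is a rough sketch, not a proposed API: flag_duplicates() is a hypothetical helper, and the toy data.frame stands in for colData(tse). The key columns are passed in by the user, so the time and grouping fields can have any name:

```r
# Hypothetical helper (not part of miaTime): flag samples that share the
# same values in all user-supplied key columns, e.g. group and time.
flag_duplicates <- function(col_data, cols) {
    stopifnot(all(cols %in% colnames(col_data)))
    keys <- as.data.frame(col_data[, cols, drop = FALSE])
    # Flag every member of a duplicated combination, not only later copies
    flagged <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
    if (any(flagged)) {
        warning(sum(flagged), " samples share the same ",
                paste(cols, collapse = " + "))
    }
    flagged
}

# Toy metadata; with a TreeSE one would pass colData(tse) instead
meta <- data.frame(
    group = c("A", "A", "B", "B"),
    time  = c(0, 0, 0, 1)
)
flagged   <- suppressWarnings(flag_duplicates(meta, c("group", "time")))
time_only <- suppressWarnings(flag_duplicates(meta, "time"))
```

Checking time alone also flags the third sample, although group B has only one measurement at time 0.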

In conclusion, my suggestion is either to ignore this issue for now and close it, or to add a minimal example in the miaTime vignette on how to deal with duplicates (perhaps even an example on averaging over samples, like Tuomas demonstrated). Open to other suggestions.
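
For the averaging option, one possible reading is to average the abundance columns of samples that share the same group and time. A base R sketch on a toy matrix (all names and values are assumptions; whether averaging is appropriate depends on the study and the assay):

```r
# Toy abundance matrix (features x samples) and metadata; values are made up
counts <- matrix(
    c(1, 3, 10, 20,
      5, 7, 30, 40),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("taxon1", "taxon2"), paste0("sample", 1:4))
)
meta <- data.frame(
    group = c("A", "A", "A", "B"),
    time  = c(0, 0, 1, 0)
)

# One key per group/time combination, in order of first appearance
key <- paste(meta$group, meta$time, sep = "_")
key <- factor(key, levels = unique(key))

# Average the columns that share a key; result is features x combinations
averaged <- sapply(split(seq_len(ncol(counts)), key), function(idx) {
    rowMeans(counts[, idx, drop = FALSE])
})
```

The two samples of group A at time 0 collapse into one column, while the other combinations are carried over unchanged.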

Daenarys8 commented 1 week ago

A vignette example on averaging over samples would be good. Can you suggest a dataset to use for this example, or one I could prepare that has these duplicates? @TuomasBorman

TuomasBorman commented 1 week ago

It seems that the Silverman data in miaTime has duplicated time points for each vessel.
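
A quick way to check this kind of thing, sketched on a toy data.frame rather than the real data (the values are made up; the Vessel and DAY_ORDER column names are taken from the colnames listed above, and treating Vessel as the grouping variable is an assumption):

```r
# Toy stand-in for the Silverman colData; not the real values
meta <- data.frame(
    Vessel    = c(1, 1, 1, 2, 2),
    DAY_ORDER = c(1, 1, 2, 1, 2)
)

# Number of samples per Vessel / DAY_ORDER combination
n_samples <- table(Vessel = meta$Vessel, DAY_ORDER = meta$DAY_ORDER)

# Keep only the combinations that have more than one sample
dup_combos <- as.data.frame(n_samples, stringsAsFactors = FALSE)
dup_combos <- dup_combos[dup_combos$Freq > 1, ]
```

Any row left in dup_combos is a vessel with more than one sample at the same time point.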

microbiome / miaTime

Duplicates handling in miaTime #69