waldronlab / curatedMetagenomicDataCuration

Sample Metadata Curation for curatedMetagenomicData
https://waldronlab.io/curatedMetagenomicDataCuration/

File of duplicate samples #14

Closed schifferl closed 1 year ago

schifferl commented 6 years ago

From @lwaldron on December 8, 2017 21:49

@paolinomanghi would you create a spreadsheet of duplicates, perhaps with columns like these: study1, sampleID1, study2, sampleID2?

The assignment of "1" and "2" would be significant in that, when the user selects an option like removeduplicates=TRUE, the samples in study 2 would by default be the ones removed. I would relegate the smaller study to study 2, unless there is a better reason to think of one of the studies as the preferable default (like better metadata or data quality).
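A minimal sketch of the removal rule described above, written in Python purely for illustration (the package itself is R, and nothing here is its actual API). The names `DuplicatePair` and `remove_duplicates`, and all study names and sample IDs, are hypothetical placeholders.

```python
from typing import NamedTuple


class DuplicatePair(NamedTuple):
    """One row of the proposed table: the 'study 2' copy is removed by default."""
    study1: str
    sample_id1: str
    study2: str
    sample_id2: str


def remove_duplicates(samples, pairs):
    """Drop every (study, sample_id) that appears on the 'study 2' side of a pair."""
    to_drop = {(p.study2, p.sample_id2) for p in pairs}
    return [s for s in samples if s not in to_drop]


# Hypothetical data: sample A_001 in StudyA is the same specimen as B_417 in StudyB.
pairs = [DuplicatePair("StudyA", "A_001", "StudyB", "B_417")]
samples = [
    ("StudyA", "A_001"),
    ("StudyA", "A_002"),
    ("StudyB", "B_417"),
    ("StudyB", "B_418"),
]
kept = remove_duplicates(samples, pairs)
```

Here only the StudyB copy of the duplicated sample is dropped; every other sample, including the StudyA copy, is kept.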

If a sample is duplicated in three studies, all three edges in that graph would have to be shown, meaning there would be three rows to identify the duplication of a single sample.
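To make the three-row case concrete, here is a small Python sketch (illustrative only, not part of the package) that expands one sample seen in several studies into all pairwise rows. The function name `pair_rows`, the study names, and the sample IDs are hypothetical; the assumption is that the copies are listed most-preferred first, so the earlier copy always lands in the study1 columns.

```python
from itertools import combinations


def pair_rows(copies):
    """Expand copies of one sample into (study1, sampleID1, study2, sampleID2) rows.

    `copies` is a list of (study, sample_id) tuples ordered by preference.
    combinations() preserves input order, so the preferred copy is always first.
    """
    return [
        (s1, id1, s2, id2)
        for (s1, id1), (s2, id2) in combinations(copies, 2)
    ]


# Hypothetical sample that appears in three studies.
rows = pair_rows([
    ("StudyA", "A_001"),
    ("StudyB", "B_417"),
    ("StudyC", "C_033"),
])
for row in rows:
    print("\t".join(row))
```

Keeping the order consistent across rows matters: if the three rows instead formed a cycle (A before B, B before C, C before A), every copy would appear in the study2 columns at least once and a "drop study 2" rule would remove all of them.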

This spreadsheet can then be made into a documented dataframe in the data/ directory (DataFrame would be the Bioconductor way, although we've already gone rogue with a tbl_df combined_metadata).

Copied from original issue: waldronlab/curatedMetagenomicData#120

schifferl commented 6 years ago

From @paolinomanghi on December 9, 2017 12:30

Yes, I'm collecting everything that's needed and organising it. I want to avoid any confusion in the future, and any difficulty in handling the whole dataset.

lwaldron commented 6 years ago

A note from issue #8, which I will close as a duplicate:

It seems that the "LiJ_2014" overlaps in some way with three other datasets: "LeChatelierE_2013", "NielsenHB_2014", "QinJ_2012".

paolinomanghi commented 6 years ago

Yes, it's just that I need to correct the metadata before compiling a good duplicate table, or at least this was the idea.

lwaldron commented 2 years ago

@paolinomanghi it would be great to include a table of known duplicates in the Bioconductor 3.15 release.

azenuser commented 1 year ago

I made the table as described in the first comment on this issue and uploaded it to curatedMetagenomicDataCuration/inst/extdata/duplicates_table.tsv

Let me know if anything else is needed; for now I'm closing this issue.