Closed schifferl closed 1 year ago
From @paolinomanghi on December 9, 2017 12:30
Yes, I'm collecting everything that is needed and organising it. I want to avoid any confusion in the future and any difficulty in handling the whole dataset.
On Fri, Dec 8, 2017 at 10:49 PM, Levi Waldron notifications@github.com wrote:
Assigned #120 https://github.com/waldronlab/curatedMetagenomicData/issues/120 to @paolinomanghi https://github.com/paolinomanghi.
A note from issue #8, which I will close as a duplicate:
It seems that the "LiJ_2014" overlaps in some way with three other datasets: "LeChatelierE_2013", "NielsenHB_2014", "QinJ_2012".
Yes, it's just that I need to correct the metadata before compiling a good duplicate table, or at least this was the idea.
@paolinomanghi it would be great to include a table of known duplicates in the Bioconductor 3.15 release.
I made the table as described in the first comment on this issue and uploaded it to curatedMetagenomicDataCuration/inst/extdata/duplicates_table.tsv.
Let me know if anything else is needed; for now I'm closing the issue.
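For anyone who wants to sanity-check such a table outside of R, here is a minimal sketch in Python (standard library only). It assumes the file carries a header row with the four columns proposed in this issue; the actual header of duplicates_table.tsv is not shown in this thread, so `EXPECTED_COLUMNS` is an assumption. `load_duplicates` is a hypothetical helper and the example row is invented.

```python
import csv
import io

# Assumed header, taken from the layout proposed in this issue; the real
# duplicates_table.tsv may name its columns differently.
EXPECTED_COLUMNS = ["study1", "sampleID1", "study2", "sampleID2"]


def load_duplicates(handle):
    """Read a tab-separated duplicate table into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


# Stand-in for the file contents; study names and sample IDs are invented.
example = io.StringIO(
    "study1\tsampleID1\tstudy2\tsampleID2\n"
    "StudyA_2012\tA-001\tStudyB_2014\tB-101\n"
)
rows = load_duplicates(example)
```

With a real file, the same function could be called on a handle opened with `open(path, newline="")`.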
From @lwaldron on December 8, 2017 21:49
@paolinomanghi would you create a spreadsheet of duplicates, perhaps like this?
study1 sampleID1 study2 sampleID2
The assignment of "1" and "2" would be significant in that, when the user selects an option like `removeduplicates=TRUE`, the samples in study 2 would by default be the ones removed. I would relegate the smaller study to study 2, unless there is a better reason to think of one of the studies as the preferable default (like better metadata or data quality).

If a sample is duplicated in three studies, all three edges in that graph would have to be shown, meaning there would be three rows to identify the duplication of a single sample.
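To make the proposed semantics concrete, here is a rough sketch in Python (illustration only; the package is R and nothing below is its API). `remove_duplicates` and `pairwise_rows` are hypothetical helpers, and all study names and sample IDs are invented. It assumes the four-column layout above: whatever appears on the study2 side is dropped by default, and a sample shared by three studies is written as three rows, one per pair.

```python
from itertools import combinations

# Hypothetical duplicate table: one row per pair of duplicated samples.
# Column order follows the proposal: study1, sampleID1, study2, sampleID2.
DUPLICATES = [
    ("StudyA_2012", "A-001", "StudyB_2014", "B-101"),
    ("StudyA_2012", "A-002", "StudyB_2014", "B-102"),
]


def remove_duplicates(samples, duplicates):
    """Drop every (study, sample_id) that appears on the study2 side."""
    to_drop = {(study2, sample2) for _, _, study2, sample2 in duplicates}
    return [s for s in samples if s not in to_drop]


def pairwise_rows(copies):
    """Expand the copies of one sample into one table row per pair."""
    return [
        (s1, id1, s2, id2)
        for (s1, id1), (s2, id2) in combinations(copies, 2)
    ]


samples = [
    ("StudyA_2012", "A-001"),
    ("StudyA_2012", "A-002"),
    ("StudyB_2014", "B-101"),
    ("StudyB_2014", "B-102"),
    ("StudyB_2014", "B-103"),
]
kept = remove_duplicates(samples, DUPLICATES)

# One sample present in three studies, listed in order of preference.
copies = [
    ("StudyA_2012", "A-003"),
    ("StudyB_2014", "B-104"),
    ("StudyC_2013", "C-201"),
]
triple = pairwise_rows(copies)
```

One consequence of writing all three pairs: if the copies are listed in order of preference, the second and third copies each land on the study2 side of at least one row, so only the preferred copy is kept.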
This spreadsheet can then be made into a documented dataframe in the data/ directory (`DataFrame` would be the Bioconductor way, although we've already gone rogue with a `tbl_df` for `combined_metadata`).

Copied from original issue: waldronlab/curatedMetagenomicData#120