ucsd-ccbb / qiimp

Web application to collect metadata specifications from an experimenter and produce metadata input files with appropriate constraints

Automatically assign sample_name? #124

Open adswafford opened 6 years ago

adswafford commented 6 years ago

Gail and I have recently run into issues where misread, mislabeled, or otherwise problematic (PHI) sample_names require us to manually fix issues and then reprocess data since sample_name is our key for connecting all files together.

To avoid this, we were considering a sample_name column that is assigned automatically, e.g. MyStudy.Sample1. However, given issue #119, we would need to limit the MyStudy part in both length and content. In addition, we only allow up to 1000 samples to be entered at a time, and samples are entered by type. Each of these creates a scenario where we'd have to intelligently and silently catch, merge, and resolve sample_names to prevent duplicates. So we may be stuck with users entering the names themselves and being responsible for keeping PHI out of them.

@ackermag, what do you think?

ackermag commented 6 years ago

I believe this can be resolved (for duplicates) if LabMan is functioning. We should consider the sample naming scheme used by Jon and Luke for the EMP 500, which I believe will alleviate all of these concerns.

AmandaBirmingham commented 6 years ago

The 36 character limit for QIIME 2 was specifically chosen "to be as short as possible while still supporting version 4 UUIDs formatted with dashes" (https://docs.qiime2.org/2018.2/tutorials/metadata/), and the whole point of UUIDs (universally unique identifiers) is that "UUIDs are for practical purposes unique, without depending for their uniqueness on a central registration authority or coordination between the parties generating them, unlike most other numbering schemes." (https://en.wikipedia.org/wiki/Universally_unique_identifier). If we're now open to assigning sample name automatically, then it seems to me that UUIDs are the tool explicitly meant for the job ... but having to tack the study name onto the front of them bolloxes the 36 character limit.
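
To make the length problem concrete, here is a minimal sketch using Python's standard `uuid` module. The `MyStudy` prefix and the `.` separator are assumptions borrowed from the MyStudy.Sample1 example earlier in this thread, not a confirmed Qiita naming rule.

```python
import uuid

# Hypothetical study prefix, following the "MyStudy.Sample1" pattern above.
study_prefix = "MyStudy"

# A version 4 UUID formatted with dashes: 8-4-4-4-12 hex digits.
sample_uuid = str(uuid.uuid4())

# Tacking the study name onto the front, with "." as an assumed separator.
prefixed_name = f"{study_prefix}.{sample_uuid}"

print(len(sample_uuid))    # 36: exactly the suggested limit
print(len(prefixed_name))  # 44: 7 (prefix) + 1 (separator) + 36 (UUID)
```

Any non-empty prefix pushes a full dashed UUID past 36 characters, so the prefix and the full UUID cannot both fit within the limit.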

adswafford commented 6 years ago

@antgonza does the 36 character limit apply to QIIME 2 within Qiita, or do you have some workaround you implement? And can we get your input on the best way to address this? It would be nice to support/implement UUIDs as @AmandaBirmingham suggested, but the Qiita Study ID prefix, which I think is there to ensure unique db IDs, isn't compatible with that.

antgonza commented 6 years ago

AFAIK the limit is a suggestion and is not enforced, so we don't have anything to prevent longer names. Now, after discussing with @adswafford, perhaps a good solution would be to just use the first string (before the first -) of the UUID (or some other random string) and discard the rest. I think the possibility of collisions within a single study is pretty low, and the code could check for that.

adswafford commented 6 years ago

I agree and yes, there should be a check to reassign an ID if a duplicate is generated.
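
A minimal sketch of that idea, assuming Python's standard `uuid` module: take the first dash-delimited block of a version 4 UUID (8 hex characters) and redraw whenever the candidate is already in use. The function name `generate_sample_ids` and its `existing` and `max_attempts` parameters are hypothetical, not part of the wizard or Qiita.

```python
import uuid


def generate_sample_ids(count, existing=(), max_attempts=100):
    """Return `count` short random identifiers that do not collide.

    Each identifier is the text before the first "-" of a version 4 UUID.
    `existing` holds identifiers already assigned in the study (for example
    from an earlier batch), so new ones are checked against those as well.
    """
    used = set(existing)
    new_ids = []
    for _sample in range(count):
        for _attempt in range(max_attempts):
            candidate = str(uuid.uuid4()).split("-")[0]
            if candidate not in used:
                break  # unused identifier found
        else:
            # Every attempt collided; give up rather than loop forever.
            raise RuntimeError("could not find an unused identifier")
        used.add(candidate)
        new_ids.append(candidate)
    return new_ids


first_batch = generate_sample_ids(1000)
# A later batch for the same study is checked against the earlier one.
second_batch = generate_sample_ids(1000, existing=first_batch)
```

The first block of a version 4 UUID is 32 random bits, so for a batch of 1000 samples the chance of any collision is on the order of 1 in 10,000. That is small, but not small enough to skip the check. Passing earlier identifiers through `existing` is one way to handle samples that are entered in several batches or by type.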

adswafford commented 6 years ago

Let's make this a must-have so we can get some feedback (for better or worse) during beta. Presumably we'll need to make a different column the trigger to populate; anonymized_name will likely work, but it would need to be bumped to the top, ahead of the alphabetical ordering of the columns.
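
The column ordering could be sketched roughly as follows. This is only an illustration: `order_columns` and the example field names other than sample_name and anonymized_name are hypothetical, and the wizard's actual ordering code may look quite different.

```python
def order_columns(columns, leading=("sample_name", "anonymized_name")):
    """Put the leading columns first (in the given order), then sort the rest."""
    front = [name for name in leading if name in columns]
    rest = sorted(name for name in columns if name not in leading)
    return front + rest


# Hypothetical set of metadata fields.
ordered = order_columns(
    ["host_age", "anonymized_name", "collection_date", "sample_name"]
)
print(ordered)
# ['sample_name', 'anonymized_name', 'collection_date', 'host_age']
```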

AmandaBirmingham commented 6 years ago

This is an important issue, and I think its scope extends far beyond the metadata wizard!

As discussed above, sample name (or the "identifier", in QIIME2-speak), is embedded extensively throughout Qiita and QIIME2--as in, feature tables reference sample name, deblur results reference sample name, etc. Because of this, it appears to me that currently users really need to be able to interpret the sample names, because they need to select by them, interpret their results based on them, etc.

I think it is a wonderful idea to assign unique internal identifiers to each sample so that users can't screw up the primary key used to trace samples through these systems. Nonetheless, users still need to be able to examine their results by the identifier they care about. Until/unless Qiita (and QIIME?) enable users to link data and label samples via a user-provided identifier other than sample name, I think that assigning arbitrary sample names in the metadata wizard solves a problem in a few datasets (PHI in sample names) by creating a problem for every dataset (results hard to interpret because sample names mean nothing to the user/analyst).

I propose that a feature suggestion for arbitrarily assigned sample identifiers be submitted to the Qiita and QIIME teams. Once they indicate that these systems support arbitrary sample identifiers while still allowing users to select and label samples by the user-preferred sample identifier, the metadata wizard can be expanded (probably trivially) to generate whatever sort of arbitrary sample identifiers they specify, and to capture the user-preferred sample names in whatever alternate field they designate.