openworm / tracker-commons

Compilation of information and code bases related to open-source trackers for C. elegans

Metadata does not contain an identity for the experiment #147

Open Ichoran opened 7 years ago

Ichoran commented 7 years ago

We have all sorts of detail about the experiment, but nothing that helps to identify the experiment. Of course we can't guarantee globally unique experiment IDs, but there is no other field that is obvious for this use. (Possibly a line in "protocol"?)

We should introduce a standard way to specify this in metadata, possibly "id" (since we already have per-worm IDs). The id should probably be a string.

cheelee commented 7 years ago

We could implement a mechanism for registering global IDs, similar to tinyurl, at the time experiments are added to a database under our control (or allow the same experiment to be assigned different IDs, since we cannot control the registration process).
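A minimal sketch of what a tinyurl-style registration could look like, assuming an in-memory registry standing in for the database. The names `ExperimentRegistry`, `register`, and `encode_base62` are hypothetical and are not part of any WCON implementation.

```python
import string

# 62 symbols: 0-9, a-z, A-Z
ALPHABET = string.digits + string.ascii_letters


def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a short base-62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class ExperimentRegistry:
    """Hypothetical registry; a real one would be backed by a database."""

    def __init__(self):
        self._ids = {}   # registration key -> short ID
        self._next = 0   # counter that is encoded into the next short ID

    def register(self, key: str) -> str:
        # Registering the same key again returns the ID it already has.
        if key not in self._ids:
            self._ids[key] = encode_base62(self._next)
            self._next += 1
        return self._ids[key]
```

IDs issued this way are unique only within the one registry, which is the limitation noted above: an experiment registered in two places would receive two different IDs.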

I do recall the WCON format was designed to allow for arbitrary snippets of movement data. Do we want to have IDs for those too?

MichaelCurrie commented 7 years ago

For uniquely identifying experiments, let's add a string "experiment_UUID" field to the "metadata" object.

https://en.wikipedia.org/wiki/Universally_unique_identifier

https://stackoverflow.com/questions/4230357/how-do-i-represent-a-guid-in-a-json-object

To the parser, we can add a function to generate a UUID if none is present. Then when the parser writes the WCON object to a file, the UUID will be saved in the metadata.
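As a rough sketch of that idea, assuming the WCON object has already been loaded into a plain Python dict: the helper names `ensure_experiment_uuid` and `dumps_wcon` are hypothetical, and only the "experiment_UUID" field name comes from the proposal above.

```python
import json
import uuid


def ensure_experiment_uuid(wcon: dict) -> dict:
    """Add a random (version 4) UUID to the metadata if none is present."""
    metadata = wcon.setdefault("metadata", {})
    if not metadata.get("experiment_UUID"):
        metadata["experiment_UUID"] = str(uuid.uuid4())
    return wcon


def dumps_wcon(wcon: dict) -> str:
    """Serialize to JSON, filling in the UUID first so it is saved."""
    return json.dumps(ensure_experiment_uuid(wcon))
```

An existing value is left untouched, so a file that is read and written again keeps the same identifier.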

Ichoran commented 7 years ago

UUIDs are complicated. Why not provide an ID field and suggest that if you put a UUID in there (generated with standard methods), you'll be guaranteed it's unique? If you put "apple" in every time, well, results may not be so good. Since it will be an optional field anyway, we can't rely upon it for anything important. And the "snippets" point is a good one too.

Having a way to write a UUID into the metadata on file creation would be great, though!

MichaelCurrie commented 7 years ago

In the financial world, there is often the need for objects to have two ids:

  1. a universally unique id, and also
  2. an id that is useful (more human-readable) for the organization but only needs to be unique within that organization.

Here we could have id for the latter and worm_universal_id for the former, and the specification could recommend that organizations either populate worm_universal_id with a UUID or leave it blank and the writer will take care of creating one.

Now since we can have multiple worms specified in the same file, and "id" is used in each track in the data to distinguish them, but is not specified in the metadata object, we'd need a lookup object in the metadata. So in the metadata we would have a "worm_universal_ids" object to specify them:

"worm_universal_ids": {
    "1": "616386ea-f500-45a5-a2ef-1fe9b08f7040",
    "2": "132b192e-4270-4b8d-99ad-8793a267f3bf"
}

Or perhaps something more elegant?
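For illustration, a writer could fill in such a lookup along these lines. This is only a sketch: it assumes the file has been loaded into a dict whose "data" entry is a list of tracks that each carry an "id", and `fill_worm_universal_ids` is a hypothetical helper name.

```python
import uuid


def fill_worm_universal_ids(wcon: dict) -> dict:
    """Give every worm id found in the data a UUID in the metadata lookup."""
    metadata = wcon.setdefault("metadata", {})
    lookup = metadata.setdefault("worm_universal_ids", {})
    for track in wcon.get("data", []):
        worm_id = str(track["id"])
        # Keep a UUID the organization already supplied; create one otherwise.
        if worm_id not in lookup:
            lookup[worm_id] = str(uuid.uuid4())
    return wcon
```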

Ichoran commented 7 years ago

If one wants a UUID for a worm's ID, why not just use the ID field (since we're making it be string-only anyway)? UUIDs are easy enough to recognize anyway, so it's not like you need a separate field to be able to tell whether there's a UUID or not.
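To illustrate the "easy enough to recognize" point, here is a small check built on Python's standard `uuid` module. The function name `looks_like_uuid` is hypothetical, and accepting only the canonical hyphenated form is a choice made for this sketch.

```python
import uuid


def looks_like_uuid(value) -> bool:
    """Return True if value is a string in canonical hyphenated UUID form."""
    try:
        # uuid.UUID also accepts braces or missing hyphens, so compare the
        # canonical rendering against the input to rule those forms out.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False
```

With this check, an ID such as "616386ea-f500-45a5-a2ef-1fe9b08f7040" is recognized as a UUID, while "apple" or "1" is treated as an ordinary ID.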

MichaelCurrie commented 7 years ago

OK. And I'm realizing that this issue is about experiment ids, not worm ids. So perhaps what we really need is:

  1. a metadata field called experiment_id, a string, and
  2. in the spec we recommend that all id and experiment_id values use UUIDs, and
  3. the WCON reader can auto-populate a UUID for the experiment_id if one is not present

Ichoran commented 7 years ago

I'm all for giving people the tools to easily stick UUIDs in, but I think even a recommendation is a little strong. We want people to put the thing that will most help them identify the experiment in the experiment ID field. In case someone doesn't know about UUIDs, it's nice to alert them. Beyond that, I'd leave it to them. Of course, if you want to maintain a database, you may want to have additional restrictions (e.g. all experiments are identified by a UUID) for what goes into a submitted WCON file.