phenopackets / phenopacket-format


Investigate approaches to patient identification #26

Open jmcmurry opened 8 years ago

jmcmurry commented 8 years ago

Should the standard make any effort to standardize the ways that patient identifiers are represented?

Eg. for a paper, is it adequate to say "patient 1"?

This is a bag of worms--vicious, slippery ones. Gahhhh ... I don't want to touch it with a 10-foot pole. But we should nevertheless at least park it as an issue for (much) later. We should first think about scenarios where machine-actionable identification of patients is important.

The situations that come to mind are the usual suspects, e.g. deduplicating the results of parallel text-mining / data integration pipelines.

We are a long way off from when that is going to be the bottleneck.

jmcmurry commented 8 years ago

Related to https://github.com/monarch-initiative/phenopacket-format/issues/23 and https://github.com/monarch-initiative/phenopacket-format/issues/22

pnrobinson commented 8 years ago

This is an extremely important issue. If phenopackets "take off", then we (or whichever database receives the most phenopackets) will be in a position to issue global patient IDs. Therefore, one strategy would be to dodge the issue for the initial publication, but to write in the discussion that any system of patient IDs is compatible with our format

id: scheme:ID

but that for the nonce we are using an informal ID scheme (e.g., PubMed ID plus string)
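A minimal sketch of how the `scheme:ID` shape above could be handled in code, assuming the scheme and the local ID are separated by the first colon. The names `PatientId` and `parse_patient_id` and the sample value are hypothetical, not part of the format.

```python
from typing import NamedTuple


class PatientId(NamedTuple):
    scheme: str    # name of the ID scheme (hypothetical example: "PMID")
    local_id: str  # identifier within that scheme


def parse_patient_id(value: str) -> PatientId:
    """Split 'scheme:ID' at the first colon; both parts must be non-empty."""
    scheme, sep, local_id = value.partition(":")
    if not sep or not scheme or not local_id:
        raise ValueError(f"expected 'scheme:ID', got {value!r}")
    return PatientId(scheme, local_id)


# Placeholder value for illustration only; not a real PubMed ID.
print(parse_patient_id("PMID:00000000_patient1"))
```

Splitting at the first colon only means the local part may itself contain colons, which keeps the parser agnostic about whatever ID system is eventually adopted.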

drseb commented 8 years ago

Agree that it could be mentioned that this is possible with the format, but not sure we should. There are so many issues, e.g. how to prevent patients being part of multiple papers without the authors noticing this.

jmcmurry commented 8 years ago

@drseb commented:

e.g., pubmed id plus string

This is fine with me, but let's at least try to delimit and represent these consistently for now. I'm partial to a hash delimiter. Not that I recommend that people actually resolve ad-hoc patient IDs, but resolving to the paper URL is perhaps better than resolving to a 404. Moreover, the hash delimiter is used to jump to "part of" a page, which is more or less what we're identifying anyway. So far, this is all we can actually do; we can't identify patients between papers.

Thus something like this: PMC4498842#patient1
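To illustrate the hash-delimited form, here is a rough sketch that splits such an ID into its paper part and patient fragment and builds a URL pointing at the paper. The base URL pattern and the function names are assumptions made for the example; nothing in this thread fixes them.

```python
from typing import Tuple

# Assumed URL pattern for PMC articles; adjust if the real resolver differs.
PMC_BASE = "https://www.ncbi.nlm.nih.gov/pmc/articles/"


def split_adhoc_id(adhoc_id: str) -> Tuple[str, str]:
    """Split 'PAPER#fragment' at the first hash; both parts must be non-empty."""
    paper, sep, fragment = adhoc_id.partition("#")
    if not sep or not paper or not fragment:
        raise ValueError(f"expected 'PAPER#fragment', got {adhoc_id!r}")
    return paper, fragment


def resolve(adhoc_id: str) -> str:
    """Build a URL whose fragment is the ad-hoc patient label."""
    paper, fragment = split_adhoc_id(adhoc_id)
    return f"{PMC_BASE}{paper}/#{fragment}"


print(resolve("PMC4498842#patient1"))
```

Because browsers ignore a fragment that matches no anchor, a URL built this way would still land on the paper rather than on a 404, which is the behaviour argued for above.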

Any objections? cc: @cmungall

I guess one concern is that using the hash-delimited form would require enclosing the ID in quotes so as not to lose the fragment as a comment in the YAML format.
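A small sketch of the quoting concern, using only the standard library. It relies on JSON strings being valid YAML double-quoted scalars; `yaml_id_line` is a hypothetical helper and assumes the key is a plain word that needs no quoting. As far as I understand the YAML spec, `#` starts a comment only when preceded by whitespace, so the unquoted form may well survive parsing, but always quoting is the defensive choice.

```python
import json


def yaml_id_line(key: str, value: str) -> str:
    """Emit 'key: "value"' with the value as a double-quoted scalar."""
    # json.dumps adds the surrounding quotes and escapes special characters.
    return f"{key}: {json.dumps(value)}"


print(yaml_id_line("id", "PMC4498842#patient1"))
# prints: id: "PMC4498842#patient1"
```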