phenopackets / phenopacket-schema

Repository for the GA4GH phenopacket schema
https://phenopacket-schema.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

htsFiles - URI and DRS? #142

Closed allisonheath closed 5 years ago

allisonheath commented 5 years ago

If phenopackets are going to reference actual data objects/files, they should likely use a URI. For a local file this could be file://, and that would make it relatively easy to use the emerging GA4GH DRS standard of drs:// for remotely located files: https://github.com/ga4gh/data-repository-service-schemas.

Also, htsFiles appears in the cancer example, but nowhere else in the docs? https://phenopackets-schema.readthedocs.io/en/latest/cancer-example.html#htsfiles

julesjacobsen commented 5 years ago

https://github.com/phenopackets/phenopacket-schema/blob/48839671d421bff3ab08b90925713124daeb60db/src/main/proto/org/phenopackets/schema/v1/core/base.proto#L371-L379

We have a URI already. Do you mean there should only be a URI?

There are htsFiles referenced in the Phenopacket and Family messages too:

https://github.com/phenopackets/phenopacket-schema/blob/48839671d421bff3ab08b90925713124daeb60db/src/main/proto/org/phenopackets/schema/v1/phenopackets.proto#L39-L40

https://github.com/phenopackets/phenopacket-schema/blob/48839671d421bff3ab08b90925713124daeb60db/src/main/proto/org/phenopackets/schema/v1/phenopackets.proto#L67-L68

mattions commented 5 years ago

I think it could be good to have three values:

1) a uri, which will also cover the file path (file://data/genomes/file1.vcf.gz)
2) a drs, to integrate with cloud streaming
3) a description.

 // A file of unspecified type. 
 message File { 
     // DRS identifier https://github.com/ga4gh/data-repository-service-schemas  
     string drs = 1; 
     // URI for the file e.g. file://data/genomes/file1.vcf.gz or https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444 
     string uri = 2; 
     // description of the file contents 
     string description = 3; 
 } 

Is it possible to make them mutually exclusive?

Because once you have a DRS identifier, you no longer really need the uri (it would only carry redundant information).

allisonheath commented 5 years ago

Ah - I didn't see the uri because the examples seem to only use path, and I didn't dig through the code :)

I don't think we should add a third drs field; my opinion is to just use the uri field, with DRS recommended when available.

I could be on the fence about whether path is supported as well - but I feel that, for an interchange standard, the assumption should often be that the resources are not local, and a URI feels a bit more ergonomic for that purpose?

julesjacobsen commented 5 years ago

@mattions but isn't a DRS URI just a type of URI? e.g.

drs://example.com/ga4gh/drs/v1/objects/{object_id}

or a file URI for a local resource:

file://data/genomes/file1.vcf.gz

or a web resource:

https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444

With just a URI field all you need to do is check that the URI is valid, then have a resolver to figure out where to look for the object.
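
As a rough illustration of that idea (a sketch only, not code from the phenopacket-schema repository), the check-then-resolve step could look something like the following in Python. The `RESOLVERS` table and the `pick_resolver` function are hypothetical names invented for this example, and the "validity" check here only tests that the string has a scheme. Note that in `file://data/genomes/file1.vcf.gz` the first segment, `data`, is parsed as the host; a purely local path would normally be written with three slashes, e.g. `file:///data/genomes/file1.vcf.gz`.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to the kind of resolver that would
# fetch the object. These names are illustrative, not part of the schema.
RESOLVERS = {
    "file": "local filesystem",
    "drs": "GA4GH DRS server",
    "http": "web",
    "https": "web",
}


def pick_resolver(uri: str) -> str:
    """Return the resolver name for a URI, or raise ValueError."""
    parsed = urlparse(uri)
    # A bare path such as "data/genomes/file1.vcf.gz" has no scheme and is
    # rejected here; a fuller validator would check much more than this.
    if not parsed.scheme:
        raise ValueError(f"not an absolute URI (no scheme): {uri!r}")
    try:
        return RESOLVERS[parsed.scheme]
    except KeyError:
        raise ValueError(f"no resolver for scheme {parsed.scheme!r}") from None


if __name__ == "__main__":
    for uri in (
        "drs://example.com/ga4gh/drs/v1/objects/{object_id}",
        "file:///data/genomes/file1.vcf.gz",
        "https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444",
    ):
        print(uri, "->", pick_resolver(uri))
```

With this shape, supporting a new location type only means adding a scheme to the table; the File message itself keeps a single uri field.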

julesjacobsen commented 5 years ago

So I think this is resolved as a non-issue? Please re-open if you disagree.