Open julesjacobsen opened 3 years ago
@lindsmith can you add the other PRC members' GitHub handles to the initial comment, please?
For added context, notes from the PRC + Dev Team meeting about HTSfile can be found here
Our main, condensed summary suggestion for a rapid solution was:
But hts is both too generic and too restrictive.
(These are the condensed opinions, not my individual opinion.)
For context, this is the HtsFile:
// A file in one of the HTS formats (https://samtools.github.io/hts-specs)
message HtsFile {
    enum HtsFormat {
        UNKNOWN = 0;
        SAM = 1;
        BAM = 2;
        CRAM = 3;
        VCF = 4;
        BCF = 5;
        GVCF = 6;
        FASTQ = 7;
    }
    // URI for the file e.g. file://data/genomes/file1.vcf.gz or https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444
    string uri = 1;
    // description of the file contents
    string description = 2;
    // format of the HTS file
    HtsFormat hts_format = 3;
    // Genome assembly the contents of this file was called against. We recommend using the Genome Reference Consortium
    // nomenclature e.g. GRCh37, GRCh38
    string genome_assembly = 4;
    // A map of identifiers mapping an individual to a sample in the file. The key values must correspond to the
    // Individual::id for the individuals in the message, the values must map to the samples in the file.
    map<string, string> individual_to_sample_identifiers = 5;
}
Usage docs: https://phenopacket-schema.readthedocs.io/en/v2/htsfile.html#rsthtsfile
and links to usage in context of the top-level elements: https://phenopacket-schema.readthedocs.io/en/v2/cohort.html#hts-files https://phenopacket-schema.readthedocs.io/en/v2/family.html#hts-files https://phenopacket-schema.readthedocs.io/en/v2/phenopacket.html#hts-files
A couple of possible options:
Replace it with a generic file:
message File {
    string uri = 1;
    map<string, string> stuff = 2;
    string description = 3;
}
Wrap it in a descriptor:
message FileDescriptor {
    File file = 1;
    some = 2;
    other = 3;
    properties = 4;
}
Not sure what these other properties might be.
It would be good to have some use cases/requirements for this. The existing HtsFile, inelegant as it is, does what it needs to do, but I do not understand what is missing. In any case, I would suggest that
message DataFile {
    string uri = 1;
    map<string, string> attributes = 2;
}
is sufficient. A description can be put into the map if desired, and we do not want to encourage people to use free text more than needed.
@pnrobinson Would look o.k. to me, but I would then add example(s) for the files you see as most likely.
Options not in hts would be e.g. .gff3 or .bed (which is/becomes a GA4GH standard); or generic columnar, structured text files (log2 ratio tables...). Or .CEL for array raw data etc. Or MPEG-G...
EGA uses the SRA schema, which is also shared with ENA and NCBI. https://github.com/enasequence/schema/tree/master/src/main/resources/uk/ac/ebi/ena/sra/schema , though EGA also has an array object. The way it works is the RUN, ANALYSIS, or ARRAY describes the process which produced the file, and also stores relevant information pertaining to the file, such as assembly.
@mbaudis @avsmith @jdylan @jgoecks
We went with this:
message File {
    // URI for the file e.g. file://data/genomes/file1.vcf.gz or https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444
    string uri = 1;
    // A map of identifiers mapping an individual to a sample in the file. The key values MUST correspond to the
    // Individual::id for the individuals in the message, the values must map to the identifier(s) in the file.
    map<string, string> individual_to_file_identifiers = 2;
    // Map of attributes describing the file. For example the file format or genome assembly would be defined here. For
    // genomic data files there MUST be a 'genomeAssembly' key.
    map<string, string> file_attributes = 3;
}
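To make the move from HtsFile to File concrete, here is a rough Python sketch that treats both messages as plain dicts and shifts the old typed fields into the attribute map. Only the 'genomeAssembly' key is named in the schema comment; the 'fileFormat' and 'description' keys, the helper names and the sample identifiers are assumptions made up for illustration, not part of the schema.

```python
from typing import Any, Dict


def hts_file_to_file(hts_file: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a dict shaped like the old HtsFile into one shaped like File."""
    attributes: Dict[str, str] = {}
    # 'genomeAssembly' is the key required by the File schema comment.
    if hts_file.get("genome_assembly"):
        attributes["genomeAssembly"] = hts_file["genome_assembly"]
    # 'fileFormat' and 'description' are hypothetical key names.
    if hts_file.get("hts_format"):
        attributes["fileFormat"] = hts_file["hts_format"]
    if hts_file.get("description"):
        attributes["description"] = hts_file["description"]
    return {
        "uri": hts_file["uri"],
        "individual_to_file_identifiers": dict(
            hts_file.get("individual_to_sample_identifiers", {})
        ),
        "file_attributes": attributes,
    }


def missing_genome_assembly(file: Dict[str, Any]) -> bool:
    """Return True if the File-like dict lacks the 'genomeAssembly' key."""
    return "genomeAssembly" not in file.get("file_attributes", {})


# Illustrative input; the description and identifiers are invented.
old = {
    "uri": "file://data/genomes/file1.vcf.gz",
    "description": "example VCF",
    "hts_format": "VCF",
    "genome_assembly": "GRCh38",
    "individual_to_sample_identifiers": {"patient1": "sample1"},
}
new = hts_file_to_file(old)
print(new["file_attributes"])
```

Whether a given file counts as a "genomic data file" is left to the caller here; the check only reports whether the key is present.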
Changes made to schema: 26a18c03a33ff17c065a3ae61c56536c597a012f 9dcf2fea254613fc986e4c4c91f5c0c749a14233
+1 from my side.
The approach to revision of HtsFile might benefit from looking at DRS. On one hand DRS is a low level protocol that deals with access to files/objects. It lacks higher level context about what the files are. That is reasonable and can be better provided by Phenopackets, Data Connect and other models. On the other hand DRS has some relevance to considerations in this discussion
@ianfore can you link to where this is defined and give an example of these?
DRS is defined here
Comprehensive examples could be provided, but here's a simple one. A DRS id can take two forms: host-based, e.g. drs://nci-crdc.datacommons.io/248a050b-6253-4aa9-8c83-028f2dce5438, or as a CURIE, e.g. crdc:248a050b-6253-4aa9-8c83-028f2dce5438
Either of the above translates to a DRS API call as follows, which in this case gives two locations where the file is available, and other data including type (given as mime-type): https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/248a050b-6253-4aa9-8c83-028f2dce5438
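The host-based translation above can be sketched in a few lines of Python. This is only an illustration of the drs://host/id to https://host/ga4gh/drs/v1/objects/id pattern shown in the example; the function name is made up, and the CURIE form is deliberately not handled because it needs a prefix resolver that is outside this sketch.

```python
from urllib.parse import urlsplit


def drs_uri_to_url(drs_uri: str) -> str:
    """Turn a host-based DRS URI into the matching DRS v1 object URL."""
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        # CURIE-style ids (e.g. crdc:...) end up here: no host to call.
        raise ValueError(f"not a host-based DRS URI: {drs_uri!r}")
    object_id = parts.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object id: {drs_uri!r}")
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{object_id}"


print(drs_uri_to_url(
    "drs://nci-crdc.datacommons.io/248a050b-6253-4aa9-8c83-028f2dce5438"
))
```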
Here's a histopathology image example. The DRS id would be drs://nci-crdc.datacommons.io/06f7ed4f-95ec-4a5b-ae2a-8fe1e3d72d5f or crdc:06f7ed4f-95ec-4a5b-ae2a-8fe1e3d72d5f, which would be called via the API as https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/06f7ed4f-95ec-4a5b-ae2a-8fe1e3d72d5f
The use of mime-type is currently ambiguous as noted above - but the hook is there to be made use of.
At first glance it looks as if it would be easy to transmit DRS information using the map as given above.
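As a rough check of that idea, the sketch below copies a few fields from a DRS object record into a File-shaped dict. It assumes the record carries the DRS v1 field names self_uri, mime_type and size; the attribute keys 'mimeType' and 'size' and the function name are invented here, and the sample record uses placeholder values rather than a real API response.

```python
from typing import Any, Dict


def file_from_drs_object(drs_object: Dict[str, Any]) -> Dict[str, Any]:
    """Build a File-shaped dict from a DRS object record (illustrative)."""
    attributes: Dict[str, str] = {}
    # map<string, string> values must be strings, so convert explicitly.
    if "mime_type" in drs_object:
        attributes["mimeType"] = str(drs_object["mime_type"])  # hypothetical key
    if "size" in drs_object:
        attributes["size"] = str(drs_object["size"])  # hypothetical key
    return {
        "uri": drs_object["self_uri"],
        "individual_to_file_identifiers": {},
        "file_attributes": attributes,
    }


# Placeholder record: size and mime_type are NOT taken from a real response.
record = {
    "id": "248a050b-6253-4aa9-8c83-028f2dce5438",
    "self_uri": "drs://nci-crdc.datacommons.io/248a050b-6253-4aa9-8c83-028f2dce5438",
    "size": 1024,
    "mime_type": "application/octet-stream",
}
drs_file = file_from_drs_object(record)
print(drs_file["uri"], drs_file["file_attributes"])
```

A genomic file would still need a 'genomeAssembly' entry added, since DRS itself does not carry that context.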
@mbaudis @pnrobinson
The PRC required a more extensible replacement for HtsFile. The comments and responses quoted below are from the PRC review.
PRC initial comment:
Suggested action:
Phenopackets team response:
PRC response:
Phenopackets team response:
A meeting was had ....
... outcome: Please make something more extensible, yet with context.
So this is where we discuss and hack it out.
@mbaudis @avsmith @jdylan @jgoecks