HtsFile replacement - Githubissues

julesjacobsen commented 3 years ago

@mbaudis @pnrobinson

The PRC required a more extensible replacement for HtsFile. The comments and responses quoted below are from the PRC review.

PRC initial comment:

Even among “hts_files” there are multiple possibilities. There can be raw data reads, aligned reads, interpreted reads (particularly VCF). Genotype information can have different levels of evidence. Even with genotypes the diagnoses could be ambiguous or negative w.r.t. a test. Also, this could be considered a “measurement”.

Suggested action:

Would like to see genomic or sequence information reported in the context of measurements. There needs to be some linkage between HTS and measurements, or perhaps a superclass.

Phenopackets team response:

The HtsFile is a reference to a File. This is not the same as a Measurement. This is another area where we hope that GA4GH will provide robust solutions. It is outside of the scope of the Phenopacket to specify this kind of information about sequence reads or variant calls. We do not consider a HTS to be a Measurement in the sense of the element in the Phenopackets, which is intended to model a single, well focused measurement such as "platelet count".

PRC response:

The PRC is unsatisfied with the current model, and feels it needs further revision to deal with scope concerns. They propose to generalize the element to "File" or "Associated/supported data files" to allow for greater extensibility. Documentation can then give examples of what these different files can be. The alternative would be to go the other direction and rigorously specify, but the PRC prefers a generalized file.

Phenopackets team response:

HTS files are present to facilitate genomic analysis in the context of phenotype. Unfortunately there is inadequate information in these files to state which genome assembly the file was called against, and there needs to be a mechanism to link any identifiers used in the files with those of the individuals in the phenopacket/family. These are present in the HtsFile class. Providing a generic 'File' without any context on what sort of data it encodes or how it relates to the subject of the phenopacket will be significantly more confusing and open to misinterpretation than the current HtsFile. This was addressed by the v1 PRC.

A meeting was had ....

... outcome: Please make something more extensible, yet with context.

So this is where we discuss and hack it out.

@mbaudis @avsmith @jdylan @jgoecks

julesjacobsen commented 3 years ago

@lindsmith can you add the other PRC members GitHub handles to the initial comment, please.

lindsmith commented 3 years ago

For added context, notes from the PRC + Dev Team meeting about HTSfile can be found here

mbaudis commented 3 years ago

Our main, condensed summary suggestion for a rapid solution was:

provide a wrapper for files, to make it extensible
document the use for your preferred file formats and - optionally / if you really see a need - limit them by e.g. a list of currently supported types
future use may then see a GA4GH standard for file objects

But hts is both too generic and too restrictive.

(These are the condensed opinions, not my individual opinion.)

julesjacobsen commented 3 years ago

For context, this is the HtsFile:

// A file in one of the HTS formats (https://samtools.github.io/hts-specs)
message HtsFile {

  enum HtsFormat {
    UNKNOWN = 0;
    SAM = 1;
    BAM = 2;
    CRAM = 3;
    VCF = 4;
    BCF = 5;
    GVCF = 6;
    FASTQ = 7;
  }

  // URI for the file e.g. file://data/genomes/file1.vcf.gz or https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444
  string uri = 1;
  // description of the file contents
  string description = 2;
  // format of the HTS file
  HtsFormat hts_format = 3;
  // Genome assembly the contents of this file was called against. We recommend using the Genome Reference Consortium
  // nomenclature e.g. GRCh37, GRCh38
  string genome_assembly = 4;
  // A map of identifiers mapping an individual to a sample in the file. The key values must correspond to the
  // Individual::id for the individuals in the message, the values must map to the samples in the file.
  map<string, string> individual_to_sample_identifiers = 5;
}

Usage docs: https://phenopacket-schema.readthedocs.io/en/v2/htsfile.html#rsthtsfile

and links to usage in context of the top-level elements: https://phenopacket-schema.readthedocs.io/en/v2/cohort.html#hts-files https://phenopacket-schema.readthedocs.io/en/v2/family.html#hts-files https://phenopacket-schema.readthedocs.io/en/v2/phenopacket.html#hts-files

julesjacobsen commented 3 years ago

A couple of possible options:

Replace it with a generic file:

message File {
  string uri = 1;
  map<string, string> stuff = 2;
  string description = 3;
}

Wrap it in a descriptor:

message FileDescriptor {
  File file = 1;
  some = 2;
  other = 3;
  properties = 4;
}

Not sure what these other properties might be.

pnrobinson commented 3 years ago

It would be good to have some use cases/requirements for this. The existing HtsFile, inelegant as it is, does what it needs to do, but I do not understand what is missing. In any case, I would suggest that

message DataFile {
 string uri = 1;
 map<string,string> attributes = 2;
}

is sufficient. A description can be put into the map if desired, and we do not want to encourage people to use free text more than needed.
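A minimal Python sketch of how such a DataFile might be populated, assuming a plain dataclass stands in for the generated protobuf class. The attribute keys used here ("fileFormat", "genomeAssembly", "description") and the description text are illustrative assumptions; nothing in the proposal fixes them.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DataFile:
    """Hypothetical stand-in for the proposed DataFile message."""
    uri: str
    # Free-form key/value pairs; the proposal does not define any keys.
    attributes: Dict[str, str] = field(default_factory=dict)


# All keys and values below are illustrative assumptions.
vcf = DataFile(
    uri="file://data/genomes/file1.vcf.gz",
    attributes={
        "fileFormat": "VCF",
        "genomeAssembly": "GRCh38",
        # The description lives in the map rather than in its own field.
        "description": "Exome variant calls",
    },
)
```

The trade-off is that consumers have to agree on key names out of band, which is why documented examples for the common file types matter.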

mbaudis commented 3 years ago

@pnrobinson Would look o.k. to me, but I would then add example(s) for the files you see as most likely.

Options not in hts would be e.g. .gff3 or .bed (which is/becomes a GA4GH standard); or generic columnar, structured text files (log2 ratio tables...). Or .CEL for array raw data etc. Or MPEG-G...

jdylan commented 3 years ago

EGA uses the SRA schema, which is also shared with ENA and NCBI (https://github.com/enasequence/schema/tree/master/src/main/resources/uk/ac/ebi/ena/sra/schema), though EGA also has an array object. The way it works is that the RUN, ANALYSIS, or ARRAY describes the process which produced the file, and also stores relevant information pertaining to the file, such as the assembly.

julesjacobsen commented 3 years ago

@mbaudis @avsmith @jdylan @jgoecks

We went with this:

message File {
    // URI for the file e.g. file://data/genomes/file1.vcf.gz or https://opensnp.org/data/60.23andme-exome-vcf.231?1341012444
    string uri = 1;

    // A map of identifiers mapping an individual to a sample in the file. The key values MUST correspond to the
    // Individual::id for the individuals in the message, the values must map to the identifier(s) in the file.
    map<string, string> individual_to_file_identifiers = 2;

    // Map of attributes describing the file. For example the file format or genome assembly would be defined here. For
    // genomic data files there MUST be a 'genomeAssembly' key.
    map<string, string> file_attributes = 3;
}

Changes made to schema: 26a18c03a33ff17c065a3ae61c56536c597a012f 9dcf2fea254613fc986e4c4c91f5c0c749a14233
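As a rough illustration of what the change means for existing data, here is a Python sketch that maps an HtsFile-shaped dict onto the new File shape and checks the 'genomeAssembly' requirement. The dict keys mirror the proto field names. The 'fileFormat' and 'description' attribute keys, the helper names, and the sample identifiers are assumptions made for this example; only 'genomeAssembly' is named in the schema comment.

```python
from typing import Dict, List


def hts_file_to_file(hts: Dict) -> Dict:
    """Map an HtsFile-shaped dict onto the File shape (illustrative only)."""
    attributes: Dict[str, str] = {}
    # 'genomeAssembly' is the key named in the File comment.
    if hts.get("genome_assembly"):
        attributes["genomeAssembly"] = hts["genome_assembly"]
    # 'fileFormat' and 'description' are assumed key names.
    if hts.get("hts_format"):
        attributes["fileFormat"] = hts["hts_format"]
    if hts.get("description"):
        attributes["description"] = hts["description"]
    return {
        "uri": hts["uri"],
        "individual_to_file_identifiers": dict(
            hts.get("individual_to_sample_identifiers", {})
        ),
        "file_attributes": attributes,
    }


def check_genomic_file(file: Dict) -> List[str]:
    """Return a list of problems for a file known to hold genomic data."""
    problems = []
    if not file.get("uri"):
        problems.append("missing uri")
    if "genomeAssembly" not in file.get("file_attributes", {}):
        problems.append("missing 'genomeAssembly' attribute")
    return problems


# Hypothetical example data.
hts = {
    "uri": "file://data/genomes/file1.vcf.gz",
    "description": "Exome variant calls",
    "hts_format": "VCF",
    "genome_assembly": "GRCh38",
    "individual_to_sample_identifiers": {"patient1": "NA12345"},
}
converted = hts_file_to_file(hts)
```

Whether a file counts as "genomic" is left to the caller here, since the schema comment does not say how that should be determined.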

mbaudis commented 3 years ago

+1 from my side.

ianfore commented 3 years ago

The approach to revision of HtsFile might benefit from looking at DRS. On one hand DRS is a low level protocol that deals with access to files/objects. It lacks higher level context about what the files are. That is reasonable and can be better provided by Phenopackets, Data Connect and other models. On the other hand DRS has some relevance to considerations in this discussion:

DRS provides a mechanism beyond the URI approach described above which allows file location to be maintained independently from identifying the file.
Has the generality to apply to any kind of file e.g. image files are already available via DRS
Provides a mechanism for indicating file type - though the implementation of this has been ambiguous (See this issue)

julesjacobsen commented 3 years ago

@ianfore can you link to where this is defined and give an example of these?

ianfore commented 3 years ago

DRS is defined here

Comprehensive examples could be provided, but here's a simple example. A DRS id can take two forms: host-based, e.g. drs://nci-crdc.datacommons.io/248a050b-6253-4aa9-8c83-028f2dce5438, or as a CURIE, e.g. crdc:248a050b-6253-4aa9-8c83-028f2dce5438

Either of the above translates to a DRS API call as follows, which in this case gives two locations where the file is available, plus other data including type (given as mime-type): https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/248a050b-6253-4aa9-8c83-028f2dce5438

Here's a histopathology image example. The DRS id would be drs://nci-crdc.datacommons.io/06f7ed4f-95ec-4a5b-ae2a-8fe1e3d72d5f or crdc:06f7ed4f-95ec-4a5b-ae2a-8fe1e3d72d5f, which would be called via the API as https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects/dg.4DFC/06f7ed4f-95ec-4a5b-ae2a-8fe1e3d72d5f

The use of mime-type is currently ambiguous as noted above - but the hook is there to be made use of.
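The host-based translation described above can be sketched in a few lines of Python. This is an illustration only: the function name is made up, only the host-based form is handled, and the CURIE form is rejected because resolving a prefix such as crdc: needs a registry lookup that is not shown here.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Translate a host-based DRS id into its object endpoint URL."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        # CURIE-style ids (e.g. crdc:...) land here; they need a
        # prefix registry, which this sketch does not implement.
        raise ValueError(f"not a host-based DRS id: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS id has no object id: {drs_uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


url = drs_uri_to_url(
    "drs://nci-crdc.datacommons.io/248a050b-6253-4aa9-8c83-028f2dce5438"
)
```

For the first example id this yields the same API URL quoted earlier in this comment.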

pnrobinson commented 3 years ago

At first glance it looks as if it would be easy to transmit DRS information using the map as given above.
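A hedged Python sketch of what that could look like, using a dict shaped like the File message with the DRS id as the uri. The helper name, the 'mimeType' attribute key, the MIME type value, and the individual and file identifiers are all assumptions for illustration; the schema does not define them.

```python
from typing import Dict, Optional


def file_from_drs(
    drs_id: str,
    mime_type: Optional[str] = None,
    individual_to_file_identifiers: Optional[Dict[str, str]] = None,
) -> Dict:
    """Build a File-shaped dict that points at a DRS object."""
    attributes: Dict[str, str] = {}
    if mime_type is not None:
        # 'mimeType' is an assumed key name, not defined by the schema.
        attributes["mimeType"] = mime_type
    return {
        "uri": drs_id,
        "individual_to_file_identifiers": dict(
            individual_to_file_identifiers or {}
        ),
        "file_attributes": attributes,
    }


# Hypothetical values apart from the DRS id quoted in the thread.
image_file = file_from_drs(
    "drs://nci-crdc.datacommons.io/06f7ed4f-95ec-4a5b-ae2a-8fe1e3d72d5f",
    mime_type="image/tiff",
    individual_to_file_identifiers={"patient1": "slide-01"},
)
```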

phenopackets / phenopacket-schema

HtsFile replacement #307