Develop and test Kids First genomic file conformance resources

liberaliscomputing commented 4 years ago

Develop and test Kids First genomic file conformance resources. Consider the following steps:

[x] Consult the following about the discussed specification
- KFDRC FHIR model mapping spreadsheet
- FHIR DocumentReference
[x] Create a new kfdrc-genomic-file profile
[x] Create any necessary Extensions for added attributes
[x] Create SearchParameters for the above Extensions
[x] Create any necessary CodeSystems or ValueSets
[x] Bind the above ValueSets to the new Extensions
[x] Create an example resource
[x] Validate the above resources with the FHIR IG publisher
[x] Test ingest and search

liberaliscomputing commented 4 years ago

@nicholasvk and I did the initial pass of GF modeling together and found the following issues:

1. Size: To map GF's size, we decided to use DocumentReference.content.attachment.size. The IG Publisher threw an error while validating an example resource: our typical GF size overflows this field's range because its data type is unsignedInt, which ranges from 0 to 2,147,483,647. To see how other institutions handle this issue, we investigated the following IGs:

- Phenopackets' HtsFile: GA4GH modeled a high-throughput sequencing file on DocumentReference but made no modifications to content.attachment. I wonder how they handle really big HTS files.
- AnVIL's drs-object: AnVIL created an extension called DRSObject, which extends Attachment. It has an attribute called size whose data type is integer. The only difference between integer and unsignedInt is that the former allows negative numbers (so it ranges from −2,147,483,648 to 2,147,483,647). Thus, it still cannot handle our genomic files.
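
To make the overflow concrete, here is a minimal sketch of the range check. The 30 GiB file size is a hypothetical example, not a figure from our data; the bounds are the ones quoted above.

```python
# Bounds of the two FHIR primitive types discussed above.
UNSIGNED_INT_MAX = 2_147_483_647
INTEGER_MIN, INTEGER_MAX = -2_147_483_648, 2_147_483_647


def fits_unsigned_int(size_bytes: int) -> bool:
    """True if the size can be stored in an unsignedInt field."""
    return 0 <= size_bytes <= UNSIGNED_INT_MAX


def fits_integer(size_bytes: int) -> bool:
    """True if the size can be stored in an integer field."""
    return INTEGER_MIN <= size_bytes <= INTEGER_MAX


# Hypothetical genomic file of 30 GiB (assumption for illustration only).
example_size = 30 * 1024**3

print(fits_unsigned_int(example_size))  # False
print(fits_integer(example_size))       # False
```

Both types top out just under 2 GiB, so switching from unsignedInt to integer does not help with large files.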

2. File format and data type: To map these fields into DocumentReference, we decided to use DocumentReference.type. There is no official extension for this in the FHIR registry. Phenopackets also uses this attribute, binding a custom ValueSet called HTS Format, which covers only a fraction of the file formats in the KF dataservice. We can import and extend it, but it doesn't cover data types. Alternatively, we can take either of the following approaches:

2.1 Consider DocumentReference.type as a mixture of file format and data type

We can create the following CodeSystem which combines file format (as code) and data type (as display):

{
  "concept":[
    {
      "code":"BAI",
      "display":"Aligned Reads Index"
    },
    {
      "code":"BAM",
      "display":"Aligned Reads"
    },
    {
      "code":"CRAI",
      "display":"Aligned Reads Index"
    },
    {
      "code":"CRAM",
      "display":"Aligned Reads"
    },
    {
      "code":"DCM",
      "display":"Radiology Images"
    },
    {
      "code":"FASTQ",
      "display":"Unaligned Reads"
    },
    {
      "code":"gVCF",
      "display":"gVCF"
    },
    {
      "code":"MAF",
      "display":"Annotated Somatic Mutations"
    },
    {
      "code":"PDF",
      "display":"Gene Fusions"
    },
    {
      "code":"PDF",
      "display":"Radiology Reports"
    },
    {
      "code":"RSEM",
      "display":"Expression"
    },
    {
      "code":"SVS",
      "display":"Histology Images"
    },
    {
      "code":"TBI",
      "display":"gVCF Index"
    },
    {
      "code":"TBI",
      "display":"Variant Calls Index"
    },
    {
      "code":"TSV",
      "display":"Somatic Copy Number Variations"
    },
    {
      "code":"TSV",
      "display":"Gene Expression"
    },
    {
      "code":"VCF",
      "display":"Annotated Somatic Mutations"
    },
    {
      "code":"VCF",
      "display":"gVCF"
    }
  ]
}

This way, we can curate both file format and data type together. However, this approach has two problems:

- Some pairs are not necessarily bound together: pairing BAM with Aligned Reads is obvious, but PDF, for example, has nothing inherently to do with Gene Fusions.
- Some codes are used repeatedly, for example, TBI, TSV, etc.

More importantly, we won't be able to pass the IG Publisher validation since each concept entry's code should be unique within a CodeSystem.
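
A quick way to spot the repeated codes before running the IG Publisher is sketched below. The `duplicate_codes` helper is hypothetical, and `draft` is only an excerpt of the CodeSystem above.

```python
from collections import Counter


def duplicate_codes(code_system: dict) -> list:
    """Return the codes that appear more than once in a CodeSystem's concepts."""
    counts = Counter(c["code"] for c in code_system.get("concept", []))
    return sorted(code for code, n in counts.items() if n > 1)


# Excerpt of the draft CodeSystem from 2.1.
draft = {
    "concept": [
        {"code": "BAM", "display": "Aligned Reads"},
        {"code": "TBI", "display": "gVCF Index"},
        {"code": "TBI", "display": "Variant Calls Index"},
        {"code": "TSV", "display": "Somatic Copy Number Variations"},
        {"code": "TSV", "display": "Gene Expression"},
    ]
}

print(duplicate_codes(draft))  # ['TBI', 'TSV']
```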

2.2 Alternative ways

1. Make two CodeSystems, one for file format and the other for data type, include both in a new ValueSet, and bind this ValueSet to DocumentReference.type. This way, we don't have to worry about the issue above. One issue, though, is that putting a file format in type may not be an intended use of this attribute. The current draft PR (#191) temporarily takes this approach.
2. Make two CodeSystems, one for file format and the other for data type. Include the data type CodeSystem in a new ValueSet and bind that ValueSet to DocumentReference.type. Then include the file format CodeSystem in another new ValueSet and bind that ValueSet to content.format.
3. Create a new extension, say, file-type, which has two sub-attributes, file-format and data-type. Make two CodeSystems and ValueSets, one for file format and the other for data type, and bind the ValueSets to the respective sub-attributes.

For any of the approaches in 2.2, we need canonical codes and displays for both file format and data type.
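
As a rough sketch of the second alternative (data type in type, file format in content.format), a resource fragment could be assembled like this. The canonical URLs, the codes, and the attachment URL are all placeholders, since the canonical codes have not been decided yet.

```python
import json

# Hypothetical canonical URLs; the real ones are not defined in this thread.
DATA_TYPE_SYSTEM = "http://example.org/fhir/CodeSystem/data-type"
FILE_FORMAT_SYSTEM = "http://example.org/fhir/CodeSystem/file-format"


def genomic_file_fragment(data_type: dict, file_format: dict, url: str) -> dict:
    """Build a DocumentReference with data type in `type` and file format in `content.format`."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        # DocumentReference.type is a CodeableConcept.
        "type": {"coding": [{"system": DATA_TYPE_SYSTEM, **data_type}]},
        "content": [
            {
                "attachment": {"url": url},
                # DocumentReference.content.format is a single Coding.
                "format": {"system": FILE_FORMAT_SYSTEM, **file_format},
            }
        ],
    }


fragment = genomic_file_fragment(
    data_type={"code": "aligned-reads", "display": "Aligned Reads"},  # placeholder code
    file_format={"code": "bam", "display": "BAM"},  # placeholder code
    url="https://example.org/files/sample.bam",  # placeholder location
)
print(json.dumps(fragment, indent=2))
```

Because each CodeSystem now holds only one kind of concept, a format such as TSV appears once, no matter how many data types use it.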

Re @allisonheath @baileyckelly

liberaliscomputing commented 4 years ago

Size: Based on @ShahimEssaid's suggestion, we will create an extension called large-size where the data type is decimal. This extension will be bound to Attachment.
File format / data type: We will move forward as illustrated in 2.2.2, creating new CodeSystems and ValueSets based on, but not limited to, the following resources:

liberaliscomputing commented 4 years ago

Re Data type / file format for kfdrc-genomic-file, during the standup on 09-16-2020, we tentatively decided:

Data type: to create a new CodeSystem and ValueSet based on NCIt (for unmappable codes, create a separate CodeSystem with them and include it in the same ValueSet).
File format: to create neither a CodeSystem nor a ValueSet until we've found an established, self-maintained ontology. We will simply put KF's existing file format enumerations into DocumentReference.content.format.display.

The following are the data types that I cannot easily map to NCIt codes:

"Aligned Reads Index"
- There is no direct mapping.
- We may put these together in DocumentReference.type.coding?:
  - "Aligned Sequence Read"
  - "Index"
"Expression"
- Is it the same as "Gene Expression"?
- If not, "Expression"
"gVCF"
- Is it the same as "Variant Call File Format"?
"gVCF Index"
- There is no direct mapping.
- If the above is correct, we may put these together in DocumentReference.type.coding?:
  - "Variant Call File Format"
  - "Index"
"Histology Images"
- There is no direct mapping.
- We may put these together in DocumentReference.type.coding?:
  - "Histology"
  - "Image"
"Simple Nucleotide Variations"
- Is it the same as "Single Nucleotide Variant"?
"Radiology Images"
- There is no direct mapping.
- We may put these together in DocumentReference.type.coding?:
  - "Radiology"
  - "Image"
"Radiology Reports"
- There is no direct mapping.
- We may put these together in DocumentReference.type.coding?:
  - "Radiology"
  - "Report"
"Variant Calls"
- Is it the same as "Single Nucleotide Polymorphism"?
"Variant Calls Index"
- There is no direct mapping.
- If the above is correct, we may put these together in DocumentReference.type.coding?:
  - "Single Nucleotide Polymorphism"
  - "Index"
"Isoform Expression"
- There is no direct mapping.
- We may put these together in DocumentReference.type.coding?:
  - "Isoform"
  - "Expression"
"Somatic Copy Number Variations"
- There is no direct mapping.
- We may put these together in DocumentReference.type.coding?:
  - "Somatic"
  - "Copy Number Polymorphism"
"Somatic Structural Variations"
- There is no direct mapping.

liberaliscomputing commented 4 years ago

Re Data type for kfdrc-genomic-file, during the call on 09-21-2020, we tentatively decided to put the unmapped enumerations above into DocumentReference.type.text without system and code.
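
Pulling the decisions in this thread together (size in a decimal large-size extension on Attachment, file format in content.format.display, and unmapped data types in type.text), an example resource might be built roughly as follows. This is only a sketch: the extension URL, the NCIt system URI, the placeholder code, and the helper names are assumptions, not values taken from the profile.

```python
import json

# Assumption: system URI used for NCIt; verify against the terminology server in use.
NCIT_SYSTEM = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl"
# Hypothetical canonical URL for the large-size extension.
LARGE_SIZE_URL = "http://example.org/fhir/StructureDefinition/large-size"


def build_type(data_type: str, ncit_codes: dict) -> dict:
    """Use an NCIt coding when a mapping exists; otherwise fall back to text only."""
    code = ncit_codes.get(data_type)
    if code is None:
        return {"text": data_type}
    return {"coding": [{"system": NCIT_SYSTEM, "code": code}], "text": data_type}


def build_genomic_file(data_type, file_format, size_bytes, url, ncit_codes):
    """Assemble a DocumentReference following the decisions above."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "type": build_type(data_type, ncit_codes),
        "content": [
            {
                "attachment": {
                    "url": url,
                    # The size goes in the extension because it overflows attachment.size.
                    "extension": [{"url": LARGE_SIZE_URL, "valueDecimal": size_bytes}],
                },
                # No CodeSystem for file formats yet, so only display is populated.
                "format": {"display": file_format},
            }
        ],
    }


# "PLACEHOLDER" stands in for a real NCIt code, which this sketch does not supply.
mapped = build_genomic_file(
    "Aligned Reads", "bam", 30 * 1024**3,
    "https://example.org/files/sample.bam", {"Aligned Reads": "PLACEHOLDER"},
)
unmapped = build_genomic_file(
    "gVCF Index", "tbi", 1024, "https://example.org/files/sample.g.vcf.gz.tbi", {},
)
print(json.dumps(unmapped["type"]))  # {"text": "gVCF Index"}
```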

kids-first / kf-model-fhir

Develop and test Kids First genomic file conformance resources #187