Design "case manifest" file format and import API

Is your feature request related to a problem? Please describe. We can currently only provide a PED file to describe a case. It would be very helpful to extend this, as we already store information in VarFish that cannot be encoded in a canonical PLINK PED file. This means we cannot export into the same format that we import from. It would also be nice to import more information.

Describe the solution you'd like Design a "case manifest" format that describes the relevant aspects of a case. We should reuse (community) standard data formats where possible, including:

Phenopackets 2.0 Format

Describe alternatives you've considered N/A

Additional context

Notes

Below are some notes on what we will need.

Storage

storage will be provided on top of libcloud which will allow transparent usage of local file system and S3 protocol
further, files can also be deposited behind a password-protected HTTPS location
per project, users have to configure an "upload storage" where data is uploaded to (configured with server hostname, unless local storage, and credentials); paths will be relative to this
admins have to set up an "internal" storage that only the VarFish services can access; the default/example deployment will use a MinIO S3 single-node deployment for this
the manifest points into the per-project storage where data will be read from
data is then written into the internal storage; using the project UUID as the prefix
next level of prefix is the case UUID; to accommodate local file storage, a 4-letter prefix subdirectory scheme will be used to prevent overly large directories
files stored in the internal storage will have registered meta information in the VarFish database
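
The prefix scheme above could be sketched as follows. This is only an illustration: the function name `internal_storage_key` is hypothetical, and the exact layout (project UUID, then the first four characters of the case UUID, then the full case UUID) is an assumption about how the "4-letter prefix" is meant to be applied.

```python
import posixpath
import uuid


def internal_storage_key(project_uuid: uuid.UUID, case_uuid: uuid.UUID, filename: str) -> str:
    """Build an object key below the internal storage root.

    Assumed layout (not confirmed by the notes above):
    <project UUID>/<first 4 chars of case UUID>/<case UUID>/<filename>
    """
    case = str(case_uuid)
    # The 4-character subdirectory keeps directories small on local file systems.
    return posixpath.join(str(project_uuid), case[:4], case, filename)
```

Using forward slashes via `posixpath` keeps the key usable both as an S3 object key and as a relative path on a local file system.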

Import Process

precondition: server and project correctly configured, storage set up

create case in "initial" state or set case to "updating" state
create/update manifest and file list
mark case as "import", will start the annotation process
on success, previous internal data will be archived/cleaned up

Phenopackets documentation

https://phenopacket-schema.readthedocs.io/en/latest/index.html

This includes:

pedigree: https://phenopacket-schema.readthedocs.io/en/latest/pedigree.html
- note: need to add x-xy-karyotype for gonosomal karyotype

GA4GH Pedigree standard

https://github.com/GA4GH-Pedigree-Standard/pedigree

Sketch

Phenopacket capturing most aspects. The Phenopacket authors recommend using file annotations for capturing sequencing details.

# A VarFish Case corresponds to a Phenopacket family.
family:
  # The identifier is automatically set to the SODAR UUID when created.
  id: $CASE_SODAR_UUID
  # The proband and relatives must match the definitions in pedigree
  # below.
  proband:
    id: $INDEX_NAME
    subject:
      id: $INDEX_NAME
      sex: MALE # FEMALE / OTHER_SEX / UNKNOWN_SEX
      karyotypic_sex: UNKNOWN_KARYOTYPE  # cf. https://phenopacket-schema.readthedocs.io/en/latest/karyotypicsex.html#rstkaryotypicsex
    phenotypicFeatures:
      - type:
          id: "HP:0012469"
          label: "Infantile spasms"
        excluded: false
        modifiers:
          - id: "HP:0031796"
            label: "Recurrent"
    measurements:
      # we only allow one measurement in VarFish 2.0
      - description: WGS / WES / Panel-seq of
        assay:
          id: NCIT:C158253
          label: Targeted Genome Sequencing
          # alternative #1
          #id: NCIT:C101295
          #label: Whole Exome Sequencing
          # alternative #2
          #id: NCIT:C101294
          #label: Whole Genome Sequencing
        measurement_value:
          value:
            id: NCIT:C171177
            label: Sequencing Data File
        time_observed:  # optional
          timestamp: the timestamp
    diseases:
      - term: OMIM:xxx
        excluded: false
    files: # FILES FOR PROBAND
      # Use s3:// URL with path to targets to identify the enrichment kit.
      - uri: s3://...
        individualToFileIdentifiers:
          IDENTIFIER_INDEX: identifier in file
        fileAttributes:
          genomeAssembly: GRCh38  # GRCh37
          fileFormat: vcf  # BAM etc.
          description: free-text description
        # TODO: more file examples, possibly extended attributes
    metadata: # COPY AND PASTE FROM BELOW
  relatives:
    - # ... list of phenopackets
  pedigree:
    persons:
      - familyId: FAM
        individualId: IDENTIFIER_INDEX
        paternalId: 0
        maternalId: 0
        sex: MALE
        affectedStatus: UNAFFECTED
  files:  # FILES FOR WHOLE FAMILY
    - uri: ...
  metadata:
    created: 2019-07-21T00:25:54.662Z
    createdBy: Peter R.
    resources:
      - id: hp
        name: human phenotype ontology
        url: http://purl.obolibrary.org/obo/hp.owl
        version: 2018-03-08
        namespacePrefix: HP
        iriPrefix: hp
      - id: geno
        name: Genotype Ontology
        url: http://purl.obolibrary.org/obo/geno.owl
        version: 19-03-2018
        namespacePrefix: GENO
        iriPrefix: geno
      - id: pubmed
        name: PubMed
        url: https://www.ncbi.nlm.nih.gov/pubmed/
        namespacePrefix: PMID
      - id: orphanet
        name: orphanet rare disease ontology
        url: http://www.orpha.net/
        namespacePrefix: ORPHA
        iriPrefix: orpha
      - id: omim
        name: Online Mendelian Inheritance in Man
        url: http://www.omim.org/
        namespacePrefix: OMIM
        iriPrefix: omim
      - id: ncit
        name: National Cancer Institute Thesaurus
        url: https://bioportal.bioontology.org/ontologies/NCIT/
        namespacePrefix: NCIT
        iriPrefix: ncit
    phenopacketSchemaVersion: 2.0

Design Proposal

Case model will be changed to match the Phenopacket subset that we aim to support. We will replace the CaseImportInfo record with a CaseAction model. Mass data will be stored outside of the database. This could be an S3 storage that the user has read/write and VarFish has read access to. VarFish will store any data in an internal storage, e.g., an internal S3 storage.

User Stories

Case creation
- create new CaseAction with ACTION=CREATE => STATE=DRAFT, maybe as clone of existing
- delete CaseAction in STATE=DRAFT
Case update
- create new CaseAction with ACTION=UPDATE => STATE=DRAFT, maybe as clone of existing
- delete CaseAction in STATE=DRAFT
- submit CaseAction with STATE=DRAFT => STATE=SUBMITTED
Case deletion
- create new CaseAction with ACTION=DELETE => STATE=DRAFT, maybe as clone of existing
- delete CaseAction in STATE=DRAFT
- submit CaseAction with STATE=DRAFT => STATE=SUBMITTED

States

CaseAction states:

DRAFT
SUBMITTED
RUNNING
FAILED
SUCCESS

Case states:

INITIAL
ACTIVE
IMPORTING
DELETED

Invariants

at most one CaseAction in DRAFT or ACTIVE state per case
submitting CaseAction only valid if Case state is ACTIVE
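
The states and the second invariant could be checked roughly as below. This is a sketch under assumptions: only the DRAFT => SUBMITTED transition appears in the user stories, so the rest of the `ALLOWED_TRANSITIONS` table is a guess, and `check_transition` is a hypothetical helper name. The invariant is applied exactly as written, so an action for a case that is still INITIAL would be rejected here.

```python
from enum import Enum


class CaseActionState(Enum):
    DRAFT = "DRAFT"
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    FAILED = "FAILED"
    SUCCESS = "SUCCESS"


class CaseState(Enum):
    INITIAL = "INITIAL"
    ACTIVE = "ACTIVE"
    IMPORTING = "IMPORTING"
    DELETED = "DELETED"


# Assumed transition table; only DRAFT -> SUBMITTED is named in the user stories.
ALLOWED_TRANSITIONS = {
    CaseActionState.DRAFT: {CaseActionState.SUBMITTED},
    CaseActionState.SUBMITTED: {CaseActionState.RUNNING},
    CaseActionState.RUNNING: {CaseActionState.FAILED, CaseActionState.SUCCESS},
    CaseActionState.FAILED: set(),
    CaseActionState.SUCCESS: set(),
}


def check_transition(current: CaseActionState, new: CaseActionState, case_state: CaseState) -> None:
    """Raise ValueError if the CaseAction may not move from `current` to `new`."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    # Invariant from the list above: submitting is only valid for an ACTIVE case.
    if new is CaseActionState.SUBMITTED and case_state is not CaseState.ACTIVE:
        raise ValueError(f"cannot submit while case is {case_state.name}")
```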

Case Description

We use the family top level element for phenopackets 2.0.

The following metadata entry is supported (versions can be adjusted; ids, prefixes, and URLs must stay the same). The same entry is used in all relevant places.

metadata:
  created: $CREATION
  createdBy: $CREATOR
  resources:
    - id: hp
      name: human phenotype ontology
      url: http://purl.obolibrary.org/obo/hp.owl
      version: 2018-03-08
      namespacePrefix: HP
      iriPrefix: hp
    - id: geno
      name: Genotype Ontology
      url: http://purl.obolibrary.org/obo/geno.owl
      version: 19-03-2018
      namespacePrefix: GENO
      iriPrefix: geno
    - id: pubmed
      name: PubMed
      url: https://www.ncbi.nlm.nih.gov/pubmed/
      namespacePrefix: PMID
    - id: orphanet
      name: orphanet rare disease ontology
      url: http://www.orpha.net/
      namespacePrefix: ORPHA
      iriPrefix: orpha
    - id: omim
      name: Online Mendelian Inheritance in Man
      url: http://www.omim.org/
      namespacePrefix: OMIM
      iriPrefix: omim
    - id: ncit
      name: National Cancer Institute Thesaurus
      url: https://bioportal.bioontology.org/ontologies/NCIT/
      namespacePrefix: NCIT
      iriPrefix: ncit
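
A check for the "versions may change, everything else is fixed" rule might look like the sketch below. The helper name `check_resources` is hypothetical, the input is assumed to be the already-parsed `metadata` mapping with the camelCase keys shown above, and rejecting unknown resource ids is an assumption rather than something stated here.

```python
# Fixed parts of each supported resource: (namespacePrefix, url).
# "version" is deliberately not checked because it may be adjusted.
SUPPORTED_RESOURCES = {
    "hp": ("HP", "http://purl.obolibrary.org/obo/hp.owl"),
    "geno": ("GENO", "http://purl.obolibrary.org/obo/geno.owl"),
    "pubmed": ("PMID", "https://www.ncbi.nlm.nih.gov/pubmed/"),
    "orphanet": ("ORPHA", "http://www.orpha.net/"),
    "omim": ("OMIM", "http://www.omim.org/"),
    "ncit": ("NCIT", "https://bioportal.bioontology.org/ontologies/NCIT/"),
}


def check_resources(metadata: dict) -> list:
    """Return a list of problems found in metadata["resources"]."""
    errors = []
    for res in metadata.get("resources") or []:
        expected = SUPPORTED_RESOURCES.get(res.get("id"))
        if expected is None:
            # Assumption: resources outside the list above are not accepted.
            errors.append(f"unsupported resource id: {res.get('id')!r}")
        elif (res.get("namespacePrefix"), res.get("url")) != expected:
            errors.append(f"resource {res['id']!r}: prefix or url was changed")
    return errors
```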

We support all fields of family.pedigree.persons. The family_id must be the same for all persons, and each individual_id must link back to the id and subject.id of the proband or of a relative.

family:
  pedigree:
    persons:
      - family_id: FAM
        individual_id: IDENTIFIER_INDEX
        paternal_id: 0
        maternal_id: 0
        sex: MALE
        affected_status: UNAFFECTED
      # ...
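
The two pedigree rules above (one shared family_id, every individual_id backed by a phenopacket) could be validated along these lines. This is a minimal sketch: `check_pedigree` is a hypothetical name, and the input is assumed to be the parsed `family` mapping using the snake_case keys shown above.

```python
def check_pedigree(family: dict) -> list:
    """Return a list of problems; an empty list means the pedigree is consistent."""
    errors = []
    persons = (family.get("pedigree") or {}).get("persons") or []

    # Rule 1: all persons share one family_id.
    family_ids = {p.get("family_id") for p in persons}
    if len(family_ids) > 1:
        errors.append(f"multiple family_id values: {sorted(map(str, family_ids))}")

    # Rule 2: each individual_id links to a phenopacket whose id equals subject.id.
    packets = [family.get("proband")] + list(family.get("relatives") or [])
    known_ids = set()
    for packet in packets:
        if not packet:
            continue
        if packet.get("id") != (packet.get("subject") or {}).get("id"):
            errors.append(f"id and subject.id differ for {packet.get('id')!r}")
        known_ids.add(packet.get("id"))
    for person in persons:
        if person.get("individual_id") not in known_ids:
            errors.append(f"no phenopacket for individual_id {person.get('individual_id')!r}")
    return errors
```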

The family.proband and family.relatives entries support the following Phenopackets subset (shown here for family.proband):

family:
  proband:
    id: IDENTIFIER_INDEX
    subject:
      id: IDENTIFIER_INDEX  # must match ../../id
      sex:  # all supported
      karyotypic_sex:  # all supported
    diseases:
      - term: # OMIM or Orphanet disease
        excluded: false  # or true
    phenotypic_features:
      - type:
          id:
          label:
        excluded: false  # or true
        modifiers:
          - id:
            label:
    measurements:
      # exactly one measurement is allowed
      - description: WGS  # or WES or Panel-seq
        assay:
          id: NCIT:C158253
          label: Targeted Genome Sequencing
          # # alternative #1
          # id: NCIT:C101295
          # label: Whole Exome Sequencing
          # # alternative #2
          # id: NCIT:C101294
          # label: Whole Genome Sequencing
    files:
      # MUST define path to enrichment kit unless WGS
      - uri: s3://...
        file_attributes:
          designation: enrichment_kit_targets
          genome_assembly: GRCh37  # or GRCh38 MUST be given
          file_format: BED
          description: free-text description
      # MAY contain further files for defining per-sample files
      - uri: ...
        file_attributes: {} # ...
        # use file_format: BAM/CRAM for alignment files

The family.files list can contain small and structural variant files:

family:
  files:
    - uri: s3://path/family.gatk_hc.vcf.gz
      file_attributes:
        designation: seqvars
        genome_assembly: GRCh37  # or GRCh38 MUST be given and consistent in case
        file_format: VCF
        caller: GATK-HC
    - uri: s3://path/family.delly2.vcf.gz
      file_attributes:
        designation: strucvars
        genome_assembly: GRCh37  # or GRCh38 MUST be given and consistent in case
        file_format: VCF
        caller: Delly2
    - uri: s3://path/family.manta.vcf.gz
      file_attributes:
        designation: strucvars
        genome_assembly: GRCh37  # or GRCh38 MUST be given and consistent in case
        file_format: VCF
        caller: Manta
    - uri: s3://path/family.gcnv.vcf.gz
      file_attributes:
        designation: strucvars
        genome_assembly: GRCh37  # or GRCh38 MUST be given and consistent in case
        file_format: VCF
        caller: GATK-gCNV
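
The MUST rules on files (enrichment kit targets unless WGS, genome_assembly given and consistent) might be enforced as in the sketch below. `check_files` is a hypothetical name, the input is assumed to be the parsed `family` mapping, and requiring genome_assembly on every file, including per-sample files, is an assumption.

```python
WGS_ASSAY_ID = "NCIT:C101294"  # Whole Genome Sequencing, as listed above


def _attrs(file_entry: dict) -> dict:
    """Return file_attributes, tolerating a missing or empty entry."""
    return file_entry.get("file_attributes") or {}


def check_files(family: dict) -> list:
    """Return a list of problems with the files of a family."""
    errors = []
    packets = [family.get("proband")] + list(family.get("relatives") or [])
    all_files = list(family.get("files") or [])
    for packet in packets:
        if not packet:
            continue
        files = packet.get("files") or []
        all_files.extend(files)
        assay_ids = {(m.get("assay") or {}).get("id") for m in packet.get("measurements") or []}
        has_kit = any(_attrs(f).get("designation") == "enrichment_kit_targets" for f in files)
        # Enrichment kit targets are mandatory unless the assay is WGS.
        if WGS_ASSAY_ID not in assay_ids and not has_kit:
            errors.append(f"{packet.get('id')!r}: enrichment kit targets file required")
    assemblies = [_attrs(f).get("genome_assembly") for f in all_files]
    if any(a not in ("GRCh37", "GRCh38") for a in assemblies):
        errors.append("genome_assembly must be GRCh37 or GRCh38 for every file")
    elif len(set(assemblies)) > 1:
        errors.append("genome_assembly must be consistent within the case")
    return errors
```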

New App

We introduce the new Django app case_import for the new functionality.

API Endpoints

We introduce the following new endpoints for CaseImportAction.

/case-import/api/case-import-action/list-create/<project>/[?case=<case>] - List all CaseImportAction objects in a project (optionally, only for a case). Also, allow creation of new one.
/case-import/api/case-import-action/retrieve-update/<case-import-action>/ - Retrieve CaseImportAction. Also allows updates. This allows, for example, updating the state of a DRAFT action to SUBMITTED. However, illegal state transitions are prevented.
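
For client code, the two URL patterns above could be built as in this small sketch. The helper names and the `base_url` parameter are hypothetical, and the path placeholders are assumed to be UUID strings. Request bodies and authentication are left out because they are not specified here.

```python
from typing import Optional
from urllib.parse import urlencode, urljoin

API_PREFIX = "/case-import/api/case-import-action"


def list_create_url(base_url: str, project: str, case: Optional[str] = None) -> str:
    """URL for listing/creating actions in a project, optionally filtered by case."""
    url = urljoin(base_url, f"{API_PREFIX}/list-create/{project}/")
    if case is not None:
        url += "?" + urlencode({"case": case})
    return url


def retrieve_update_url(base_url: str, action: str) -> str:
    """URL for retrieving or updating a single action."""
    return urljoin(base_url, f"{API_PREFIX}/retrieve-update/{action}/")
```

Note that `urljoin` with an absolute path replaces any path already present in `base_url`, so this sketch assumes VarFish is served from the root of the host.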

Note that the current disease and phenotype information of a case may differ from what was imported. Re-importing based only on a previous import action will override this information. Users thus have to fetch the current information from the normal case API.
