varfish-org / varfish-server

VarFish: comprehensive DNA variant analysis for diagnostics and research
MIT License

Design "case manifest" file format and import API #505

Open holtgrewe opened 2 years ago

holtgrewe commented 2 years ago

Is your feature request related to a problem? Please describe.

We can currently provide only a PED file to describe a case. It would be very helpful to augment this, as VarFish already stores information that cannot be encoded in a canonical PLINK PED file. This means that we cannot export into the same format that we import from. It would also be nice to import more information.

Describe the solution you'd like

Design a "case manifest" format that describes the relevant aspects of a case. We should reuse (community) standard data formats where possible, including:

Describe alternatives you've considered

N/A

Additional context

holtgrewe commented 1 year ago

Notes

Below are some notes on what we will need.

Storage

Import Process

precondition: server and project correctly configured, storage setup

holtgrewe commented 1 year ago

Phenopackets documentation

This includes:

GA4GH Pedigree standard

Sketch

A Phenopacket captures most aspects. The Phenopacket authors recommend using file annotations to capture sequencing details.

# A VarFish Case corresponds to a Phenopacket family.
family:
  # The identifier is automatically set to the SODAR UUID when created.
  id: $CASE_SODAR_UUID
  # The proband and relatives must match the definitions in pedigree
  # below.
  proband:
    id: $INDEX_NAME
    subject:
      id: $INDEX_NAME
      sex: MALE # FEMALE / OTHER_SEX / UNKNOWN_SEX
      karyotypic_sex: UNKNOWN_KARYOTYPE  # cf. https://phenopacket-schema.readthedocs.io/en/latest/karyotypicsex.html#rstkaryotypicsex
    phenotypicFeatures:
      - type:
          id: "HP:0012469"
          label: "Infantile spasms"
        excluded: false
        modifiers:
          - id: "HP:0031796"
            label: "Recurrent"
    measurements:
      # we only allow one measurement in VarFish 2.0
      - description: WGS / WES / Panel-seq of
        assay:
          id: NCIT:C158253
          label: Targeted Genome Sequencing
          # alternative #1
          #id: NCIT:C101295
          #label: Whole Exome Sequencing
          # alternative #2
          #id: NCIT:C101294
          #label: Whole Genome Sequencing
        measurement_value:
          value:
            id: NCIT:C171177
            label: Sequencing Data File
        time_observed:  # optional
          timestamp: the timestamp
    diseases:
      - term: OMIM:xxx
        excluded: false
    files: # FILES FOR PROBAND
      # Use s3:// URL with path to targets to identify the enrichment kit.
      - uri: s3://...
        individualToFileIdentifiers:
          IDENTIFIER_INDEX: identifier in file
        fileAttributes:
          genomeAssembly: GRCh38  # GRCh37
          fileFormat: vcf  # BAM etc.
          description: free-text description
        # TODO: more file examples, possibly extended attributes
    metadata: # COPY AND PASTE FROM BELOW
  relatives:
    - # ... list of phenopackets
  pedigree:
    persons:
      - familyId: FAM
        individualId: IDENTIFIER_INDEX
        paternalId: 0
        maternalId: 0
        sex: MALE
        affectedStatus: UNAFFECTED
  files:  # FILES FOR WHOLE FAMILY
    - uri: ...
  metadata:
    created: 2019-07-21T00:25:54.662Z
    createdBy: Peter R.
    resources:
      - id: hp
        name: human phenotype ontology
        url: http://purl.obolibrary.org/obo/hp.owl
        version: 2018-03-08
        namespacePrefix: HP
        iriPrefix: hp
      - id: geno
        name: Genotype Ontology
        url: http://purl.obolibrary.org/obo/geno.owl
        version: 19-03-2018
        namespacePrefix: GENO
        iriPrefix: geno
      - id: pubmed
        name: PubMed
        url: https://www.ncbi.nlm.nih.gov/pubmed/
        namespacePrefix: PMID
      - id: orphanet
        name: orphanet rare disease ontology
        url: http://www.orpha.net/
        namespacePrefix: ORPHA
        iriPrefix: orpha
      - id: omim
        name: Online Mendelian Inheritance in Man
        url: http://www.omim.org/
        namespacePrefix: OMIM
        iriPrefix: omim
      - id: ncit
        name: National Cancer Institute Thesaurus
        url: https://bioportal.bioontology.org/ontologies/NCIT/
        namespacePrefix: NCIT
        iriPrefix: ncit
    phenopacketSchemaVersion: "2.0"

holtgrewe commented 1 year ago

Design Proposal

The Case model will be changed to match the Phenopacket subset that we aim to support. We will replace the CaseImportInfo record with a CaseAction model. Mass data will be stored outside of the database. This could be an S3 storage that the user has read/write access to and VarFish has read access to. VarFish will store any data in an internal storage, e.g., an internal S3 storage.

User Stories

States

CaseAction states:

Case states:

Invariants

Case Description

We use the family top-level element of Phenopackets 2.0.

The following metadata entry is supported (versions can be adjusted; ids, prefixes, and URLs must stay the same). The same entry is used in all relevant places.

metadata:
  created: $CREATION
  createdBy: $CREATOR
  resources:
    - id: hp
      name: human phenotype ontology
      url: http://purl.obolibrary.org/obo/hp.owl
      version: 2018-03-08
      namespacePrefix: HP
      iriPrefix: hp
    - id: geno
      name: Genotype Ontology
      url: http://purl.obolibrary.org/obo/geno.owl
      version: 19-03-2018
      namespacePrefix: GENO
      iriPrefix: geno
    - id: pubmed
      name: PubMed
      url: https://www.ncbi.nlm.nih.gov/pubmed/
      namespacePrefix: PMID
    - id: orphanet
      name: orphanet rare disease ontology
      url: http://www.orpha.net/
      namespacePrefix: ORPHA
      iriPrefix: orpha
    - id: omim
      name: Online Mendelian Inheritance in Man
      url: http://www.omim.org/
      namespacePrefix: OMIM
      iriPrefix: omim
    - id: ncit
      name: National Cancer Institute Thesaurus
      url: https://bioportal.bioontology.org/ontologies/NCIT/
      namespacePrefix: NCIT
      iriPrefix: ncit
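
The rule above (versions may change; ids, prefixes, and URLs may not) could be checked roughly as follows. This is a minimal sketch, not VarFish code: the function name `check_resources`, the `SUPPORTED_RESOURCES` table, and the choice to operate on an already parsed dict are all assumptions made for illustration.

```python
# Fixed parts of each supported resource, copied from the metadata entry above.
# "name" and "version" are deliberately left out because they may be adjusted.
SUPPORTED_RESOURCES = {
    "hp": {"url": "http://purl.obolibrary.org/obo/hp.owl",
           "namespacePrefix": "HP", "iriPrefix": "hp"},
    "geno": {"url": "http://purl.obolibrary.org/obo/geno.owl",
             "namespacePrefix": "GENO", "iriPrefix": "geno"},
    "pubmed": {"url": "https://www.ncbi.nlm.nih.gov/pubmed/",
               "namespacePrefix": "PMID"},
    "orphanet": {"url": "http://www.orpha.net/",
                 "namespacePrefix": "ORPHA", "iriPrefix": "orpha"},
    "omim": {"url": "http://www.omim.org/",
             "namespacePrefix": "OMIM", "iriPrefix": "omim"},
    "ncit": {"url": "https://bioportal.bioontology.org/ontologies/NCIT/",
             "namespacePrefix": "NCIT", "iriPrefix": "ncit"},
}


def check_resources(metadata):
    """Return a list of problems; an empty list means all resources are fine.

    Hypothetical helper: only resources that are present are checked, it does
    not require that every supported resource is listed.
    """
    problems = []
    for resource in metadata.get("resources") or []:
        resource_id = resource.get("id")
        expected = SUPPORTED_RESOURCES.get(resource_id)
        if expected is None:
            problems.append(f"unsupported resource id: {resource_id!r}")
            continue
        for key, value in expected.items():
            if resource.get(key) != value:
                problems.append(f"resource {resource_id!r}: {key} must be {value!r}")
    return problems
```

Whether a manifest must list all six resources or only a subset is not stated above, so the sketch leaves that open.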

We fully support family.pedigree.persons. The family_id must be the same for all persons, and each individual_id must link back to the id and subject.id of the proband or one of the relatives.

family:
  pedigree:
    persons:
      - family_id: FAM
        individual_id: IDENTIFIER_INDEX
        paternal_id: 0
        maternal_id: 0
        sex: MALE
        affected_status: UNAFFECTED
      # ...
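
The two pedigree constraints could be verified along these lines. This is only a sketch under assumptions: `check_pedigree` is a hypothetical name, and the input is assumed to be the `family` mapping after parsing the manifest into plain Python dicts and lists.

```python
def check_pedigree(family):
    """Return a list of problems found in family.pedigree.persons.

    Hypothetical helper; `family` is the parsed `family` mapping.
    """
    problems = []
    persons = (family.get("pedigree") or {}).get("persons") or []

    # All persons must share one family_id.
    family_ids = {person.get("family_id") for person in persons}
    if len(family_ids) > 1:
        names = ", ".join(sorted(map(str, family_ids)))
        problems.append(f"more than one family_id: {names}")

    # Collect the ids of proband and relatives; id and subject.id must agree.
    members = [family.get("proband")] + list(family.get("relatives") or [])
    known_ids = set()
    for member in members:
        if not member:
            continue
        member_id = member.get("id")
        if (member.get("subject") or {}).get("id") != member_id:
            problems.append(f"id and subject.id differ for {member_id!r}")
        known_ids.add(member_id)

    # Every pedigree person must link back to the proband or a relative.
    for person in persons:
        individual_id = person.get("individual_id")
        if individual_id not in known_ids:
            problems.append(f"no proband/relative for individual_id {individual_id!r}")
    return problems
```

Returning a list of messages rather than raising on the first error lets an import action report all problems at once; that is a design choice of the sketch, not something decided above.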

The family.proband and family.relatives entries support the following Phenopackets subset (shown here for family.proband):

family:
  proband:
    id: IDENTIFIER_INDEX
    subject:
      id: IDENTIFIER_INDEX  # must match ../../id
      sex:  # all supported
      karyotypic_sex:  # all supported
    diseases:
      - term: # OMIM or Orphanet disease
        excluded: false  # or true
    phenotypic_features:
      - type:
          id:
          label:
        excluded: false  # or true
        modifiers:
          - id:
            label:
    measurements:
      # exactly one measurement is allowed
      - description: WGS  # or WES or Panel-seq
        assay:
          id: NCIT:C158253
          label: Targeted Genome Sequencing
          # # alternative #1
          # id: NCIT:C101295
          # label: Whole Exome Sequencing
          # # alternative #2
          # id: NCIT:C101294
          # label: Whole Genome Sequencing
    files:
      # MUST define path to enrichment kit unless WGS
      - uri: s3://...
        file_attributes:
          designation: enrichment_kit_targets
          genome_assembly: GRCh37  # or GRCh38 MUST be given
          file_format: BED
          description: free-text description
      # MAY contain further entries for defining per-sample files
      - uri: ...
        file_attributes: {} # ...
        # use file_format: BAM/CRAM for alignment files
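
The per-member rules in the comments above (matching ids, exactly one measurement, an enrichment kit targets file unless WGS) might be checked like this. It is a sketch under assumptions: `check_member` is a hypothetical helper, the member is a parsed dict, and WGS is recognized by the NCIT assay id rather than by the free-text description.

```python
# Assay ids taken from the subset above.
WGS_ASSAY_ID = "NCIT:C101294"  # Whole Genome Sequencing
SUPPORTED_ASSAY_IDS = {
    "NCIT:C158253",  # Targeted Genome Sequencing
    "NCIT:C101295",  # Whole Exome Sequencing
    WGS_ASSAY_ID,
}


def check_member(member):
    """Return a list of problems for one proband/relative entry (hypothetical)."""
    problems = []
    if (member.get("subject") or {}).get("id") != member.get("id"):
        problems.append("subject.id must match id")

    measurements = member.get("measurements") or []
    if len(measurements) != 1:
        problems.append("exactly one measurement is required")
        return problems  # the remaining checks need the single measurement

    assay_id = (measurements[0].get("assay") or {}).get("id")
    if assay_id not in SUPPORTED_ASSAY_IDS:
        problems.append(f"unsupported assay: {assay_id!r}")

    # A targets file is identified by its designation attribute.
    has_targets = any(
        (entry.get("file_attributes") or {}).get("designation") == "enrichment_kit_targets"
        for entry in member.get("files") or []
    )
    if assay_id != WGS_ASSAY_ID and not has_targets:
        problems.append("enrichment kit targets file is required unless WGS")
    return problems
```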

The family.files list can contain small and structural variant files:

family:
  files:
    - uri: s3://path/family.gatk_hc.vcf.gz
      file_attributes:
        designation: seqvars
        genome_assembly: GRCh37  # or GRCh38 MUST be given and consistent in case
        file_format: VCF
        caller: GATK-HC
    - uri: s3://path/family.delly2.vcf.gz
      file_attributes:
        designation: strucvars
        genome_assembly: GRCh37  # or GRCh38 MUST be given and consistent in case
        file_format: VCF
        caller: Delly2
    - uri: s3://path/family.manta.vcf.gz
      file_attributes:
        designation: strucvars
        genome_assembly: GRCh37  # or GRCh38 MUST be given and consistent in case
        file_format: VCF
        caller: Manta
    - uri: s3://path/family.gcnv.vcf.gz
      file_attributes:
        designation: strucvars
        genome_assembly: GRCh37  # or GRCh38 MUST be given and consistent in case
        file_format: VCF
        caller: GATK-gCNV
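
A rough sketch of validating such a family.files list follows. The helper name `check_family_files` is hypothetical, and restricting designation to seqvars and strucvars is an assumption drawn only from the examples above.

```python
SUPPORTED_DESIGNATIONS = {"seqvars", "strucvars"}  # assumption, from the examples
SUPPORTED_ASSEMBLIES = {"GRCh37", "GRCh38"}


def check_family_files(files):
    """Return a list of problems for the parsed family.files list (hypothetical)."""
    problems = []
    assemblies = set()
    for entry in files:
        uri = entry.get("uri")
        attributes = entry.get("file_attributes") or {}

        designation = attributes.get("designation")
        if designation not in SUPPORTED_DESIGNATIONS:
            problems.append(f"{uri}: unsupported designation {designation!r}")

        assembly = attributes.get("genome_assembly")
        if assembly not in SUPPORTED_ASSEMBLIES:
            problems.append(f"{uri}: genome_assembly must be GRCh37 or GRCh38")
        else:
            assemblies.add(assembly)

    # The assembly must be the same across all files passed in.
    if len(assemblies) > 1:
        problems.append("genome_assembly must be consistent within a case")
    return problems
```

The sketch only compares the files it is given; since consistency is required for the whole case, a real check would presumably also include the per-sample files of the proband and relatives.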

New App

We introduce the new Django app case_import for the new functionality.

API Endpoints

We introduce the following new endpoints for CaseImportAction.

Note that the current disease and phenotype information of a case may differ from what was originally imported, and re-importing the case based only on a previous import action will overwrite this information. Users thus have to fetch the current information from the normal case API.