varfish-org / varfish-cli

VarFish REST API client (CLI + Python package)
MIT License

Implement support for new-style imports #70

Open holtgrewe opened 1 year ago

holtgrewe commented 1 year ago

Is your feature request related to a problem? Please describe. The new-style imports (based on depositing files in an external storage and registering the case as phenopackets) are currently unsupported in VarFish.

Describe the solution you'd like Implement the import.

Describe alternatives you've considered N/A

Additional context N/A

holtgrewe commented 1 year ago

Specification: Extending Client Configuration

New-style imports deposit files in external storage. We thus need to make projects known to varfish. This should be done in the ~/.varfishrc.toml file.

Here is how to create a list of projects in TOML in general:

# ...

[[projects]]
uuid = "..."

[[projects]]
uuid = "..."

This will be loaded as {'projects': [{'uuid': '...'}, {'uuid': '...'}]} in JSON/Python.

Users can configure projects with the following schema:

[[projects]]
title = "..."  # optional; user-readable project title
uuid = "..."  # SODAR project UUID
# protocol to use for import
import_data_protocol = "s3" # one of "s3" | "http" | "https" | "file"
import_data_path = "..."      # path prefix to use
import_data_port = 80         # optional; port to use for connecting on import
import_data_user = "user"     # user/S3 access key
import_data_password = "key"  # password/S3 secret key to use
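
A possible in-memory representation of one such entry, as a rough sketch only. The class name `ProjectConfig` and the helper `project_from_dict` are hypothetical and not part of varfish-cli. Treating every field not marked "optional" above as required is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Protocols listed in the schema above.
ALLOWED_PROTOCOLS = ("s3", "http", "https", "file")


@dataclass(frozen=True)
class ProjectConfig:
    """Hypothetical in-memory form of one [[projects]] entry."""

    uuid: str
    import_data_protocol: str
    import_data_path: str
    # Assumption: required, because the schema does not mark them optional.
    import_data_user: str
    import_data_password: str
    title: Optional[str] = None
    import_data_port: Optional[int] = None

    def __post_init__(self) -> None:
        if self.import_data_protocol not in ALLOWED_PROTOCOLS:
            raise ValueError(
                f"import_data_protocol must be one of {ALLOWED_PROTOCOLS}, "
                f"got {self.import_data_protocol!r}"
            )


def project_from_dict(raw: dict[str, Any]) -> ProjectConfig:
    """Build a ProjectConfig from one parsed TOML table.

    Unknown or missing keys surface as a TypeError from the dataclass.
    """
    return ProjectConfig(**raw)
```

Validating the protocol at construction time means a typo in the configuration is reported when the file is loaded, not when the import starts.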

We should let users download these settings with the following command. It should fetch the settings above from the server and append them to the projects configuration in the TOML file.

varfish-cli projects project-load-config PROJECT_UUID
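
The "append" half of that command could look roughly like the sketch below. It assumes the settings have already been fetched from the server as a flat dict; how they are fetched is not specified here and is left out. The function names are hypothetical, and the string escaping relies on JSON escapes being valid in TOML basic strings for typical values.

```python
import json
from pathlib import Path


def render_project_block(settings: dict) -> str:
    """Render one [[projects]] table as TOML text (flat values only)."""
    lines = ["[[projects]]"]
    for key, value in settings.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = str(value)
        else:
            # Assumption: JSON-style escaping is sufficient for these strings.
            rendered = json.dumps(str(value), ensure_ascii=False)
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines) + "\n"


def append_project_text(existing: str, settings: dict) -> str:
    """Return the existing TOML text with one more [[projects]] table."""
    block = render_project_block(settings)
    if not existing.strip():
        return block
    return existing.rstrip("\n") + "\n\n" + block


def append_project_to_file(config_path: Path, settings: dict) -> None:
    """Append the project settings to the TOML file, creating it if needed."""
    existing = ""
    if config_path.exists():
        existing = config_path.read_text(encoding="utf-8")
    config_path.write_text(
        append_project_text(existing, settings), encoding="utf-8"
    )
```

Appending text instead of re-serializing the whole file leaves the comments and formatting of the existing configuration untouched.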
holtgrewe commented 1 year ago

Specification: Manifest Files

This follows the phenopackets YAML format supported by VarFish Server.

General note on files:

Notes on individuals' files:

Notes on family files:

- phenopackets YAML example

```yaml
# family with only metadata field
family:
  proband:
    id: index
    subject:
      id: index
      sex: MALE
      karyotypicSex: XY
    phenotypicFeatures:
      - type:
          id: "HP:0012469"
          label: "Infantile spasms"
        excluded: false
        modifiers:
          - id: "HP:0031796"
            label: "Recurrent"
    measurements:
      - assay:
          id: NCIT:C158253
          label: Targeted Genome Sequencing
        value:
          ontologyClass:
            id: NCIT:C171177
            label: Sequencing Data File
    files:
      - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
        individualToFileIdentifiers:
          index: index-PANEL
        fileAttributes:
          checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
          designation: sequencing_targets
          genomebuild: grch38
          mimetype: text/x-bed+x-bgzip
      - uri: s3://data-for-import/example/index.bam
        individualToFileIdentifiers:
          mother: index-PANEL
        fileAttributes:
          checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
          designation: read_alignments
          genomebuild: grch38
          mimetype: text/x-bam+x-bgzip
    diseases:
      - term:
          id: OMIM:164400
          label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
        excluded: false
    metaData: &metadata-prototype
      created: "2019-07-21T00:25:54.662Z"
      createdBy: Peter R.
      resources:
        - id: hp
          name: human phenotype ontology
          url: http://purl.obolibrary.org/obo/hp.owl
          version: "2018-03-08"
          namespacePrefix: HP
          iriPrefix: hp
      phenopacketSchemaVersion: "2.0"
  relatives:
    - id: mother
      subject:
        id: mother
        sex: FEMALE
        karyotypicSex: XX
      phenotypicFeatures:
        - type:
            id: "HP:0012469"
            label: "Infantile spasms"
          excluded: true
      measurements:
        - assay:
            id: NCIT:C158253
            label: Targeted Genome Sequencing
          value:
            ontologyClass:
              id: NCIT:C171177
              label: Sequencing Data File
      files:
        - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
          individualToFileIdentifiers:
            mother: mother-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: sequencing_targets
            genomebuild: grch38
            mimetype: text/x-bed+x-bgzip
        - uri: s3://data-for-import/example/mother.bam
          individualToFileIdentifiers:
            mother: mother-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: read_alignments
            genomebuild: grch38
            mimetype: text/x-bam+x-bgzip
      diseases:
        - term:
            id: OMIM:164400
            label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
          excluded: true
      metaData: *metadata-prototype
    - id: father
      subject:
        id: father
        sex: MALE
        karyotypicSex: XY
      phenotypicFeatures:
        - type:
            id: "HP:0012469"
            label: "Infantile spasms"
          excluded: true
      measurements:
        - assay:
            id: NCIT:C158253
            label: Targeted Genome Sequencing
          value:
            ontologyClass:
              id: NCIT:C171177
              label: Sequencing Data File
      files:
        - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
          individualToFileIdentifiers:
            father: father-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: sequencing_targets
            genomebuild: grch38
            mimetype: text/x-bed+x-bgzip
        - uri: s3://data-for-import/example/father.bam
          individualToFileIdentifiers:
            father: father-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: read_alignments
            genomebuild: grch38
            mimetype: text/x-bam+x-bgzip
      diseases:
        - term:
            id: OMIM:164400
            label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
          excluded: true
      metaData: *metadata-prototype
  pedigree:
    persons:
      - familyId: Case
        individualId: index
        paternalId: father
        maternalId: mother
        sex: MALE
        affectedStatus: AFFECTED
      - familyId: Case
        individualId: father
        paternalId: "0"
        maternalId: "0"
        sex: MALE
        affectedStatus: UNAFFECTED
      - familyId: Case
        individualId: mother
        paternalId: "0"
        maternalId: "0"
        sex: FEMALE
        affectedStatus: UNAFFECTED
  files:
    - uri: file://cases_import/tests/data/sample-brca1.vcf.gz
      individualToFileIdentifiers:
        index: NA12878-PCRF450-1
      fileAttributes:
        checksum: sha256:4042c2afa59f24a327b3852bfcd0d8d991499d9c4eb81e7a7efe8d081e66af82
        designation: variant_calls
        variant_type: seqvars
        genomebuild: grch37
        mimetype: text/plain+x-bgzip+x-variant-call-format
    - uri: file://cases_import/tests/data/sample-brca1.vcf.gz.tbi
      individualToFileIdentifiers:
        index: NA12878-PCRF450-1
      fileAttributes:
        checksum: sha256:6b137335b7803623c3389424e7b64d704fb1c9f3f55792db2916d312e2da27ef
        designation: variant_calls
        variant_type: seqvars
        genomebuild: grch37
        mimetype: application/octet-stream+x-tabix-tbi-index
  metaData: *metadata-prototype
```
holtgrewe commented 1 year ago

Specification: Client Side of Import Process

Precondition:

Then:

  1. read YAML as phenopackets
  2. check that kit specification BED is there
  3. check designations and mimetypes of files are known
  4. check that at most one BAM file is there for each sample
  5. check that at most one seqvars VCF file is there
  6. check that the files exist in the storage
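
Checks 2 to 5 could be sketched as below, operating on the already-parsed manifest (the value under the top-level `family` key as plain dicts and lists). This is an illustration, not the planned implementation: the function names are hypothetical, the known designations and mimetypes are simply those appearing in the example manifest, and "kit specification BED is there" is read as "at least one `sequencing_targets` file exists". Check 6 is left out because it depends on the storage protocol.

```python
from collections import Counter

# Assumption: values taken from the example manifest; the real list may differ.
KNOWN_DESIGNATIONS = {"sequencing_targets", "read_alignments", "variant_calls"}
VCF_MIMETYPE = "text/plain+x-bgzip+x-variant-call-format"
KNOWN_MIMETYPES = {
    "text/x-bed+x-bgzip",
    "text/x-bam+x-bgzip",
    VCF_MIMETYPE,
    "application/octet-stream+x-tabix-tbi-index",
}


def iter_files(family):
    """Yield (owner, file entry) for proband, relatives, and family files."""
    members = [family.get("proband", {}), *family.get("relatives", [])]
    for member in members:
        for entry in member.get("files", []):
            yield member.get("id", "<unknown>"), entry
    for entry in family.get("files", []):
        yield "<family>", entry


def check_manifest(family):
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    bams_per_sample = Counter()
    seqvars_vcfs = 0
    has_targets_bed = False
    for owner, entry in iter_files(family):
        attrs = entry.get("fileAttributes", {})
        designation = attrs.get("designation")
        mimetype = attrs.get("mimetype")
        uri = entry.get("uri")
        if designation not in KNOWN_DESIGNATIONS:
            problems.append(f"{uri}: unknown designation {designation!r}")
        if mimetype not in KNOWN_MIMETYPES:
            problems.append(f"{uri}: unknown mimetype {mimetype!r}")
        if designation == "sequencing_targets":
            has_targets_bed = True
        elif designation == "read_alignments":
            bams_per_sample[owner] += 1
        elif (
            designation == "variant_calls"
            and attrs.get("variant_type") == "seqvars"
            and mimetype == VCF_MIMETYPE  # skips the .tbi index entry
        ):
            seqvars_vcfs += 1
    if not has_targets_bed:
        problems.append("no sequencing_targets (kit BED) file found")
    for sample, count in bams_per_sample.items():
        if count > 1:
            problems.append(
                f"{sample}: {count} read_alignments files, expected at most one"
            )
    if seqvars_vcfs > 1:
        problems.append(f"{seqvars_vcfs} seqvars VCF files, expected at most one")
    return problems
```

Collecting all problems in a list, instead of stopping at the first one, lets the client report everything wrong with a manifest in a single run.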