Implement support for new-style imports

holtgrewe commented 1 year ago

Is your feature request related to a problem? Please describe. New-style imports (based on depositing files in external storage and registering the case as phenopackets) are currently unsupported in VarFish.

Describe the solution you'd like Implement the import.

[x] implement parsing of per-project TOML configuration
[x] implement varfish-cli projects project-load-config PROJECT_UUID (see below)
[ ] implement validating phenopackets YAML files
[ ] implement submission of phenopackets YAML files

Describe alternatives you've considered N/A

Additional context N/A

holtgrewe commented 1 year ago

Specification: Extending Client Configuration

New-style imports deposit files in external storage. We thus need to make projects known to varfish. This should be done in the ~/.varfishrc.toml file.

Here is how to create a list of projects in TOML in general:

# ...

[[projects]]
uuid = "..."

[[projects]]
uuid = "..."

This will be loaded as {'projects': [{'uuid': '...'}, {'uuid': '...'}]} in JSON/Python.

Users can configure projects with the following schema:

[[projects]]
title = "..."  # optional; user-readable project title
uuid = "..."  # SODAR project UUID
# protocol to use for import
import_data_protocol = "s3" # one of "s3" | "http" | "https" | "file"
import_data_path = "..."      # path prefix to use
import_data_port = 80         # optional; port to use for connecting on import
import_data_user = "user"     # user/S3 access key
import_data_password = "key"  # password/S3 secret key to use
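
One way to validate a parsed `[[projects]]` entry against this schema is sketched below. The names `ProjectConfig` and `parse_project` are hypothetical and not part of varfish-cli. The sketch also assumes that every key not marked "optional" above is required, which the schema comments suggest but do not state outright.

```python
from dataclasses import dataclass
from typing import Optional

# Protocols listed in the schema comment above.
ALLOWED_PROTOCOLS = ("s3", "http", "https", "file")


@dataclass
class ProjectConfig:
    """Hypothetical container for one [[projects]] entry."""

    uuid: str
    import_data_protocol: str
    import_data_path: str
    import_data_user: str
    import_data_password: str
    title: Optional[str] = None
    import_data_port: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject protocols outside the documented set.
        if self.import_data_protocol not in ALLOWED_PROTOCOLS:
            raise ValueError(
                f"unsupported import_data_protocol: {self.import_data_protocol!r}"
            )


def parse_project(entry: dict) -> ProjectConfig:
    """Build a ProjectConfig from a parsed TOML table.

    Missing required keys or unknown keys raise TypeError.
    """
    return ProjectConfig(**entry)
```

For example, `parse_project({"uuid": "...", "import_data_protocol": "ftp", ...})` would raise `ValueError`, because `"ftp"` is not among the listed protocols.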

We should let users download these settings with the following command. It should fetch the settings described above from the server and append them to the projects configuration in the TOML file.

varfish-cli projects project-load-config PROJECT_UUID

holtgrewe commented 1 year ago

Specification: Manifest Files

This follows the phenopackets YAML format supported by VarFish Server.

General note on files:

list of designations should be taken from varfish-server code and documented also in varfish-cli
same for mimetypes

Notes on individuals' files:

only one BAM file supported for each individual
the first file for each individual is the sequencing kit
- this entry has special meaning: its URI should start with s3://varfish-server/seqmeta/enrichment-kits and it refers to the internal files
BAM files will only be registered as external files
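
The per-individual rules above could be checked along these lines. This is a sketch only: `check_individual_files` is a hypothetical helper, and treating files with the designation `read_alignments` as "BAM files" is an assumption taken from the example manifest further down, not from the varfish-server code.

```python
# Prefix named in the notes above for internal sequencing kit files.
KIT_PREFIX = "s3://varfish-server/seqmeta/enrichment-kits"


def check_individual_files(files: list[dict]) -> list[str]:
    """Return human-readable problems for one individual's `files` list."""
    problems = []
    # Rule: the first file is the sequencing kit and points at the internal files.
    if not files or not files[0].get("uri", "").startswith(KIT_PREFIX):
        problems.append(f"first file must be a sequencing kit under {KIT_PREFIX}")
    # Rule: only one BAM file per individual (identified by designation; an assumption).
    bams = [
        f for f in files
        if f.get("fileAttributes", {}).get("designation") == "read_alignments"
    ]
    if len(bams) > 1:
        problems.append(f"expected at most one BAM file, found {len(bams)}")
    return problems
```

Returning a list of problems instead of raising on the first one lets the client report everything that is wrong with a manifest in a single run.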

Notes on family files:

only one seqvars VCF allowed, all strucvars VCFs will be merged

phenopackets YAML example:

```yaml
# family with only metadata field
family:
  proband:
    id: index
    subject:
      id: index
      sex: MALE
      karyotypicSex: XY
    phenotypicFeatures:
      - type:
          id: "HP:0012469"
          label: "Infantile spasms"
        excluded: false
        modifiers:
          - id: "HP:0031796"
            label: "Recurrent"
    measurements:
      - assay:
          id: NCIT:C158253
          label: Targeted Genome Sequencing
        value:
          ontologyClass:
            id: NCIT:C171177
            label: Sequencing Data File
    files:
      - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
        individualToFileIdentifiers:
          index: index-PANEL
        fileAttributes:
          checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
          designation: sequencing_targets
          genomebuild: grch38
          mimetype: text/x-bed+x-bgzip
      - uri: s3://data-for-import/example/index.bam
        individualToFileIdentifiers:
          mother: index-PANEL
        fileAttributes:
          checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
          designation: read_alignments
          genomebuild: grch38
          mimetype: text/x-bam+x-bgzip
    diseases:
      - term:
          id: OMIM:164400
          label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
        excluded: false
    metaData: &metadata-prototype
      created: "2019-07-21T00:25:54.662Z"
      createdBy: Peter R.
      resources:
        - id: hp
          name: human phenotype ontology
          url: http://purl.obolibrary.org/obo/hp.owl
          version: "2018-03-08"
          namespacePrefix: HP
          iriPrefix: hp
      phenopacketSchemaVersion: "2.0"
  relatives:
    - id: mother
      subject:
        id: mother
        sex: FEMALE
        karyotypicSex: XX
      phenotypicFeatures:
        - type:
            id: "HP:0012469"
            label: "Infantile spasms"
          excluded: true
      measurements:
        - assay:
            id: NCIT:C158253
            label: Targeted Genome Sequencing
          value:
            ontologyClass:
              id: NCIT:C171177
              label: Sequencing Data File
      files:
        - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
          individualToFileIdentifiers:
            mother: mother-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: sequencing_targets
            genomebuild: grch38
            mimetype: text/x-bed+x-bgzip
        - uri: s3://data-for-import/example/mother.bam
          individualToFileIdentifiers:
            mother: mother-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: read_alignments
            genomebuild: grch38
            mimetype: text/x-bam+x-bgzip
      diseases:
        - term:
            id: OMIM:164400
            label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
          excluded: true
      metaData: *metadata-prototype
    - id: father
      subject:
        id: father
        sex: MALE
        karyotypicSex: XY
      phenotypicFeatures:
        - type:
            id: "HP:0012469"
            label: "Infantile spasms"
          excluded: true
      measurements:
        - assay:
            id: NCIT:C158253
            label: Targeted Genome Sequencing
          value:
            ontologyClass:
              id: NCIT:C171177
              label: Sequencing Data File
      files:
        - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
          individualToFileIdentifiers:
            father: father-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: sequencing_targets
            genomebuild: grch38
            mimetype: text/x-bed+x-bgzip
        - uri: s3://data-for-import/example/father.bam
          individualToFileIdentifiers:
            father: father-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: read_alignments
            genomebuild: grch38
            mimetype: text/x-bam+x-bgzip
      diseases:
        - term:
            id: OMIM:164400
            label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
          excluded: true
      metaData: *metadata-prototype
  pedigree:
    persons:
      - familyId: Case
        individualId: index
        paternalId: father
        maternalId: mother
        sex: MALE
        affectedStatus: AFFECTED
      - familyId: Case
        individualId: father
        paternalId: "0"
        maternalId: "0"
        sex: MALE
        affectedStatus: UNAFFECTED
      - familyId: Case
        individualId: mother
        paternalId: "0"
        maternalId: "0"
        sex: FEMALE
        affectedStatus: UNAFFECTED
  files:
    - uri: file://cases_import/tests/data/sample-brca1.vcf.gz
      individualToFileIdentifiers:
        index: NA12878-PCRF450-1
      fileAttributes:
        checksum: sha256:4042c2afa59f24a327b3852bfcd0d8d991499d9c4eb81e7a7efe8d081e66af82
        designation: variant_calls
        variant_type: seqvars
        genomebuild: grch37
        mimetype: text/plain+x-bgzip+x-variant-call-format
    - uri: file://cases_import/tests/data/sample-brca1.vcf.gz.tbi
      individualToFileIdentifiers:
        index: NA12878-PCRF450-1
      fileAttributes:
        checksum: sha256:6b137335b7803623c3389424e7b64d704fb1c9f3f55792db2916d312e2da27ef
        designation: variant_calls
        variant_type: seqvars
        genomebuild: grch37
        mimetype: application/octet-stream+x-tabix-tbi-index
  metaData: *metadata-prototype
```

holtgrewe commented 1 year ago

Specification: Client Side of Import Process

Precondition:

project is configured in ~/.varfishrc.toml

Then:

read YAML as phenopackets
check that kit specification BED is there
check designations and mimetypes of files are known
check that at most one BAM file is there for each sample
check that at most one seqvars VCF file is there
check that the files exist in the storage
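
The family-level checks in this list might be sketched as follows, operating on the already parsed mapping under the top-level `family` key. Several things here are assumptions: the sets of known designations and mimetypes contain only the values that appear in the example manifest in this thread (the authoritative lists are in the varfish-server code), the `exists` callback stands in for whatever storage lookup the client ends up using, and `iter_files` and `check_family` are hypothetical names.

```python
from typing import Callable, Iterator

# Only the values seen in the example manifest; NOT the authoritative lists.
KNOWN_DESIGNATIONS = {"sequencing_targets", "read_alignments", "variant_calls"}
VCF_MIMETYPE = "text/plain+x-bgzip+x-variant-call-format"
KNOWN_MIMETYPES = {
    "text/x-bed+x-bgzip",
    "text/x-bam+x-bgzip",
    VCF_MIMETYPE,
    "application/octet-stream+x-tabix-tbi-index",
}


def iter_files(family: dict) -> Iterator[dict]:
    """Yield every file entry: proband first, then relatives, then family level."""
    members = [family.get("proband", {})] + list(family.get("relatives", []))
    for member in members:
        yield from member.get("files", [])
    yield from family.get("files", [])


def check_family(family: dict, exists: Callable[[str], bool]) -> list[str]:
    """Return human-readable problems found in a parsed family manifest."""
    problems = []
    for entry in iter_files(family):
        uri = entry.get("uri", "<missing uri>")
        attrs = entry.get("fileAttributes", {})
        if attrs.get("designation") not in KNOWN_DESIGNATIONS:
            problems.append(f"{uri}: unknown designation {attrs.get('designation')!r}")
        if attrs.get("mimetype") not in KNOWN_MIMETYPES:
            problems.append(f"{uri}: unknown mimetype {attrs.get('mimetype')!r}")
        if not exists(uri):
            problems.append(f"{uri}: file not found in storage")
    # Count VCFs by mimetype so that the .tbi index, which shares the
    # designation and variant_type in the example, is not counted as a second VCF.
    seqvars_vcfs = [
        f for f in family.get("files", [])
        if f.get("fileAttributes", {}).get("variant_type") == "seqvars"
        and f.get("fileAttributes", {}).get("mimetype") == VCF_MIMETYPE
    ]
    if len(seqvars_vcfs) > 1:
        problems.append(f"expected at most one seqvars VCF, found {len(seqvars_vcfs)}")
    return problems
```

Passing `exists` as a callback keeps the checks independent of the storage protocol, so the same function could be exercised in tests with `lambda uri: True`.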
