nf-cmgg / structural

A bioinformatics best-practice analysis pipeline for calling structural variants (SVs), copy number variants (CNVs) and repeat region expansions (RREs) from short DNA reads
https://nf-cmgg.github.io/structural/
MIT License

Add VCF to JSON conversion #60

Open nvnieuwk opened 9 months ago

nvnieuwk commented 9 months ago

Description of feature

Add VCF to JSON conversion, with certain filters applied, for use in visualizing the SVs

janvandenschilden commented 9 months ago
from pydantic import BaseModel, Field, validator
from typing import Optional

class Model(BaseModel):
    chr: str = Field(..., description="The chromosome where the structural variant (SV) is located")
    start_position: int = Field(..., description="The start position of the SV on the chromosome", ge=0)
    end_position: int = Field(..., description="The end position of the SV on the chromosome", ge=0)
    end_chr: Optional[str] = Field(None, description="The end chromosome where the SV is located, if different from the start chromosome")
    end_chr_start_position: Optional[int] = Field(None, description="The start position of the SV on the end chromosome, if different from the start chromosome", ge=0)
    end_chr_end_position: Optional[int] = Field(None, description="The end position of the SV on the end chromosome, if different from the start chromosome", ge=0)
    sv_type: str = Field(..., description="The type of the SV, such as deletion, duplication, inversion, translocation, etc.")
    size: int = Field(..., description="The size of the SV in base pairs", ge=0)
    caller: str = Field(..., description="The name of the tool or algorithm that detected the SV")
    qc: str = Field(..., description="The quality control status of the SV")
    genotype: str = Field(..., description="The genotype of the SV, such as 0/0, 0/1, 1/1, etc.")
    relevant_genes: Optional[list[str]] = Field(None, description="The list of genes that are affected by the SV")
    population_frequency: Optional[float] = Field(None, description="The frequency of the SV in the general population, if available", ge=0, le=1)
    repeat_content: Optional[bool] = Field(None, description="Whether the SV is located in a repeat region or not")

    # Attach the check to the last of the three end_chr fields only: `values` holds
    # just the fields validated earlier (in declaration order), so this is the first
    # point where all three are available. Running it on each field would reject
    # every entry that sets end_chr, because the later fields are not in `values` yet.
    @validator('end_chr_end_position', always=True)
    def check_end_chr(cls, v, values):
        end_chr = values.get('end_chr')
        end_chr_start_position = values.get('end_chr_start_position')
        # If end_chr is not None, then end_chr_start_position and end_chr_end_position must also be not None
        if end_chr is not None and (end_chr_start_position is None or v is None):
            raise ValueError('end_chr_start_position and end_chr_end_position must be specified if end_chr is not None')
        # If end_chr is None, then end_chr_start_position and end_chr_end_position must also be None
        if end_chr is None and (end_chr_start_position is not None or v is not None):
            raise ValueError('end_chr_start_position and end_chr_end_position must be None if end_chr is None')
        return v
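
The rule behind the validator is all-or-nothing: the three `end_chr` fields are either all set or all `None`. A dependency-free sketch of the same check on a plain dict may help when spot-checking JSON entries before they reach the model. The names `END_FIELDS` and `end_fields_consistent` are made up for this illustration and are not part of the pipeline.

```python
# Hypothetical helper mirroring the model's end_chr rule on a plain dict.
END_FIELDS = ("end_chr", "end_chr_start_position", "end_chr_end_position")


def end_fields_consistent(entry: dict) -> bool:
    """Return True if the three end_chr fields are all set or all None."""
    present = [entry.get(name) is not None for name in END_FIELDS]
    return all(present) or not any(present)


# SV contained in one chromosome: none of the end_chr fields are set.
same_chromosome = {"end_chr": None, "end_chr_start_position": None, "end_chr_end_position": None}
# SV ending on another chromosome: all three fields are set.
other_chromosome = {"end_chr": "chr5", "end_chr_start_position": 1000, "end_chr_end_position": 1001}
# Inconsistent entry: end_chr is set but the positions are missing.
incomplete = {"end_chr": "chr5", "end_chr_start_position": None, "end_chr_end_position": None}

print(end_fields_consistent(same_chromosome))   # True
print(end_fields_consistent(other_chromosome))  # True
print(end_fields_consistent(incomplete))        # False
```
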
nvnieuwk commented 9 months ago

Can you also post an example of a JSON entry used to create this model?

janvandenschilden commented 9 months ago
{
  "chr": "chr1",
  "start_position": 123456,
  "end_position": 123789,
  "end_chr": null,
  "end_chr_start_position": null,
  "end_chr_end_position": null,
  "sv_type": "deletion",
  "size": 333,
  "caller": "Manta",
  "qc": "PASS",
  "genotype": "0/1",
  "relevant_genes": ["BRCA1"],
  "population_frequency": 0.001,
  "repeat_content": false
}
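
To make the conversion step concrete, here is a rough sketch of turning one VCF data line into an entry of this shape, using only the standard library. Several things are assumptions rather than decisions made in this issue: the function name `vcf_line_to_entry`, the `FILTER == PASS` filter (the issue only says "certain filter"), the use of the `SVTYPE`, `END` and `SVLEN` INFO keys with a single `SVLEN` value, and taking the genotype from the first sample column. The annotation fields are left as `null` and `sv_type` keeps the caller's code (for example `DEL`) instead of a spelled-out name.

```python
import json
from typing import Optional


def vcf_line_to_entry(line: str, caller: str) -> Optional[dict]:
    """Convert one VCF data line into a dict shaped like the JSON entry above.

    Returns None for records that fail the assumed FILTER == PASS filter.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, _ref, _alt, _qual, flt, info_str = fields[:8]

    # Assumed filter: keep only records that passed the caller's own filters.
    if flt != "PASS":
        return None

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info = {}
    for item in info_str.split(";"):
        key, _, value = item.partition("=")
        info[key] = value

    start = int(pos)
    end = int(info.get("END", start))
    # SVLEN is commonly negative for deletions, so use its absolute value;
    # fall back to END - POS when SVLEN is absent.
    size = abs(int(info["SVLEN"])) if "SVLEN" in info else end - start

    # Genotype of the first sample, if FORMAT and sample columns are present.
    genotype = "./."
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        if "GT" in format_keys:
            genotype = sample_values[format_keys.index("GT")]

    return {
        "chr": chrom,
        "start_position": start,
        "end_position": end,
        "end_chr": None,
        "end_chr_start_position": None,
        "end_chr_end_position": None,
        "sv_type": info.get("SVTYPE", "unknown"),
        "size": size,
        "caller": caller,
        "qc": flt,
        "genotype": genotype,
        # These would have to come from a separate annotation step.
        "relevant_genes": None,
        "population_frequency": None,
        "repeat_content": None,
    }


# Made-up record that mirrors the example entry above.
record = "chr1\t123456\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=123789;SVLEN=-333\tGT\t0/1"
entry = vcf_line_to_entry(record, caller="Manta")
print(json.dumps(entry, indent=2))
```

Breakend records that end on another chromosome would need extra parsing of the ALT column to fill the `end_chr` fields. The sketch leaves that out.
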
nvnieuwk commented 9 months ago

Some simple notes I already have after seeing this:

These are just suggestions, please let me know what you think :smiley: