iossifovlab / gpf

GPF: Genotypes and Phenotypes in Families
MIT License

Implement resource statistics access interface for the genomic scores statistics #331

Open lchorbadjiev opened 1 year ago

lchorbadjiev commented 1 year ago

The prebuilt statistics are part of the genomic resource files. For example, for the position score phyloP7way, the directory structure should be similar to the following:

phyloP7way/
├── genomic_resource.yaml
├── statistics
│   ├── statistics.hash
│   ├── histogram_phyloP7way.yaml
│   ├── histogram_phyloP7way.png
│   └── min_max_phyloP7way.yaml
├── index.html
├── phyloP7way.bedGraph.gz
├── phyloP7way.bedGraph.gz.dvc
├── phyloP7way.bedGraph.gz.tbi
└── phyloP7way.bedGraph.gz.tbi.dvc

For CADD_v1.4, which has two scores, the directory structure should be similar to:

CADD_v1.4/
├── genomic_resource.yaml
├── statistics
│   ├── statistics.hash
│   ├── histogram_cadd_phred.yaml
│   ├── histogram_cadd_phred.png
│   ├── histogram_cadd_raw.yaml
│   ├── histogram_cadd_raw.png
│   ├── min_max_cadd_phred.yaml
│   └── min_max_cadd_raw.yaml
├── index.html
├── whole_genome_SNVs.tsv.gz
├── whole_genome_SNVs.tsv.gz.dvc
├── whole_genome_SNVs.tsv.gz.tbi
└── whole_genome_SNVs.tsv.gz.tbi.dvc
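
The naming convention visible in the two listings above can be captured in a pair of small helpers. This is only a sketch: the function names `min_max_filename` and `histogram_filename` are hypothetical, and the `statistics/` prefix and file name patterns are inferred solely from the directory trees shown above.

```python
STATISTICS_DIR = "statistics"  # directory name taken from the listings above


def min_max_filename(score_id: str) -> str:
    """Return the relative path of the min/max file for score_id."""
    return f"{STATISTICS_DIR}/min_max_{score_id}.yaml"


def histogram_filename(score_id: str) -> str:
    """Return the relative path of the histogram file for score_id."""
    return f"{STATISTICS_DIR}/histogram_{score_id}.yaml"


if __name__ == "__main__":
    # CADD_v1.4 carries two scores, so each helper is called once per score.
    for score_id in ("cadd_raw", "cadd_phred"):
        print(min_max_filename(score_id), histogram_filename(score_id))
```

With helpers like these, a resource with several scores only needs its list of score IDs to locate every statistics file.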

An example score statistics interface for accessing this data should be similar to the following:


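from typing import Optional

# GenomicResource and Histogram are expected to come from the GPF code base.


class GenomicScoreStatistics: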

    def __init__(self, resource: GenomicResource):
        self.resource = resource

    def get_score_min(self, score_id: str) -> Optional[float]:
        """Read and return score min value.

        The method should construct the MinMaxStatistics filename for
        score_id. If the file does not exist, return None.

        If the file exists, read the content of the file
        from the resource, deserialize it using MinMaxStatistic.deserialize
        and return min_value.
        """

    def get_score_max(self, score_id: str) -> Optional[float]:
        """Read and return score max value.

        The implementation should be similar to get_score_min.
        """

    def get_score_histogram(self, score_id: str) -> Optional[Histogram]:
        """Read and return the histogram of score_id.

        This method should construct the score_id histogram filename. If the
        file does not exist, the method should return None.

        If the file exists, the method will read the file content and deserialize it using
        Histogram.deserialize() and return the result.
        """

IvoTod commented 1 year ago

We should decide whether to return None when a score is not found, or to raise an error instead.
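
To make the trade-off concrete, here is a minimal sketch of both behaviours over a plain dict of precomputed minimums. The names `ScoreStatisticsNotFound`, `get_min_or_none`, and `get_min_or_raise` are hypothetical and not part of GPF.

```python
from typing import Dict, Optional


class ScoreStatisticsNotFound(LookupError):
    """Hypothetical error for a score that has no prebuilt statistics."""


def get_min_or_none(mins: Dict[str, float], score_id: str) -> Optional[float]:
    # Option 1: a missing score yields None; every caller must check for it.
    return mins.get(score_id)


def get_min_or_raise(mins: Dict[str, float], score_id: str) -> float:
    # Option 2: a missing score fails loudly at the point of the lookup.
    try:
        return mins[score_id]
    except KeyError:
        raise ScoreStatisticsNotFound(score_id) from None
```

Returning None keeps "statistics not built yet" a normal, expected state, while raising makes a mistyped score_id harder to overlook.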