monarch-initiative / genophenocorr

Genotype Phenotype Correlation
https://monarch-initiative.github.io/genophenocorr/stable
MIT License
4 stars 1 forks source link

compare_func -- suggestions #8

Closed pnrobinson closed 10 months ago

pnrobinson commented 1 year ago

@lnrekerle

is_not_var_type -- it is not good to have two methods that return an inverted boolean. The client code should be re-written to use just the "is_variant_type" method. -- at a minimum the method should be like this to avoid odd bugs

is_not_var_type: return not self.is_var_type()

(Similar comments for other pairs of functions in this file)

is_var_match -- would this work better within the Patient class

def has_variant_of_type(self, variant_type:str):

The following function is hard-coding genome build 37 and will almost certainly lead to interesting bugs down the line... Also, the name of the function is "verify", but the function appears to be creating a Variant object.

def verify_var(variant):
    contig, start, ref, alt = variant.split(':')
    var = vc.Variant(contig, start, ref, alt, ensembl = pyensembl.ensembl_grch37)
    return var

def in_feature(pat, feature):

--- it would be simpler to not use the "isIn" variable as follows

    featureDF = pat.protein.features
    loc = pat.variant.protein_effect_location
    if loc is not None and not featureDF.empty:
        for row in featureDF.iterrows():
            if row[0] == feature or row[1]['type'] == feature:
                if row[1]['start'] <= loc <= row[1]['end']:
                    return True
    return False

In general, I wonder if we can refactor to put these functions in other classes?

lnrekerle commented 1 year ago

@pnrobinson

I will definitely fix the in_feature and the hard coded hg37 in there!

For the inverted functions, I did that because these are all called in the "run_stats" function of compare.py, so it made the most sense to me that they emphasize what the user is trying to do. for example if I'm running:

run_stats(cohort, is_var_type, is_not_var_type, "missense", "missense", etc...)

it's more clear that you are comparing those that are type missense to those that are not type missense. Which this is also why they are in there own file rather than in other classes, so they can be easily accessed by the run_stats function.

pnrobinson commented 1 year ago

Probably the API to this function needs to be rethought. If "is_var_type" and "is_not_vartype" completely divide up the space, then we do not need to pass two functions and at some point in the code we can go

(not is_var_type(arguments))

or something like that. Having two functions violates the one source of truth principle which is a common source of hard to track down bugs!