sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
232 stars 32 forks source link

"Compute" method naming convention #651

Open ravwojdyla opened 3 years ago

ravwojdyla commented 3 years ago

Some discussion in https://github.com/pystatgen/sgkit/pull/647#discussion_r686015457

As pointed out by @jeromekelleher it's interesting to look at the main init, to see current names. Some current options:

Some of them make sense in different context. In this issue we want to discuss the convention and document it.

timothymillar commented 3 years ago

Just noting that I started using infer_ for the ploidy methods because count_ didn't sound correct. I don't think infer_ is great either, but it's the best verb I could come up with for extracting something that's implicit from the call_genotype array.

timothymillar commented 3 years ago

It's also worth pointing out that in many cases in which the method doesn't have a prefix, the calculated variable has stat_ as a prefix. E.g. Tajimas_D returns variable stat_Tajimas_D.

jeromekelleher commented 3 years ago

I can see a few possibilities for a general rule that we could apply to everything:

  1. call the method to compute a variable the same thing as the variable
  2. call the method to compute a variable f"compute_{variable_name}" (or some other prefix)

There's some questions then about whether we should strip off the "group" prefix on the variable names for the compute function (like Tajima's D, as @timothymillar points out).

In general, I think we should try to get a point where we can determine the name of a method to compute a variable from the variable name. It would be nice to be able to formalise the graph of variable dependencies, and this would be a good step in that direction. Do you think this makes sense @tomwhite, and is something worthwhile to aim for?

tomwhite commented 3 years ago

In general, I think we should try to get a point where we can determine the name of a method to compute a variable from the variable name.

One thing worth mentioning is that there is not a 1:1 mapping between method and output variable, since some methods create multiple output variables. E.g. Garud_H creates stat_Garud_h1, stat_Garud_h12, stat_Garud_h123, stat_Garud_h2_h1 variables. (Also some variables are created by multiple methods - I'm thinking of the IO functions here.) So this makes determining the method name a bit harder.

It would be nice to be able to formalise the graph of variable dependencies, and this would be a good step in that direction.

I agree that would be nice.

I wrote a quick hack to inspect the call frame for methods to find their input and output variables, then ran the unit tests to generate reasonable call coverage. It's not complete, but gives an idea of the relationships between input variables, methods, and output variables. Notice that some methods can produce (or consume) different sets of variables depending on how they are called.

I'm not sure what the next step is though!

Method name -> input variables:

``` Fst ['stat_divergence'] [] Garud_H ['call_genotype'] [] Tajimas_D ['variant_allele_count', 'stat_diversity'] [] call_allele_frequencies ['call_allele_count'] [] convert_probability_to_call ['call_genotype_probability'] [] count_call_alleles ['call_genotype'] [] count_cohort_alleles ['call_allele_count'] [] count_variant_alleles ['call_allele_count'] [] create_genotype_dosage_dataset [] divergence ['cohort_allele_count'] [] diversity ['cohort_allele_count'] [] ['cohort_allele_count_non_default'] filter_partial_calls ['call_genotype'] [] gwas_linear_regression ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_0'] [] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_1'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_2'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_3'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_4'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_5'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_6'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_7'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_8'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'trait_9'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_0'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_1'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_2'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_3'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_4'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_5'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_6'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_7'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_8'] ['dosage', 'covar_0', 'covar_1', 'covar_2', 'covar_3', 'trait_9'] ['dosage', 'trait_0'] ['dosage', 'trait_1'] ['dosage', 'trait_2'] ['dosage', 'trait_3'] ['dosage', 'trait_4'] ['dosage', 'trait_5'] ['dosage', 'trait_6'] ['dosage', 'trait_7'] ['dosage', 'trait_8'] ['dosage', 'trait_9'] ['dosage', 'covar_0', 'trait_0'] ['dosage', 'covar_0', 'trait_1'] ['dosage', 'covar_0', 'trait_2'] ['dosage', 'covar_0', 'trait_3'] ['dosage', 'covar_0', 'trait_4'] ['dosage', 'covar_0', 'trait_5'] ['dosage', 'covar_0', 'trait_6'] ['dosage', 'covar_0', 'trait_7'] ['dosage', 'covar_0', 'trait_8'] ['dosage', 'covar_0', 'trait_9'] ['dosage', 'covar_0', 'trait_0', 'trait_1', 'trait_2', 'trait_3', 'trait_4', 'trait_5', 'trait_6', 'trait_7', 'trait_8', 'trait_9'] hardy_weinberg_test ['call_genotype_mask', 'call_genotype'] [] ['variant_genotype_count'] individual_heterozygosity ['call_allele_count'] [] infer_call_ploidy ['call_genotype'] [] infer_sample_ploidy [] infer_variant_ploidy [] ld_matrix ['dosage'] ['variant_score'] observed_heterozygosity ['call_heterozygosity'] [] pbs ['stat_Fst'] [] pc_relate ['call_genotype', 'call_genotype_mask', 'sample_pca_projection'] [] pca [] read_vcfzarr [] regenie ['call_dosage', 'variant_contig', 'sample_covariate', 'sample_trait'] [] sample_stats ['call_genotype', 'call_genotype_mask'] [] ['call_genotype_mask', 'call_genotype'] simulate_genotype_call_dataset [] variant_stats ['call_genotype', 'call_genotype_mask'] [] ['call_genotype_mask', 'call_genotype'] ['variant_allele_count'] window_by_position [] window_by_variant [] ```

Method name -> output variables:

``` Fst ['stat_Fst'] Garud_H ['stat_Garud_h1', 'stat_Garud_h12', 'stat_Garud_h123', 'stat_Garud_h2_h1'] Tajimas_D ['stat_Tajimas_D'] call_allele_frequencies ['call_allele_frequency'] convert_probability_to_call ['call_genotype', 'call_genotype_mask'] count_call_alleles ['call_allele_count'] count_cohort_alleles ['cohort_allele_count'] count_variant_alleles ['variant_allele_count'] create_genotype_dosage_dataset ['variant_contig', 'variant_position', 'variant_allele', 'sample_id', 'call_dosage', 'call_dosage_mask', 'call_genotype_probability', 'call_genotype_probability_mask', 'variant_id'] divergence ['stat_divergence'] diversity ['stat_diversity'] filter_partial_calls ['call_genotype_complete', 'call_genotype_complete_mask'] gwas_linear_regression ['variant_linreg_beta', 'variant_linreg_t_value', 'variant_linreg_p_value'] hardy_weinberg_test ['variant_n_het', 'variant_n_hom_ref', 'variant_n_hom_alt', 'variant_n_non_ref'] ['variant_hwe_p_value'] individual_heterozygosity ['call_heterozygosity'] infer_call_ploidy ['call_genotype_non_allele'] ['call_ploidy'] infer_sample_ploidy ['sample_ploidy'] infer_variant_ploidy ['variant_ploidy'] observed_heterozygosity ['stat_observed_heterozygosity'] pbs ['stat_pbs'] pc_relate ['pc_relate_phi'] read_vcfzarr ['variant_contig', 'variant_position', 'variant_allele', 'sample_id', 'call_genotype', 'call_genotype_mask', 'variant_id'] regenie ['regenie_base_prediction', 'regenie_meta_prediction'] ['regenie_base_prediction', 'regenie_meta_prediction', 'regenie_loco_prediction'] sample_stats ['sample_n_called', 'sample_call_rate'] ['sample_n_het', 'sample_n_hom_ref', 'sample_n_hom_alt', 'sample_n_non_ref'] simulate_genotype_call_dataset ['variant_contig', 'variant_position', 'variant_allele', 'sample_id', 'call_genotype', 'call_genotype_mask'] variant_stats ['variant_n_called', 'variant_call_rate'] ['variant_n_het', 'variant_n_hom_ref', 'variant_n_hom_alt', 'variant_n_non_ref'] ['variant_allele_count', 'variant_allele_total', 'variant_allele_frequency'] ['variant_allele_total', 'variant_allele_frequency'] window_by_position ['window_contig', 'window_start', 'window_stop'] window_by_variant ['window_contig', 'window_start', 'window_stop'] ```