tellae / bhepop2

Synthetic population enrichment from aggregated data
https://bhepop2.readthedocs.io/en/latest/

Distributions could be refactored into classes #41

Open leo-desbureaux-tellae opened 7 months ago

leo-desbureaux-tellae commented 7 months ago

Current state

In the current version of Bhepop2, EnrichmentSource subclasses (for instance MarginalDistributions) describe the nature of the distributions they manipulate (qualitative, or quantitative with deciles).

This leads to code duplication: classes like QuantitativeGlobalDistribution also use quantitative distributions, but only a single one that describes the whole population. Code such as source data validation, feature state evaluation, and drawing a value within a feature state is then duplicated.

Proposition

A better implementation would be for EnrichmentSource subclasses to describe only the structure of the distributions (e.g. one for each modality) and how they are linked to the population (e.g. a distribution applies to the individuals with the matching modality), independently of the type of distributions they contain.

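class MarginalDistributions(EnrichmentSource):
    attribute_selection: list
    modalities: dict
    # one Distribution per (attribute, modality) pair
    distributions: dict[tuple, Distribution]

    def get_modality_distribution(self, attribute, modality) -> Distribution:
        return self.distributions[(attribute, modality)]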

Other classes would then describe the various kinds of distributions: qualitative, quantitative with deciles, and we could even imagine using discrete probability distributions like Poisson. Specific functions, like evaluating an interval probability from a decile distribution by interpolation (interpolate_feature_prob), could be moved to their dedicated class, for instance DecilesDistribution.
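As a rough illustration of the idea above, here is a minimal sketch of what such distribution classes could look like. It is not the existing Bhepop2 code: the constructor signatures, the `cdf` helper, the representation of a quantitative feature state as a `(low, high)` interval, the use of 11 bounds (D0 to D10) and the choice of linear interpolation are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Distribution(ABC):
    """Hypothetical base class: a single distribution of the feature."""

    @abstractmethod
    def get_prob_of_feature_state(self, feature_state):
        """Return the probability of the given feature state."""


class QualitativeDistribution(Distribution):
    """Finite set of values, each with its own probability."""

    def __init__(self, values, probs):
        # Validation lives here once, instead of in each source class.
        if len(values) != len(probs):
            raise ValueError("values and probs must have the same length")
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("probs must sum to 1")
        self.values = list(values)
        self.probs = list(probs)

    def get_prob_of_feature_state(self, feature_state):
        return self.probs[self.values.index(feature_state)]


class DecilesDistribution(Distribution):
    """Quantitative distribution given by 11 bounds D0..D10 (assumption)."""

    def __init__(self, deciles):
        increasing = all(a < b for a, b in zip(deciles, deciles[1:]))
        if len(deciles) != 11 or not increasing:
            raise ValueError("expected 11 strictly increasing values")
        self.deciles = list(deciles)

    def cdf(self, x):
        """P(X <= x), linearly interpolated between consecutive deciles."""
        d = self.deciles
        if x <= d[0]:
            return 0.0
        if x >= d[-1]:
            return 1.0
        for i in range(10):
            if d[i] <= x < d[i + 1]:
                # Each decile interval carries a probability mass of 1/10.
                return (i + (x - d[i]) / (d[i + 1] - d[i])) / 10

    def get_prob_of_feature_state(self, feature_state):
        # A feature state is assumed to be a (low, high) interval.
        low, high = feature_state
        return self.cdf(high) - self.cdf(low)
```

With this layout, a source class would only need to call `get_prob_of_feature_state` without knowing which kind of distribution it holds.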

In the end, instead of having something like this:

---
title: Enrichment classes
---
classDiagram
    SyntheticPopulationEnrichment <|-- Bhepop2Enrichment
    SyntheticPopulationEnrichment *-- EnrichmentSource
    EnrichmentSource <|-- MarginalDistributions
    MarginalDistributions <|-- QualitativeMarginalDistributions
    MarginalDistributions <|-- QuantitativeMarginalDistributions

    class SyntheticPopulationEnrichment{
        <<Abstract>>
        +DataFrame population
        +EnrichmentSource source
        +String feature_name
        +int seed
        +assign_features()
        +compare_with_source()
    }

    class Bhepop2Enrichment{
        +MarginalDistributions source
        -_optimise()
        -_get_feature_probs()
    }

namespace Enrichment sources {

    class EnrichmentSource{
        <<Abstract>>
        +any data
        +list feature_values
        +int nb_feature_values
        +get_value_for_feature(feature)
        +compare_with_populations(populations, feature_name)
    }

    class MarginalDistributions{
        <<Abstract>>
        +list attribute_selection
        +dict modalities
        +get_modality_distribution()
        +compute_feature_prob(attribute, modality)
        -_validate_data_type()
    }

    class QualitativeMarginalDistributions{

    }

    class QuantitativeMarginalDistributions{

    }
}

We would have something like this:

---
title: Enrichment classes
---
classDiagram
    SyntheticPopulationEnrichment <|-- Bhepop2Enrichment
    SyntheticPopulationEnrichment *-- EnrichmentSource
    EnrichmentSource <|-- MarginalDistributions
    EnrichmentSource *-- Distribution
    Distribution <|-- DecilesDistribution
    Distribution <|-- QualitativeDistribution

   namespace Enrichment { 
    class SyntheticPopulationEnrichment{
        <<Abstract>>
        +DataFrame population
        +EnrichmentSource source
        +String feature_name
        +int seed
        +assign_features()
        +compare_with_source()
    }

    class Bhepop2Enrichment{
        +MarginalDistributions source
        -_optimise()
        -_get_feature_probs()
    }
   }

namespace Sources {

    class EnrichmentSource{
        <<Abstract>>
        +any data_distributions
        +get_value_for_feature(feature)
        +compare_with_populations(populations, feature_name)
    }

    class MarginalDistributions{
        +list attribute_selection
        +dict modalities
        +get_modality_distribution()
        +compute_feature_prob(attribute, modality)
    }
}

namespace Distributions {
    class Distribution{
        <<Abstract>>
        +any some_attributes
        +get_prob_of_feature_state(feature_state)
        +evaluate_distribution_on_population(column)
    }

    class QualitativeDistribution{
        +list values
        +list probs
    }

    class DecilesDistribution{
        +list deciles
    }
}
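To make the composition in the target diagram more concrete, here is a minimal sketch of a MarginalDistributions source that delegates to whatever distribution objects it holds. Everything in it is an assumption for illustration: the constructor, the `feature_states` attribute, the return type of `compute_feature_prob`, and the `UniformIntervalDistribution` stand-in, which is hypothetical and only plays the role of some Distribution subclass.

```python
class UniformIntervalDistribution:
    """Hypothetical stand-in for any Distribution subclass."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def get_prob_of_feature_state(self, feature_state):
        # Probability mass of the overlap between the state and [low, high].
        a, b = feature_state
        overlap = min(b, self.high) - max(a, self.low)
        return max(0.0, overlap) / (self.high - self.low)


class MarginalDistributions:
    """Describes only the structure: one distribution per modality."""

    def __init__(self, distributions, feature_states):
        # distributions: {(attribute, modality): Distribution}
        self.distributions = dict(distributions)
        self.feature_states = list(feature_states)
        self.modalities = {}
        for attribute, modality in self.distributions:
            self.modalities.setdefault(attribute, []).append(modality)
        self.attribute_selection = list(self.modalities)

    def get_modality_distribution(self, attribute, modality):
        return self.distributions[(attribute, modality)]

    def compute_feature_prob(self, attribute, modality):
        # Delegation: no knowledge of the distribution type is needed here.
        dist = self.get_modality_distribution(attribute, modality)
        return [dist.get_prob_of_feature_state(s) for s in self.feature_states]


source = MarginalDistributions(
    {
        ("age", "young"): UniformIntervalDistribution(0, 100),
        ("age", "old"): UniformIntervalDistribution(0, 50),
    },
    feature_states=[(0, 25), (25, 50), (50, 100)],
)
young_probs = source.compute_feature_prob("age", "young")
old_probs = source.compute_feature_prob("age", "old")
```

Swapping the stand-in for a deciles-based or qualitative distribution would not require any change in the source class, which is the point of the refactoring.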

Related issues

This should solve at least partially the following issues:

Questions

Some points are still not clear about such implementation:

Contribution

This work will likely take some time. Since we are probably the only users of bhepop2 and no new methodologies or source types are expected to be developed, this would be a lot of refactoring for little practical benefit.

However, if you are interested in contributing code to Bhepop2 and this work would help you, or if you would like to initiate it, feel free to contact us!