
mimesis_stats

This package exists to extend the capabilities of mimesis for use in statistical data pipelines.

mimesis provides fast fake data generation, and comes with a wide range of data providers, formats, structures and localised options. In addition, it provides a schema structure which makes data generation for data frames very easy. Before using this package it is recommended to become familiar with the basics of mimesis fake data generation, such as through this getting started page.
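
If you have not used mimesis before, a minimal sketch of base mimesis usage, using the standard Person provider with the default locale, looks like this:

>>> from mimesis import Person
>>> person = Person()
>>> person.full_name()  # returns a random full name as a string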

Thanks to this extensibility, custom data providers can be created for use within the framework. The mimesis_stats package builds on this framework for use in statistical pipelines, particularly for generating dummy survey data.

However, mimesis data generation and its providers have two primary limitations, which this package extension addresses as follows:

mimesis_stats provides a StatsSchema object that allows multiple variables related to one another to be created together, using methods from MultiVariable.

mimesis_stats adds data providers for discrete choice distributions, as well as the ability to pass in custom functions, such as those from numpy or scipy, or user-defined functions.

To see an example use case of this package, scroll to the "Working with pandas" section at the bottom of this document.

Installation

This package is available on PyPI, but it is recommended to install via GitHub while it is in development.

Obtaining the most recent version of the package can be done using:

> pip install git+https://github.com/jonathonmellor/mimesis-stats@main
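
Alternatively, assuming the PyPI distribution name matches the repository name, the released version can be installed with:

> pip install mimesis-stats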

mimesis_stats providers

The package contains two supplementary providers, the main objects used to generate mimesis data: one for producing discrete / continuous distributions and the other for dependent multi-variable samples.

Distribution

Ideal for generating categorical data with Distribution.discrete_distribution() or a numerical variable using Distribution.generic_distribution() with a user-defined or numpy/scipy function.
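
As a sketch of the generic route, assuming generic_distribution accepts a func argument plus keyword arguments forwarded to it (the same pattern used in the StatsSchema example later in this document):

>>> from mimesis_stats.providers.distribution import Distribution
>>> from numpy.random import pareto
>>> Distribution().generic_distribution(func=pareto, a=3)  # returns a single draw from a Pareto(3) distribution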

All mimesis_stats providers have null_prop and null_value arguments to add missing-at-random null values. For multi-variable providers this is done by passing in a list of proportions and missing values corresponding to each variable generated.

Categorical

General use for discrete distributions; the main additions over base mimesis are the weighting and null options.

>>> from mimesis_stats.providers.distribution import Distribution
>>> Distribution().discrete_distribution(
...     population=["First", "Second", "Third"],
...     weights=[0.01, 0.01, 0.98]
... )
"Third"
>>> Distribution().discrete_distribution(
...     population=["Apple", "Banana"],
...     weights=[0.5, 0.5],
...     null_prop=1.0,
...     null_value=None
... )
None

MultiVariable

This provider allows multiple variables that are dependent on or related to each other to be created through one provider call.

In practice, produced dictionary key-value pairs can be separated into different variables.

>>> from mimesis_stats.providers.multivariable import MultiVariable
>>> MultiVariable().dependent_variables(
...     variable_names=["consent", "favourite_fruit"],
...     options=[("Yes", "Lemon"), ("No", None)],
...     weights=[0.7, 0.3]
... )
{"consent": "Yes", "favourit_fruit": "Lemon")

Within the possible combinations, other provider calls can be made to extend the complexity of the generation.

>>> from mimesis_stats.providers.multivariable import MultiVariable
>>> from mimesis import Food
>>> MultiVariable().dependent_variables(
...     variable_names=["consent", "favourite_fruit"],
...     options=[("Yes", Food.fruit()), ("No", None)],
...     weights=[0.9, 0.1]
... )
{"consent": "Yes", "favourit_fruit": "Banana")

StatsSchema

For generating samples of many variables consistently it is recommended to use a schema. mimesis has a Schema object; however, to take full advantage of the seeding and the multi-variable nature of the mimesis_stats.providers approaches, StatsSchema should be used instead to define a schema.

A StatsSchema object requires a schema to be passed to it.

A schema/schema_blueprint is a lambda function that contains the code to generate each variable when called.

To define a schema_blueprint a StatsField (equivalent to Field from mimesis) needs to be declared. This sets the seed and locale used by the providers.

The schema_blueprint is then passed to the StatsSchema to define the generator.

Example mimesis_stats schema:

>>> from mimesis_stats.stats_schema import StatsField, StatsSchema
>>> from numpy.random import pareto
>>> field = StatsField(seed=42)
>>> schema_blueprint = lambda: {
...     "name": field("person.full_name"),
...     "salary": field("generic_distribution", func=pareto, a=3)
... }
>>> schema = StatsSchema(schema=schema_blueprint)
>>> schema.create(iterations=1)
[{'name': 'Annika Reilly', 'salary': 0.16932036645405568}]
>>> schema.create(iterations=2)
[{'name': 'Hank Day', 'salary': 1.7274682836709054},
{'name': 'Crystle Osborn', 'salary': 0.5510238033601347}]

Working with pandas

Standard use of the package will be with a dataframe.

The code snippets below outline the suggested approach for generating a dataframe of random data, such as survey responses.

Consider the following basic survey. We collect the following information: a response ID, an email address, an occupation, whether the respondent is a parent, and a 0-10 rating of how important they consider school.

The # fmt: off/on lines stop the black formatter changing the schema blueprint.

import pandas as pd
from mimesis_stats.stats_schema import StatsField, StatsSchema
from scipy.stats import truncnorm

# Define parameters of truncated normal
lower = 0
upper = 10
mu_true = 7
mu_false = 4
sigma = 2.5

field = StatsField(seed=42)

# fmt: off
schema_blueprint = lambda: {
    "ID": field("random.custom_code", mask='SCHL#####', digit="#"),
    "email": field("person.email"),
    "occupation": field("person.occupation"),
    "parent_school_importance": field(
        "dependent_variables",
        variable_names=["parent", "school_importance"],
        options=[
            (True, round(truncnorm.rvs(a=(lower-mu_true)/sigma, b=(upper-mu_true)/sigma,
                                    loc=mu_true, scale=sigma))),
            (False, round(truncnorm.rvs(a=(lower-mu_false)/sigma, b=(upper-mu_false)/sigma,
                                    loc=mu_false, scale=sigma)))
        ],
        weights=[0.3, 0.7],
    )
}
# fmt: on

schema = StatsSchema(schema_blueprint)

df = pd.DataFrame(schema.create(iterations=1000))
print(df.head())

Output:

          ID                       email           occupation  parent  school_importance
0  SCHL60227   pyoses1812@protonmail.com             Milklady   False                  8
1  SCHL68040        dreep1871@yandex.com        Choreographer    True                  7
2  SCHL25016  killing1844@protonmail.com            Scientist   False                  7
3  SCHL52580         brach1847@gmail.com  Leaflet Distributor   False                  0
4  SCHL86319     cyrenaic1813@yandex.com         Yacht Master    True                  9

# Check the summary stats of the two distributions
# (remember mean of sample != mean of generation parameter due to truncation)
# Select the school_importance column explicitly so the string columns are not aggregated
parent_breakdown = df.groupby("parent")[["school_importance"]].agg(["min", "max", "median", "mean"])

print(parent_breakdown)

Output:

       school_importance
                     min max median      mean
parent
False                  0  10      4  4.219477
True                   0  10      7  6.432692