joke2k / faker

Faker is a Python package that generates fake data for you.
https://faker.readthedocs.io
MIT License
17.76k stars 1.93k forks source link

`DynamicProvider` does not take weights into account. #1892

Closed pauldechorgnat closed 11 months ago

pauldechorgnat commented 1 year ago

DynamicProvider does not take weights into account when we provide an OrderedDict.

Steps to reproduce

from faker import Faker
from faker.providers import DynamicProvider
from typing import OrderedDict
from collections import Counter

elements = ["a", "b", "c", "d", "e"]
probabilities = [.5, .2, .3, .0, .0]

weighted_elements = OrderedDict(zip(elements, probabilities))

provider = DynamicProvider(
    elements=weighted_elements,
    provider_name="nutriscore"
)

f = Faker()
f.add_provider(provider)

results = [f.nutriscore() for _ in range(100_000)]

Counter(results)

This will return something like:

Counter({'d': 20128, 'b': 20071, 'a': 20042, 'e': 19938, 'c': 19821})

Expected behavior

By providing an OrderedDict, we should be able to get data sampled using the weights of the OrderedDict

Actual behavior

Actually, the DynamicProvider does not take weights into account: the above example shows an equal distribution between the elements.

Proposed fix

To allow weighted DynamicProvider, we only need to change the definition of the get_random_value method:

class DynamicProvider(BaseProvider):
    # ...
    def get_random_value(self) -> Any:
        if not self.elements or len(self.elements) == 0:
            raise ValueError("Elements should be a list of values the provider samples from")

        return self.random_elements(self.elements, length=1, use_weighting=True)[0]

It could be interesting to let the user choose if he wants to use the weights or not, even though he gave an OrderedDict. There are two ways to design this:

  1. use an instance variable use_weighting defined in the constructor
  2. use a parameter in the get_random_value

The first option would give us something like:

class DynamicProvider(BaseProvider):
    def __init__(
        self,
        provider_name: str,
        elements: Optional[List] = None,
        generator: Optional[Any] = None,
        use_weighting: bool = False,
    ):
        """..."""
        self.use_weighting = use_weigthing
        # ...

    def get_random_value(self) -> Any:
        if not self.elements or len(self.elements) == 0:
            raise ValueError("Elements should be a list of values the provider samples from")

        return self.random_elements(self.elements, length=1, use_weighting=self.use_weighting)[0]

The second option would give us:

class DynamicProvider(BaseProvider):
    # ...
    def get_random_value(self, use_weighting: bool = True) -> Any:
        if not self.elements or len(self.elements) == 0:
            raise ValueError("Elements should be a list of values the provider samples from")

        return self.random_elements(self.elements, length=1, use_weighting=use_weighting)[0]

The second option is a bit more flexible for the end user. It can also be a combination of the two (define an instance attribute and default the method parameter to None. If a boolean is specified in the method, it takes over the instance attribute).

Let me know if you are interested in the feature and I'll submit a PR.

Thanks for the great work !

fcurella commented 1 year ago

Thank you for looking into this! I'd go with the second solution, seems more straightforward to just pass the argument down.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 11 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.