lk-geimfari / mimesis

Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
https://mimesis.name
MIT License
4.39k stars, 330 forks

Generating data by schema. #199

Closed lk-geimfari closed 6 years ago

lk-geimfari commented 7 years ago

We have implemented a very primitive schema-based generator:

from mimesis.schema import Schema
from mimesis import constants as c

schema = Schema(locale=c.EN)

# We will use format 'provider.method'
# Param 'schema' should be dict or path to json file.
result = schema.load(schema={
    'username': 'personal.username',
    'password': 'personal.password',
    'full_name': 'personal.full_name',
}).create()

But this is not the best solution. @sobolevn suggested a much better one using lazy objects, which looks like this:

>>> from custom.utils import format_phone
>>> from mimesis.schema import Schema, fields
>>> schema = Schema('en')

>>> schema.load(schema={
...     "id": fields.cryptographic.uuid(version=4),
...     "name":  fields.personal.full_name(gender='female'),
...     "version": fields.development.version(semantic=True),
...     "phone": format_phone(fields.personal.phone),
... }).create(iterations=2)

I like the second solution too.

lk-geimfari commented 7 years ago

What do you think? Could it be useful?

lk-geimfari commented 7 years ago

I have already implemented the basic functionality. Only a little remains to finish.

sobolevn commented 7 years ago

Looks good! But how would you pass params to 'personal.password' for example?

valerievich commented 7 years ago

One more rather nice way to generate data, I think.

lk-geimfari commented 7 years ago

@sobolevn That is a real problem. I would suggest personal.surname.female, where .female is the first argument, but it does not look good if a method has many arguments. We'll wait for proposals.
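To make the trade-off concrete, here is a minimal sketch of how a dotted string such as personal.surname.female could be resolved. The Personal class, the PROVIDERS mapping, and the resolve function are hypothetical stand-ins written for illustration; they are not the mimesis implementation.

```python
# Hypothetical stand-in for a provider; NOT the real mimesis class.
class Personal:
    def surname(self, gender='male'):
        return {'male': 'Ivanov', 'female': 'Ivanova'}[gender]

    def username(self):
        return 'user_1'


# Hypothetical registry mapping the first segment to a provider instance.
PROVIDERS = {'personal': Personal()}


def resolve(spec):
    """Split 'provider.method.arg1.arg2' and call method(arg1, arg2)."""
    provider_name, method_name, *args = spec.split('.')
    method = getattr(PROVIDERS[provider_name], method_name)
    # Every trailing segment becomes a positional string argument.
    return method(*args)


print(resolve('personal.username'))
print(resolve('personal.surname.female'))
```

The sketch also shows the limits of the dotted form: arguments are always strings, they can only be passed positionally, and a value that itself contains a dot cannot be expressed.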

sobolevn commented 7 years ago

I think that arguments are really important. They should be both:

  1. Easy to change
  2. Easy to write

Consider this situation: I want to have users with different ages https://github.com/lk-geimfari/mimesis/blob/master/mimesis/providers/personal.py#L29

If a user's age is under 18, they are not allowed; otherwise they are allowed. I don't want to create two schemas for that, since users contain a lot of fields, and I don't want to copy-paste them.

Solution: I can create a factory function.

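from mimesis.schema import Schema

def generate_schema_for_age(age):
    # Field definitions shared by every age group.
    fields = {
        'username': 'personal.username',
        'password': 'personal.password',
        'full_name': 'personal.full_name',
        # Literal value; assumes the loader passes non-string values through.
        'age': age,
    }

    # Load the definitions into a Schema instead of calling .create() on the dict.
    return Schema('en').load(schema=fields).create()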

But is it the way we want to go?

lk-geimfari commented 7 years ago

I absolutely agree, but I cannot estimate the complexity of the implementation right away.

lk-geimfari commented 7 years ago

@Valerievich It's done. We still need to implement support for arguments.

sobolevn commented 7 years ago

I don't like the current implementation. It breaks one major rule: "everything is an object". Our fields right now are not objects in the general case. They are strings.

So it could break a lot of things for the end user. Imagine a user has some logic to reformat phone numbers to their specific needs, like reformat_phone(value). How is that possible with the current implementation? The same goes for any other functions, classes, and so on that wrap values.

What do I suggest?

Lazy objects (or generator)

In my opinion, we should create a LazyField wrapper to wrap any existing field, and a special fields container with all the existing fields wrapped in LazyField. So, how would it work?

>>> from custom.utils import format_phone
>>> from mimesis.schema import Schema, fields
>>> schema = Schema('en')

>>> schema.load(schema={
...     "id": fields.cryptographic.uuid(version=4),
...     "name":  fields.personal.full_name(gender='female'),
...     "version": fields.development.version(semantic=True),
...     "phone": format_phone(fields.personal.phone),
... }).create(iterations=2)

On each iteration, the lazy objects (or generators) generate a new value. The user has full control, and the code is more Pythonic.
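A minimal sketch of the idea, using only the standard library. The LazyField class, its evaluate and map methods, the create function, and the random_digits generator are all hypothetical names chosen for illustration; none of them is an existing mimesis API.

```python
import random


class LazyField:
    """Store a callable and its arguments; call it only when evaluated."""

    def __init__(self, func, *args, **kwargs):
        self._func = func
        self._args = args
        self._kwargs = kwargs

    def evaluate(self):
        # A fresh value is produced on every call.
        return self._func(*self._args, **self._kwargs)

    def map(self, transform):
        # Wrap this field in another lazy field that post-processes its value.
        return LazyField(lambda: transform(self.evaluate()))


def random_digits(length=10):
    """Stand-in generator used instead of a real provider method."""
    return ''.join(random.choice('0123456789') for _ in range(length))


def format_phone(field):
    """User-defined wrapper: takes a lazy field, returns a lazy field."""
    return field.map(lambda digits: f'+{digits}')


def create(schema, iterations=1):
    """Evaluate every lazy field once per iteration."""
    return [
        {key: field.evaluate() for key, field in schema.items()}
        for _ in range(iterations)
    ]


schema = {
    'id': LazyField(random.randint, 1, 10**9),
    'phone': format_phone(LazyField(random_digits, length=7)),
}
rows = create(schema, iterations=2)
print(rows)
```

Because user code receives and returns objects rather than strings, a wrapper such as format_phone composes naturally with any field.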

Considerations

Do you have any ideas? Am I missing something?

lk-geimfari commented 7 years ago

@sobolevn Of course it looks much better than the current implementation. I have only one question: how can we generate data from schema.json? Or maybe it doesn't matter? Anyway, I really like the idea with fields. I'm all for it.

lk-geimfari commented 7 years ago

@sobolevn Can you please explain how to implement a LazyField-based data generator? I mean the steps. Maybe you have a link to something similar? I want to try to implement it and close this issue this week. Thank you.

sobolevn commented 7 years ago

I came up with an even better idea: why not implement a custom provider for factory_boy? It already has all the stuff we need!

lk-geimfari commented 7 years ago

@sobolevn I have never worked with this library, but I'll try to figure out how to do it.

lk-geimfari commented 7 years ago

Unfortunately, I did not understand how to add the ability to pass arguments. @sobolevn Can you look at this issue when you have free time, please?

sobolevn commented 7 years ago

Sure!

samuarl commented 6 years ago

The distribution of random values in schema generation seems off. Even after many iterations, the resulting data usually contains only a handful of unique values per field.

from mimesis.schema import Schema
import pandas as pd

schema = Schema('en')

data = schema.load(schema={
    "Name": "personal.name",
    "Surname": "personal.surname",
    "Username": "personal.username"
}).create(iterations=10000)

# Value counts and markdown tables
df = pd.DataFrame(data)

# DataFrame.iteritems() and Series.iteritems() were removed in pandas 2.0; .items() is the replacement.
for name, series in df.items():
    theader = f'{name}|Count'
    trow = ''
    for value, count in series.value_counts().items():
        trow += f'{value}|{count}\n'
    print(f'{theader}\n-|-\n{trow}')

10000 Iterations

| Name | Count |
|-|-|
| Wynell | 6456 |
| Nakita | 1436 |
| Spring | 1218 |
| Yuko | 890 |

| Surname | Count |
|-|-|
| Shannon | 6435 |
| Harrell | 1447 |
| Tyler | 1212 |
| Vincent | 906 |

| Username | Count |
|-|-|
| Vaultier_1956 | 6445 |
| Compatriot_2055 | 1452 |
| Dervish.1941 | 1205 |
| edsel.1825 | 898 |

lk-geimfari commented 6 years ago

@samuarl That's strange, because seeding is disabled by default.

lk-geimfari commented 6 years ago

@samuarl I have run this script on my laptop and everything is okay.
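For anyone who wants to repeat the check without pandas, here is a small standard-library sketch that counts distinct values per field. It assumes the generated data is a list of dicts (as the DataFrame call in the report suggests); the unique_counts helper and the sample rows are hypothetical stand-ins, not real mimesis output.

```python
import random
from collections import Counter


def unique_counts(rows):
    """Return {field: Counter(value -> occurrences)} for a list of dicts."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts.setdefault(field, Counter())[value] += 1
    return counts


# Stand-in data for illustration; replace with the list returned by create().
rng = random.Random(0)
rows = [{'Name': rng.choice(['Ann', 'Bob', 'Cy'])} for _ in range(1000)]

for field, counter in unique_counts(rows).items():
    # Number of distinct values, then the three most frequent ones.
    print(field, len(counter), counter.most_common(3))
```

A field with only a handful of distinct values over thousands of iterations would reproduce the behaviour reported above.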

lk-geimfari commented 6 years ago

Implemented in 53c8741930fe0c79e605d53817e4ef4bcead0766.