lk-geimfari / mimesis

Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
https://mimesis.name
MIT License

Reset increment #1263

Closed rgoubet closed 1 year ago

rgoubet commented 1 year ago

Feature request

Unless I missed it, there doesn't seem to be a way to reset increments: if you generate data several times with the same schema, increments pick up where the previous generation left off:

from mimesis import Field, Schema

_ = Field()
schema = Schema(schema=lambda: {
    "id": _('increment'),
    'name': _('full_name')})

for i in range(0,5):
    data = schema.create(5)
    print(data[0]['id'])

This returns:

1
6
11
16
21

Thesis

There should be an option to reset the increment each time data is generated.

Reasoning

When creating large amounts of data to export several times, you don't necessarily want the increments to grow without bound.

lk-geimfari commented 1 year ago

Hi! Actually, there is an accumulator argument for such cases: https://mimesis.name/en/master/api.html#mimesis.Numeric.increment

Here is a usage example:

>>> numeric.increment()
1
>>> numeric.increment(accumulator="a")
1
>>> numeric.increment()
2
>>> numeric.increment(accumulator="a")
2
>>> numeric.increment(accumulator="b")
1
>>> numeric.increment(accumulator="a")
3
lk-geimfari commented 1 year ago

In your case, you are using schemas the wrong way.

Instead of doing this:

for i in range(0,5):
    data = schema.create(5)
    print(data[0]['id'])

Do this:

for i in schema.create(5):
    print(i['id'])
rgoubet commented 1 year ago

In your case, you are using schemas the wrong way.

In my code example, I'm trying to create 5 populated data sets (that I could then export 5 times) based on the same logical schema. And here, I cannot use a new accumulator each time unless I instantiate a new Schema object every time.

lk-geimfari commented 1 year ago

@rgoubet Sorry, I don't get the idea. Can you please illustrate it with an example?

rgoubet commented 1 year ago

My use case is that I want to create multiple large random data sets in Excel files (generated with openpyxl) for stress-testing purposes. So, let's say I want to create 5 files with 1 million rows each (I use 4 columns for readability, while in practice I get 30):

import os

from mimesis import Field, Schema
from openpyxl import Workbook

_ = Field()

schema = Schema(schema=lambda: {
    "id": _('increment'),
    "timestamp": _('datetime'),
    'version': _('version'),
    'e-mail': _('person.email', domains=['argenx.com']),
    'token': _('token_hex'),
})

Now, I'll run a loop for each file, and use the iterator to preserve memory:

for i in range(0, 5):
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()
    for ix, v in enumerate(schema.iterator(1_000_000)):
        if ix == 0:
            ws.append(list(v.keys()))  # write headers before the first row
        ws.append(list(v.values()))  # write data
    xl_file = os.path.join(path, f'data{str(i).zfill(3)}.xlsx')
    wb.save(xl_file)
    wb.close()

Now, it's all good, except that the id column increment continues across files instead of restarting from 1. In my case, that could have been an issue, since the value can grow larger than the data type I'm targeting allows (it turned out OK in the end).

As I said, maybe I missed something, but it would be nice to have a reset option (e.g. in the create and iterator methods) for the increments. Not critical at all, though.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity. It will be closed if no further activity occurs. Thank you for your contributions.