Table of Contents
-
About The Project
-
Installation
-
Usage
-
Contributing
-
License
# What is FakeLake ?
FakeLake is a command line tool that generates fake data from a YAML schema. It can generate millions of rows in seconds, and is order of magnitude faster than popular Python generators (
see benchmarks).
FakeLake is actively developed and maintained by [SOMA](https://www.linkedin.com/company/soma-smart/mycompany/) in Paris 🦊.
```mermaid
flowchart TD
subgraph Z["How it works"]
direction LR
Y[YAML file description] --> F
F[FakeLake] --> O[Output file in CSV, Parquet, ...]
end
```
Any feedback is welcome!
## Features
- Very fast
- Easy to use
- Small memory footprint
- Small binary size
- Robust / no unsafe code
- No dependencies
- Cross-platform (Windows, Linux, Mac OS X)
- MIT license
## Built with
## Benchmark
Benchmark of FakeLake, Mimesis and Faker:
- Goal: Generate 1 million rows with one column: random string (length 10)
- Specs: Windows, AMD Ryzen 5 7530U, 8Go RAM, SSD
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `fakelake generate bench\fakelake_input.yaml` | 252.8 ± 3.3 | 249.0 | 260.0 | 1.00 |
| `python bench\mimesis_bench.py` | 3374.9 ± 21.3 | 3353.0 | 3426.2 | 13.35 ± 0.19 |
| `python bench\faker_bench.py` | 13552.7 ± 340.5 | 13336.4 | 14446.4 | 53.62 ± 1.52 |
Build the benchmark yourself with scripts/benchmark.sh
# Installation
## Simple way : With precompiled binaries
Download the latest release from [here](https://github.com/soma-smart/Fakelake/releases)
```bash
$ tar -xvf Fakelake_
_.tar.gz
$ ./fakelake --help
```
## From source
```bash
$ git clone
$ cd fakelake
$ cargo build --release
$ ./target/release/fakelake --help
```
# How to use it
Generate from one or multiple files
```bash
$ fakelake generate tests/parquet_all_options.yaml
$ fakelake generate tests/parquet_all_options.yaml tests/csv_all_options.yaml
```
The configuration file used contains a list of columns, with a specified provider (for the column behavior), as well as some options.
There is also an info structure to define the output.
```yaml
columns:
- name: id
provider: Increment.integer
start: 42
presence: 0.8
- name: company_email
provider: Person.email
domain: soma-smart.com
- name: created
provider: Random.Date.date
format: "%Y-%m-%d"
after: 2000-02-15
before: 2020-07-17
- name: name
provider: Random.String.alphanumeric
info:
output_name: all_options
output_format: parquet
rows: 1_234_567
```
## Providers
A provider follows a naming rule as "Category.\.provider".
Few examples:
- Person.email
- Increment.integer
- Random.String.alphanumeric
## Options
There is two types of options:
- Options linked to the provider (date and format)
- Options linked to the column (% presence)
## Generation Details
There is three optional fields:
- output_name: To specify the location and name of the output
- output_format: To specify the generated format (we support Parquet and CSV for now)
- rows: To specify the number of rows to generate
# Contributing
Contributions are welcome! Feel free to submit pull requests.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
# License
Distributed under the MIT License. See `LICENSE.txt` for more information.