soma-smart / Fakelake

Generate massive fake datasets for your datalake, fast. By SOMA
https://soma-smart.github.io/Fakelake/
MIT License

FakeLake


Table of Contents
  1. About The Project
  2. Installation
  3. Usage
  4. Contributing
  5. License

# What is FakeLake?

FakeLake is a command-line tool that generates fake data from a YAML schema. It can generate millions of rows in seconds and is an order of magnitude faster than popular Python generators (see the benchmark).

FakeLake is actively developed and maintained by [SOMA](https://www.linkedin.com/company/soma-smart/mycompany/) in Paris 🦊.

```mermaid
flowchart TD
    subgraph Z["How it works"]
    direction LR
      Y[YAML file description] --> F
      F[FakeLake] --> O[Output file in CSV, Parquet, ...]
    end
```

Any feedback is welcome!

## Features

- Very fast
- Easy to use
- Small memory footprint
- Small binary size
- Robust / no unsafe code
- No dependencies
- Cross-platform (Windows, Linux, Mac OS X)
- MIT license

## Built with

- Rust

## Benchmark

Benchmark of FakeLake, Mimesis and Faker:
- Goal: generate 1 million rows with one column, a random string of length 10
- Specs: Windows, AMD Ryzen 5 7530U, 8 GB RAM, SSD

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `fakelake generate bench\fakelake_input.yaml` | 252.8 ± 3.3 | 249.0 | 260.0 | 1.00 |
| `python bench\mimesis_bench.py` | 3374.9 ± 21.3 | 3353.0 | 3426.2 | 13.35 ± 0.19 |
| `python bench\faker_bench.py` | 13552.7 ± 340.5 | 13336.4 | 14446.4 | 53.62 ± 1.52 |

Run the benchmark yourself with `scripts/benchmark.sh`.

# Installation

## Simple way: with precompiled binaries

Download the latest release from [here](https://github.com/soma-smart/Fakelake/releases), then extract it (use the name of the archive you downloaded):

```bash
$ tar -xvf Fakelake__.tar.gz
$ ./fakelake --help
```

## From source

```bash
$ git clone https://github.com/soma-smart/Fakelake.git
$ cd Fakelake
$ cargo build --release
$ ./target/release/fakelake --help
```

# How to use it

Generate from one or multiple files:

```bash
$ fakelake generate tests/parquet_all_options.yaml
$ fakelake generate tests/parquet_all_options.yaml tests/csv_all_options.yaml
```
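Each path passed to `generate` is a YAML schema file. As a minimal sketch, the script below writes a small two-column schema that targets CSV output and runs the generator on it if the binary is available. The file name `minimal_csv.yaml`, the column names, and the row count are illustrative and not part of the repository. The value `csv` for `output_format` is an assumed spelling, by analogy with `parquet`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Write a small schema; file name, column names and row count are illustrative.
cat > minimal_csv.yaml <<'EOF'
columns:
  - name: id
    provider: Increment.integer
    start: 1

  - name: code
    provider: Random.String.alphanumeric

info:
  output_name: minimal_csv
  output_format: csv   # assumed spelling, by analogy with "parquet"
  rows: 1000
EOF

# Run the generator only if the binary is on PATH.
if command -v fakelake >/dev/null 2>&1; then
  fakelake generate minimal_csv.yaml
else
  echo "fakelake not found on PATH; wrote minimal_csv.yaml only"
fi
```

Starting from a small schema like this makes it easy to check the output before raising `rows` to millions.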
The configuration file contains a list of columns, each with a provider (which defines the column's behavior) and some options. An `info` structure defines the output.

```yaml
columns:
  - name: id
    provider: Increment.integer
    start: 42
    presence: 0.8

  - name: company_email
    provider: Person.email
    domain: soma-smart.com

  - name: created
    provider: Random.Date.date
    format: "%Y-%m-%d"
    after: 2000-02-15
    before: 2020-07-17

  - name: name
    provider: Random.String.alphanumeric

info:
  output_name: all_options
  output_format: parquet
  rows: 1_234_567
```

## Providers

A provider name follows the rule `Category.<optional sub-category>.provider`.
A few examples:

- Person.email
- Increment.integer
- Random.String.alphanumeric

## Options

There are two types of options:

- Options linked to the provider (such as the date bounds and `format`)
- Options linked to the column (such as `presence`)

## Generation Details

The `info` structure has three optional fields:

- `output_name`: the location and name of the output
- `output_format`: the generated format (Parquet and CSV are supported for now)
- `rows`: the number of rows to generate

# Contributing

Contributions are welcome! Feel free to submit pull requests.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

# License

Distributed under the MIT License. See `LICENSE.txt` for more information.