sodascience / metasyn

Transparent and privacy-friendly synthetic data generation
https://metasyn.readthedocs.io
MIT License
metadata open-data privacy synthetic-data


Project Status: Active – The project has reached a stable, usable state and is being actively developed.


Generate synthetic tabular data in a transparent, understandable, and privacy-friendly way. Metasyn makes it possible for owners of sensitive data to create test data, do open science, improve code reproducibility, encourage data reuse, and enhance accessibility of their datasets, without worrying about leaking private information.

With metasyn you can fit a model to an existing dataframe, save it to a transparent and auditable .json file, and synthesize a dataframe that looks a lot like the real one. In contrast to most other synthetic data software, we make the explicit choice to strictly limit the statistical information in our model in order to adhere to the highest privacy standards.
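The fit, save, and synthesize workflow described above can be illustrated with a toy sketch that uses only the Python standard library. This is not metasyn's implementation or its file format. It assumes, for illustration only, that a model stores nothing more than a simple per-column summary. The names `fit_toy_model` and `synthesize_toy`, and the JSON layout, are hypothetical.

```python
import json
import random
import statistics


def fit_toy_model(rows):
    """Summarize each column on its own (hypothetical helper, not metasyn's API)."""
    model = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep only the mean and standard deviation.
            model[name] = {
                "type": "normal",
                "mean": statistics.mean(values),
                "sd": statistics.stdev(values),
            }
        else:
            # Other columns: keep only the labels and their relative frequencies.
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            model[name] = {
                "type": "categorical",
                "labels": list(counts),
                "probs": [c / len(values) for c in counts.values()],
            }
    return model


def synthesize_toy(model, n, seed=None):
    """Draw n rows, sampling every column independently from its summary."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in model.items():
            if spec["type"] == "normal":
                row[name] = rng.gauss(spec["mean"], spec["sd"])
            else:
                row[name] = rng.choices(spec["labels"], weights=spec["probs"])[0]
        rows.append(row)
    return rows


# Made-up example data, standing in for a sensitive dataset.
real_rows = [
    {"fruit": "apple", "age": 31},
    {"fruit": "banana", "age": 45},
    {"fruit": "apple", "age": 27},
    {"fruit": "pear", "age": 52},
]

model = fit_toy_model(real_rows)

# The whole model is plain JSON, so it can be read and audited before sharing.
model_json = json.dumps(model, indent=2)

# Synthetic rows are generated from the JSON summary alone, not from real_rows.
synthetic_rows = synthesize_toy(json.loads(model_json), 5, seed=1)
```

In this sketch, synthesis needs only the JSON text. The data owner can therefore inspect exactly what statistical information leaves their environment.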

Highlights

Installation

Metasyn can be installed directly from PyPI using the following command in the terminal:

pip install metasyn

The latest (possibly unstable) development version can be installed directly from GitHub like so:

pip install git+https://github.com/sodascience/metasyn

Usage


To generate synthetic data, metasyn first fits a MetaFrame to the data. The MetaFrame can then be used to produce new synthetic rows:

Example input and output

The image above corresponds closely to the following Python code:

import polars as pl
from metasyn import MetaFrame, demo_file

# Get the csv file path for built-in demo dataset
csv_path = demo_file("fruit")

# Create a polars dataframe from the csv file.
# It is important to ensure the data types are correct
# when creating your dataframe, especially categorical data!
df = pl.read_csv(csv_path, schema_overrides={
    "fruits": pl.Categorical,
    "cars": pl.Categorical,
})

# Create a MetaFrame from the DataFrame.
mf = MetaFrame.fit_dataframe(df)

# Generate a new DataFrame with 5 rows from the MetaFrame.
df_synth = mf.synthesize(5)

# This DataFrame can be exported to csv, parquet, excel and more.
df_synth.write_csv("output.csv")

To explore more options and try this out online, take a look at our interactive tutorial.

For more information on how to create dataframes with polars, refer to the Polars documentation. Metasyn also works with pandas dataframes!

Where to go next

Contributing

Metasyn is an open-source project, and we welcome contributions from the community, from bug reports & feature requests to code contributions. Read our contributing guidelines for more information and to get started!

Contact

Metasyn is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? Create an issue in the issue tracker or feel free to contact Erik-Jan van Kesteren or Raoul Schram.
