Generate synthetic tabular data in a transparent, understandable, and privacy-friendly way. Metasyn makes it possible for owners of sensitive data to create test data, do open science, improve code reproducibility, encourage data reuse, and enhance accessibility of their datasets, without worrying about leaking private information.
With metasyn you can fit a model to an existing dataframe, save it to a transparent and auditable .json file, and synthesize a dataframe that looks a lot like the real one. In contrast to most other synthetic data software, we make the explicit choice to strictly limit the statistical information in our model in order to adhere to the highest privacy standards.
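To make the fit, save, and synthesize idea concrete, here is a minimal standard-library sketch. It is not metasyn's implementation or its file format: the function names `fit_columns` and `synthesize`, the JSON layout, and the choice of a normal distribution for numeric columns are all assumptions made for illustration. It only shows how a model that stores per-column summaries (and nothing about relationships between columns) can be serialized to readable JSON and used to draw new rows.

```python
import json
import random
import statistics


def fit_columns(rows):
    """Fit one independent summary per column (hypothetical helper).

    No cross-column information is stored, which is the sense in which
    the statistical information in the model is limited.
    """
    model = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep only the mean and standard deviation.
            model[name] = {
                "type": "normal",
                "mean": statistics.mean(values),
                "sd": statistics.stdev(values),
            }
        else:
            # Categorical column: keep only the category frequencies.
            counts = {}
            for v in values:
                counts[v] = counts.get(v, 0) + 1
            model[name] = {
                "type": "categorical",
                "labels": list(counts),
                "probs": [c / len(values) for c in counts.values()],
            }
    return model


def synthesize(model, n, seed=None):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in model.items():
            if spec["type"] == "normal":
                row[name] = rng.gauss(spec["mean"], spec["sd"])
            else:
                row[name] = rng.choices(spec["labels"], weights=spec["probs"])[0]
        rows.append(row)
    return rows


# Made-up example data, not the metasyn demo dataset.
real_rows = [
    {"fruit": "apple", "weight": 150.0},
    {"fruit": "banana", "weight": 120.0},
    {"fruit": "apple", "weight": 160.0},
    {"fruit": "cherry", "weight": 8.0},
]

model = fit_columns(real_rows)

# The model is plain JSON, so it can be read and audited before sharing.
model_json = json.dumps(model, indent=2)

# Synthesis needs only the JSON model, never the original rows.
synthetic_rows = synthesize(json.loads(model_json), n=5, seed=1)
```

Because the synthesis step reads only the JSON text, a data owner could inspect exactly what would be shared before releasing it. The actual metasyn model is richer than this toy, but it rests on the same idea of a human-readable model file.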
Metasyn can be installed directly from PyPI using the following command in the terminal:
pip install metasyn
The latest (possibly unstable) development version can be installed directly from GitHub like so:
pip install git+https://github.com/sodascience/metasyn
To generate synthetic data, metasyn first fits a MetaFrame to the data, which can then be used to produce new synthetic rows. In Python, this workflow looks as follows:
import polars as pl
from metasyn import MetaFrame, demo_file
# Get the csv file path for built-in demo dataset
csv_path = demo_file("fruit")
# Create a polars dataframe from the csv file.
# It is important to ensure the data types are correct
# when creating your dataframe, especially categorical data!
df = pl.read_csv(csv_path, schema_overrides={
    "fruits": pl.Categorical,
    "cars": pl.Categorical,
})
# Create a MetaFrame from the DataFrame.
mf = MetaFrame.fit_dataframe(df)
# Generate a new DataFrame with 5 rows from the MetaFrame.
df_synth = mf.synthesize(5)
# This DataFrame can be exported to csv, parquet, excel and more.
df_synth.write_csv("output.csv")
To explore more options and try this out online, take a look at our interactive tutorial.
For more information on how to create dataframes with polars, refer to the Polars documentation. Metasyn also works with pandas dataframes!
Metasyn is an open-source project, and we welcome contributions from the community, from bug reports & feature requests to code contributions. Read our contributing guidelines for more information and to get started!
Metasyn is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? Create an issue in the issue tracker or feel free to contact Erik-Jan van Kesteren or Raoul Schram.