sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Problems with categorical columns and HMASynthesizer #1651

Closed petterlindgren closed 11 months ago

petterlindgren commented 1 year ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

Problem description

I'm running multi-table data with HMASynthesizer: two tables of shape (128986, 18) and (131070, 15), respectively. After fitting the data, the results are very poor, with an Overall Quality Score of 33.6%. For example, one categorical column in the synthetic data consists 99.7% of a single category, even though that category is the least frequent one (0.0061%) in the real data. I use the default distribution 'beta' and UniformEncoder as the transformer for that variable.
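A mismatch like the one described for this column can be checked directly by comparing per-category frequencies in the real and synthetic data. The following is a minimal sketch using only the standard library; the function names and the toy values are made up for illustration and are not part of SDV. Any iterable of labels should work, including a pandas column.

```python
from collections import Counter


def category_frequencies(values):
    """Return {category: relative frequency} for an iterable of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def frequency_gaps(real_values, synthetic_values):
    """Return (category, real_freq, synthetic_freq) tuples, largest gap first."""
    real = category_frequencies(real_values)
    synth = category_frequencies(synthetic_values)
    categories = set(real) | set(synth)
    rows = [(c, real.get(c, 0.0), synth.get(c, 0.0)) for c in categories]
    return sorted(rows, key=lambda row: abs(row[1] - row[2]), reverse=True)


# Toy data (invented): category "c" is rare in the real column
# but dominates the synthetic one.
real_var6 = ["a"] * 60 + ["b"] * 39 + ["c"] * 1
synthetic_var6 = ["c"] * 97 + ["a"] * 2 + ["b"] * 1

for category, real_f, synth_f in frequency_gaps(real_var6, synthetic_var6):
    print(f"{category}: real={real_f:.2%} synthetic={synth_f:.2%}")
```

The category with the largest gap is listed first, which makes it easy to spot columns where a rare category has taken over.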

What I already tried

I've tried different metadata, but the problem stays the same. I've tried to do the synthetic generation for only a small part of original data and then it performs slightly better. For a GAN you can run the model with more epochs to see if it performs better, but how do you do the equivalent for GaussianCopula?

synthesizer_adj.fit(multi_table_data)

npatki commented 1 year ago

Hi @petterlindgren, thanks for filing this issue. That's certainly unexpected. We can help you debug and optimize if you're able to provide a bit more information about where you're noticing the issue.

For example, is it more prevalent in columns of the child table vs. the parent? How many child rows does each parent row have? If you're able to share the metadata of one of these datasets, that would be great.

I've tried to do the synthetic generation for only a small part of original data and then it performs slightly better.

When you say "only a small part," I'm curious whether you tried a single table? Or did you try on a smaller subsample of both tables (parent and child)?

petterlindgren commented 1 year ago

Hi Neha,

Thanks for reaching out so quickly. The column I refer to (var6) is only present in the child table (address) so it is much more prevalent there:) Most rows from the parent table (owners) have only one row in the address table. But some have more than one.

The metadata for the address table is:

    "address": {
        "columns": {
            "var1": {
                "sdtype": "id"
            },
            "var2": {
                "sdtype": "categorical"
            },
            "var3": {
                "sdtype": "name",
                "pii": true
            },
            "var4": {
                "sdtype": "numerical"
            },
            "var5": {
                "sdtype": "address",
                "pii": true
            },
            "var6": {
                "sdtype": "categorical"
            },
            "var7": {
                "sdtype": "categorical"
            },
            "var8": {
                "sdtype": "address",
                "pii": true
            },
            "var9": {
                "sdtype": "address",
                "pii": true
            },
            "var10": {
                "sdtype": "address",
                "pii": true
            },
            "var11": {
                "sdtype": "address",
                "pii": true
            },
            "owner_ID": {
                "sdtype": "id"
            },
            "var13": {
                "sdtype": "categorical"
            },
            "var14": {
                "sdtype": "categorical"
            },
            "var15": {
                "sdtype": "datetime",
                "datetime_format": "%Y-%m-%d %H:%M:%S"
            },
            "var16": {
                "sdtype": "categorical"
            },
            "var17": {
                "sdtype": "datetime",
                "datetime_format": "%Y-%m-%d %H:%M:%S"
            },
            "var18": {
                "sdtype": "categorical"
            }
        },
        "primary_key": "var1"
    }

and relationship:

    "relationships": [
        {
            "parent_table_name": "owners",
            "child_table_name": "address",
            "parent_primary_key": "Owner_ID",
            "child_foreign_key": "Owner_ID"
        }
    ],

As for "only a small part": I did it for both tables with the relationship intact (the first 100 owners and the rows from the address table that correspond to those owners). I also ran a GAN on single tables and it works fine.

npatki commented 1 year ago

Hi @petterlindgren, thanks for sharing the metadata.

Most rows from the parent table (owners) have only one row in the address table.

I think this is the problem. The HMA is suited for datasets where a parent has a lot of children. This is because the algorithm is attempting to learn the distribution of children for each parent row. So if there is only 1 child, then the distribution might even be undefined (you can't really compute a distribution based on only 1 data point).
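Whether a dataset falls into this category can be checked by counting child rows per parent. The sketch below uses only the standard library; the function names and toy IDs are invented for illustration, and the inputs are simply the parent primary keys and the child foreign keys.

```python
from collections import Counter


def children_per_parent(parent_ids, child_foreign_keys):
    """Map each parent id to its number of child rows (0 if it has none)."""
    counts = Counter(child_foreign_keys)
    return {pid: counts.get(pid, 0) for pid in parent_ids}


def child_count_histogram(parent_ids, child_foreign_keys):
    """Map number-of-children to the number of parents with that many children."""
    per_parent = children_per_parent(parent_ids, child_foreign_keys)
    return dict(sorted(Counter(per_parent.values()).items()))


# Toy data (invented): most owners have exactly one address row.
owner_ids = [1, 2, 3, 4, 5]
address_owner_ids = [1, 2, 3, 3, 4]

histogram = child_count_histogram(owner_ids, address_owner_ids)
print(histogram)  # {0: 1, 1: 3, 2: 1}

# Share of parents with at most one child.
share_sparse = sum(n for k, n in histogram.items() if k <= 1) / len(owner_ids)
print(f"{share_sparse:.0%} of parents have at most one child")
```

If the histogram is concentrated at 0 and 1, most parents contribute at most a single data point, which is the situation described above.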

I'm curious what the data represents and what your overall project is for? Some ideas:

npatki commented 11 months ago

Hi @petterlindgren, are you still working on the project? The issue has been inactive for some time so I am closing it off.

Since the HMA Synthesizer is designed to understand the correlations between parent and child tables, it is suited for data that has many children (per parent row). Do let us know if you were able to solve your problem with denormalization or if you're running into other concerns. We can always reopen the issue to investigate.

Additionally, I have created a new feature request in #1688 to track any overall changes for HMA to be able to accommodate such datasets.
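For readers unfamiliar with the denormalization idea mentioned in this comment, here is a rough sketch of the join and un-join steps on plain lists of dicts. It is an illustration under stated assumptions, not SDV functionality: the helper names and the `segment` column are hypothetical, the synthesis step is omitted, parents without children are dropped by the join, and when un-joining, the first parent values seen for each key are kept.

```python
def denormalize(parents, children, key):
    """Join each child row with its parent's columns into one flat row."""
    parents_by_key = {parent[key]: parent for parent in parents}
    return [{**parents_by_key[child[key]], **child} for child in children]


def split_back(flat_rows, key, parent_columns, child_columns):
    """Un-join a flat table into a parent table and a child table."""
    parents = {}
    children = []
    for row in flat_rows:
        # Assumption: keep the first parent values seen for each key.
        parents.setdefault(row[key], {col: row[col] for col in parent_columns})
        children.append({col: row[col] for col in child_columns})
    return list(parents.values()), children


# Toy data (invented); "segment" is a hypothetical parent column.
owners = [
    {"Owner_ID": 1, "segment": "x"},
    {"Owner_ID": 2, "segment": "y"},
]
addresses = [
    {"var1": 10, "Owner_ID": 1, "var6": "a"},
    {"var1": 11, "Owner_ID": 1, "var6": "b"},
    {"var1": 12, "Owner_ID": 2, "var6": "a"},
]

flat = denormalize(owners, addresses, "Owner_ID")
# A single-table synthesizer would be fit on `flat` and sampled here;
# that step is left out, so the round trip returns the original tables.
new_owners, new_addresses = split_back(
    flat, "Owner_ID", ["Owner_ID", "segment"], ["var1", "Owner_ID", "var6"]
)
```

With real synthetic output, rows sharing a key may disagree on the parent columns, so the un-join rule would need to be chosen deliberately.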

petterlindgren commented 11 months ago

Hi Neha,

We tried two options:

  1. Join all tables and synthesise the joined tables. Then un-join to get the synthesised tables in the original format.
  2. Synthesise all single-tables and fix the relation manually afterwards according to statistics from original tables.

In our case we were not really interested in the correlation of values between tables, and therefore we went for option 2. That way we could control the distribution of the number of children per parent.
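One possible way to fix the relation afterwards, as in option 2, is to draw a child count for each synthetic parent from the counts observed in the real data and then hand out foreign keys accordingly. This is a sketch of that idea only, not necessarily what was done in this project; all names and toy values are invented, and the single-table synthesis itself is assumed to have happened elsewhere.

```python
import random
from collections import Counter


def sample_child_counts(real_parent_ids, real_foreign_keys, n_parents, rng):
    """Draw one child count per synthetic parent from the real empirical counts."""
    per_parent = Counter(real_foreign_keys)
    observed = [per_parent.get(pid, 0) for pid in real_parent_ids]
    return [rng.choice(observed) for _ in range(n_parents)]


def assign_foreign_keys(synthetic_parent_ids, child_counts):
    """Repeat each synthetic parent id as many times as its drawn child count."""
    keys = []
    for pid, count in zip(synthetic_parent_ids, child_counts):
        keys.extend([pid] * count)
    return keys


rng = random.Random(0)  # seeded for reproducibility

# Toy data (invented).
real_owner_ids = [1, 2, 3, 4, 5]
real_address_owner_ids = [1, 2, 3, 3, 4]
synthetic_owner_ids = ["o1", "o2", "o3", "o4"]

counts = sample_child_counts(
    real_owner_ids, real_address_owner_ids, len(synthetic_owner_ids), rng
)
foreign_keys = assign_foreign_keys(synthetic_owner_ids, counts)

# len(foreign_keys) is the number of synthetic child rows needed; each
# independently synthesized child row would receive one of these keys.
print(len(foreign_keys), foreign_keys)
```

Because the child rows are synthesized independently of the parents, this preserves the number of children per parent but not any correlation between parent and child values.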

Thanks for your support.

/Petter

npatki commented 11 months ago

No problem. Thanks for letting us know. Always here to help if you run into any other issues.