sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv
Other
2.3k stars 303 forks source link

A large difference between the values ​​in the child table #2142

Open shirtevet opened 1 month ago

shirtevet commented 1 month ago

Hello I create 3 synthetic tables (1 parent and 2 children) In one of the boys' tables there is a numerical column where the values ​​are between 0-1200 but almost all the values ​​are between 0-200, when I take out the score I get that the score is 0.27 and you saw that the values ​​are mainly distributed between 0-500. what can we do? Another explanation - this is a test result column and there are many types of tests that can be seen in another column. Is there a possibility that the range in the column will be according to the type of test?

In addition, I have another categorical column whose graph looks really bad (a photo is also attached to this) צילום מסך 2024-07-21 132128

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

<Replace this text with a description of what you were trying to get done. Tell us what happened, what went wrong, and what you expected to happen.>

Steps to reproduce

<Replace this text with a description of the steps that anyone can follow to reproduce the error. If the error happens only on a specific dataset, please consider attaching some example data to the issue so that others can use it to reproduce the error.>

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.
npatki commented 1 month ago

Hi @shirtevet, nice to meet you. We'd love to look into this further with you. Could you please share the metadata JSON that you are working with for this dataset?

If the data itself is able to be shared (or publicly available) that would be helpful as well. But at least with the metadata, we can ensure our investigations are matching your schema.

shirtevet commented 1 month ago
import json
metadata8 = {
    "tables": {
        "visits_data": {
            "primary_key": "PATIENTID_VISITDATE",
            "columns": {
                "PATIENTID": { "sdtype": "id", "regex_format": "\\d{2}-\\d{2}-\\d{1}[A-Z]-\\d{4}-[A-Z]-\\d{5}"},
                "PATIENTID_VISITDATE": { "sdtype": "id", "regex_format": "\\d{2}-\\d{2}-\\d{1}[A-Z]-\\d{4}-[A-Z]-\\d{5}_\\d{2}-\\d{2}-\\d{4}"},
                "VISITDATE": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" },
                "VISITTYPECODE": { "sdtype": "categorical" },
                "NOWPREGNANT": { "sdtype": "categorical" },
                "FAMILYPLANNINGID": { "sdtype": "categorical" },
                "FUNCTIONALSTATUSCODE": { "sdtype": "categorical" },
                "WHOSTAGE": { "sdtype": "categorical" },
                "TBSTATUSCODE": { "sdtype": "categorical" },
                "IPTREASONCODE": { "sdtype": "categorical" },
                "ARVSTATUSCODE": { "sdtype": "categorical" },
                "ARVREASONCODE": { "sdtype": "categorical" },
                "ARVCODE": { "sdtype": "categorical" },
                "NOTES": { "sdtype": "text" },
                "STAFFID": { "sdtype": "categorical" },
                "USERNUMBER": { "sdtype": "categorical" },
                "THETIMESTAMP": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" },
                "IMPORTFILE": { "sdtype": "categorical" },
                "FACILITY": { "sdtype": "categorical" },
                "ENTRY_DATE": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" }
            }
        },
        "test_data": {
            "primary_key": "ID",
            "columns": {
                "ID": { "sdtype": "id" },
                "PATIENTID_VISITDATE": { "sdtype": "id", "regex_format": "\\d{2}-\\d{2}-\\d{1}[A-Z]-\\d{4}-[A-Z]-\\d{5}_\\d{2}-\\d{2}-\\d{4}" },
                "PATIENTID_VISITDATE_ID": { "sdtype": "id", "regex_format": "\\d{2}-\\d{2}-\\d{1}[A-Z]-\\d{4}-[A-Z]-\\d{5}_\\d{2}-\\d{2}-\\d{4}_\\d+" },
                "PATIENTID": { "sdtype": "id", "regex_format": "\\d{2}-\\d{2}-\\d{1}[A-Z]-\\d{4}-[A-Z]-\\d{5}" },
                "TESTTYPEID": { "sdtype": "categorical" },
                "TESTDATE": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" },
                "RESULTDATE": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" },
                "RESULTNUMERIC": { "sdtype": "numerical" },
                "RESULTNOTES": { "sdtype": "text" },
                "IMPORTFILE": { "sdtype": "categorical" },
                "FACILITY": { "sdtype": "categorical" },
                "ENTRY_DATE": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" }
            }
        },
        "medication_data": {
            "primary_key": "PATIENTID_VISITDATE_DRUGTYPEID",
            "columns": {
                "PATIENTID_VISITDATE": { "sdtype": "id", "regex_format": "\\d{2}-\\d{2}-\\d{1}[A-Z]-\\d{4}-[A-Z]-\\d{5}_\\d{2}-\\d{2}-\\d{4}" },
                "PATIENTID_VISITDATE_DRUGTYPEID": { "sdtype": "id", "regex_format": "\\d{2}-\\d{2}-\\d{1}[A-Z]-\\d{4}-[A-Z]-\\d{5}_\\d{2}-\\d{2}-\\d{4}_[A-Z]+" },
                "PATIENTID": { "sdtype": "id", "regex_format": "\\d{2}-\\d{2}-\\d{1}[A-Z]-\\d{4}-[A-Z]-\\d{5}" },
                "VISITDATE": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" },
                "DRUGTYPEID": { "sdtype": "categorical" },
                "QUANTITYDISPENSED": { "sdtype": "numerical" },
                "PERCENTADHERENCE": { "sdtype": "numerical" },
                "NOTES": { "sdtype": "text" },
                "IMPORTFILE": { "sdtype": "categorical" },
                "FACILITY": { "sdtype": "categorical" },
                "ENTRY_DATE": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" }
            }
        }
    },
    "relationships": [
        {
            "parent_table_name": "visits_data",
            "child_table_name": "test_data",
            "parent_primary_key": "PATIENTID_VISITDATE",
            "child_foreign_key": "PATIENTID_VISITDATE"
        },
        {
            "parent_table_name": "visits_data",
            "child_table_name": "medication_data",
            "parent_primary_key": "PATIENTID_VISITDATE",
            "child_foreign_key": "PATIENTID_VISITDATE"
        }
    ]
}
npatki commented 1 month ago

Hi @shirtevet thanks for providing the metadata. Unfortunately, our team was unable to replicate these exact results you were having -- even with HMA, we were able to find that it produces reasonable results for the DrugTypeID column. However, we don't have your data so we are just using random data.

One thing to note is that HMA specifically may have issues when you have a lower cardinality (branching factor). Would you be able to share this info with us: Generally, how many test_data rows and medications_data rows do you have attached to a particular visits_data row?

Almost all the time, switching to HSASynthesizer has produced better results for anyone running into this. You can see a very similar discussion here, where I've shown a comparison of HMA vs. HSA data. (I understand though that you need to be an SDV Enterprise user to use HSA.)

Another explanation - this is a test result column and there are many types of tests that can be seen in another column. Is there a possibility that the range in the column will be according to the type of test?

Are you saying that the TESTTYPEID (categorical) determines the scale of the values that you can have in column RESULTNUMERIC? For example if a test is for systolic blood pressure it can range from 100-200 but if it is for something else like A1C then it will have a completely different range like 4.0-7.0?

If this is the case, there is good news and bad -- The good is that the team working on allowing you to solve this exact problem with a new type of constraint. Unfortunately, this constraint would also only be available to SDV Enterprise users.

If you are interested in SDV Enterprise, you can get in contact via the support page. Due to limitations of HMA/our public features, I can't think of much to do besides use SDV Enterprise. But I will reply here if anything else comes to mind.