sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Improving Multi-Table Synthetic Data (Healthcare dataset) -- NaN values getting created #1755

Closed npatki closed 7 months ago

npatki commented 8 months ago

I'm filing this issue on behalf of a user.

Environment details

Problem description

We tried running the HMA synthesizer on three tables.

The tables are linked by one key, Member_ID. However, when we generated synthetic data at a 1% scale, the relationships between dates and the NDC and ICD codes do not seem to show up properly, as the screenshots of the synthesized datasets show. Can you advise how we might be able to improve it? Thanks.

[screenshots of the synthesized datasets]

npatki commented 8 months ago

Hello,

I just wanted to confirm my understanding of the problem:

  1. The synthesizer is faithfully reconstructing NDC and ICD codes that were present in the original data. It is not inventing entirely new or invalid NDC/ICD codes -- such as random or missing values, or codes that do not make sense.
  2. For a given member (a row in MemInput_COM_2019), you are looking at the associated drugs (rows in PharmInput_COM_2019) as well as associated medical diagnoses (MedInput_COM_2019). Some of these associations are not realistic. For example, you may be seeing a specific drug (Acetaminophen) that is not useful for a diagnosis (Diabetes).

Could you confirm if this is accurate?

Additional Info

It would also be useful if you could provide a bit more information about how the three tables are connected/what they represent.

leeyuntien commented 8 months ago

Yes, points 1 and 2 above are accurate. Our initial question is why there are out-of-range date values and N/A's in the synthetic data, given that there are no N/A's in columns like NDC, FillDate or MR_Allowed in the original datasets.

Regarding MedInput_COM_2019: yes, there are up to 25 possible diagnoses per person, and if a person has fewer than 25 diagnoses the remaining columns are left blank. Regarding restrictions on the number of connections between the tables: there are none, i.e. there can be members without any Med or Pharm rows, and there can be members with more than one Med or Pharm row, or both.

npatki commented 8 months ago

Thanks for the information. Very helpful. We can focus on this:

Our initial questions would be why there are out-of-range date values and N/A's given no N/A's for columns like NDC, FillDate or MR_Allowed etc in the original datasets.

Missing Values

You are saying that the real data does not have any missing values (all values are filled in), but the synthetic data does have missing values.

In this case, I believe the root cause is issue #1691 -- there is currently a bug in the HMASynthesizer that we hope to fix soon. I have included 2 possible workarounds in that issue.

Out-of-Range Values

By default, the HMASynthesizer should note down the min/max value of each column in the original data. It should ensure that the synthetic data does not go out-of-bounds. Is this not the case for your data?

Would you be able to provide more details as to which particular column(s) this is happening for?

Better yet -- I would recommend running the Diagnostic Report on the real vs. synthetic data. This report is designed to capture and provide more insights into the exact problems you're mentioning (inventing new values like NaN, and going out-of-bounds). If the score is not 1.0 here, it means there is a bug. You can share with us any detailed breakdowns where you are noticing that the score is <1.0.
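To make the diagnostic workflow concrete, here is an end-to-end sketch on toy stand-in tables (the table names and values are illustrative, not the user's data): fit, sample, run the diagnostic, and drill into any per-column scores below 1.0.

```python
import pandas as pd
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer
from sdv.evaluation.multi_table import run_diagnostic

# toy stand-ins for the real tables; names are illustrative only
data = {
    'mem': pd.DataFrame({
        'Member_ID': ['m1', 'm2', 'm3', 'm4'],
        'Exposure_Months': [12.0, 6.0, 9.0, 3.0],
    }),
    'pharm': pd.DataFrame({
        'Member_ID': ['m1', 'm1', 'm2', 'm3'],
        'MR_Allowed': [10.5, 20.0, 5.0, 7.5],
    }),
}

metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data=data)

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(scale=1.0)

# compare real vs. synthetic data; every validity score should be 1.0
report = run_diagnostic(real_data=data, synthetic_data=synthetic_data, metadata=metadata)
details = report.get_details('Data Validity')
print(details[details['Score'] < 1.0])  # any rows here indicate a problem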

leeyuntien commented 8 months ago

Sure, we will see if a diagnostic report can be generated.

leeyuntien commented 8 months ago

Just updated to SDV 1.9.0, and the HMASynthesizer.fit learning process finished on the same set of tables, i.e. the parent table MemInput_COM_2019 linked to the two child tables PharmInput_COM_2019 and MedInput_COM_2019 by Member_ID. However, the following error messages were generated when calling HMASynthesizer.sample. Do you know why?

C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\hma.py:444: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set pd.set_option('future.no_silent_downcasting', True)
  flat_parameters = parent_row[keys].fillna(0)
[the FutureWarning above is repeated many times]
C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\scipy\stats\_continuous_distns.py:700: RuntimeWarning: Error in function boost::math::tgamma(%1%,%1%): Series evaluation exceeded %1% iterations, giving up now.
  return _boost._beta_ppf(q, a, b)
Traceback (most recent call last):
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\data_processing\data_processor.py", line 906, in reverse_transform
    reversed_data[column_name] = reversed_data[column_name].astype(dtype)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\generic.py", line 6637, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\managers.py", line 431, in astype
    return self.apply(
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\managers.py", line 364, in apply
    applied = getattr(b, f)(**kwargs)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\blocks.py", line 758, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 237, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 182, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 101, in _astype_nansafe
    return _astype_float_to_int_nansafe(arr, dtype, copy)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 145, in _astype_float_to_int_nansafe
    raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\base.py", line 393, in sample
    sampled_data = self._sample(scale=scale)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 222, in _sample
    self._sample_children(table_name=table, sampled_data=sampled_data)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 142, in _sample_children
    self._add_child_rows(
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 108, in _add_child_rows
    sampled_rows = self._sample_rows(child_synthesizer, num_rows)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\sampling\hierarchical_sampler.py", line 71, in _sample_rows
    return synthesizer._sample_batch(int(num_rows), keep_extra_columns=True)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\single_table\base.py", line 602, in _sample_batch
    sampled, num_valid = self._sample_rows(
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\single_table\base.py", line 519, in _sample_rows
    sampled = self._data_processor.reverse_transform(raw_sampled)
  File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\data_processing\data_processor.py", line 920, in reverse_transform
    raise ValueError(e)
ValueError: Cannot convert non-finite values (NA or inf) to integer

leeyuntien commented 8 months ago

Also in sdv 1.9.0 there is no HSASynthesizer?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'HSASynthesizer' from 'sdv.multi_table' (C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\__init__.py)

npatki commented 8 months ago

Hi @leeyuntien, thanks for getting back. Were you able to resolve the original problem at the beginning of this issue? Or are you retrying everything with the newest SDV version now?

Error Message

the following error messages were generated when calling HMASynthesizer.sample. Do you know why?

This is strange indeed, because the actual line of code that is causing the issue is not supposed to crash. We are actually catching the ValueError and allowing the sampling to proceed.

https://github.com/sdv-dev/SDV/blob/334ba02a3494ab1083915888d7fb06ec8ff0f86e/sdv/data_processing/data_processor.py#L905-L908

The fact that yours crashes anyways (with a ValueError) probably means the newest version of SDV (1.9.0) is not being used for some reason.

In the past, I've noticed that there are sometimes caching issues if you are using a notebook type environment. To sanity check, could you run the following and verify that it prints '1.9.0'?

import sdv
print(sdv.__version__)

HSA

Also in sdv 1.9.0 there is no HSASynthesizer?

The HSASynthesizer is available in the SDV Enterprise SDK, not the public SDV. To get access to the SDV Enterprise SDK, you'd need to purchase a license with us.

More resources:

leeyuntien commented 8 months ago

sdv version

[screenshot of the version check output]
npatki commented 8 months ago

Hi @leeyuntien, thanks for confirming. We were able to dig in a little further and it looks like this is actually happening due to the same root cause as issue #1691 (linked above). Have you tried the workarounds listed in that issue (using 'norm' or 'truncnorm')?

Something else that might help as a workaround: if any columns are stored as integers in memory (in Python), I would recommend casting them to floats for the sake of running them through SDV. To see which columns are represented as ints, you can run the following for each of the table names:

print(data[TABLE_NAME].dtypes)

Then you can convert any columns that are listed as int or int64 into floats:

data[TABLE_NAME][COLUMN_NAME] = data[TABLE_NAME][COLUMN_NAME].astype('float')
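If many tables and columns are involved, the cast can be automated. A pandas-only sketch, assuming toy stand-in tables (the names here are illustrative) and leaving key columns such as Member_ID untouched:

```python
import pandas as pd

# hypothetical toy tables standing in for the real ones
data = {
    'mem': pd.DataFrame({'Member_ID': [1, 2], 'Exposure_Months': [12, 6]}),
    'pharm': pd.DataFrame({'Member_ID': [1, 2], 'Days_Supplied': [30, 90]}),
}

# cast every integer column to float before fitting, except key columns
key_columns = {'Member_ID'}
for table_name, table in data.items():
    for column_name in table.select_dtypes(include='int').columns:
        if column_name not in key_columns:
            data[table_name][column_name] = table[column_name].astype('float')

print(data['pharm'].dtypes)
```

After fitting and sampling, the synthetic columns could be cast back to int if needed.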

The good news is that we are actively working on the underlying issue and hope to have a fix up in the near future. Thanks for bearing with us.

leeyuntien-milli commented 8 months ago

Just tried the workarounds listed in the issue but still got this message. Will change int to float to test.

Traceback (most recent call last):
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\data_processing\data_processor.py", line 906, in reverse_transform
    reversed_data[column_name] = reversed_data[column_name].astype(dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5546, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 595, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 406, in apply
    applied = getattr(b, f)(**kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 595, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 966, in astype_nansafe
    raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\multi_table\base.py", line 393, in sample
    sampled_data = self._sample(scale=scale)
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 222, in _sample
    self._sample_children(table_name=table, sampled_data=sampled_data)
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 142, in _sample_children
    self._add_child_rows(
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 108, in _add_child_rows
    sampled_rows = self._sample_rows(child_synthesizer, num_rows)
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\sampling\hierarchical_sampler.py", line 71, in _sample_rows
    return synthesizer._sample_batch(int(num_rows), keep_extra_columns=True)
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\single_table\base.py", line 602, in _sample_batch
    sampled, num_valid = self._sample_rows(
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\single_table\base.py", line 519, in _sample_rows
    sampled = self._data_processor.reverse_transform(raw_sampled)
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\data_processing\data_processor.py", line 920, in reverse_transform
    raise ValueError(e)
ValueError: Cannot convert non-finite values (NA or inf) to integer

npatki commented 8 months ago

Sounds good. The change to 'truncnorm' or 'norm' generally makes it less likely to run into the problem, but it is not guaranteed. I hope the int-to-float workaround is able to resolve the crash.

leeyuntien-milli commented 8 months ago

All datasets were put into the fitting process, but only a 0.01 portion was used to sample.

mem table seems normal

[screenshot of the synthesized mem table]

However, there are still NaN's and NaT's in med and pharm tables.

[screenshot of the synthesized med and pharm tables]
npatki commented 8 months ago

Great to hear that it's no longer crashing! This was the immediate goal so at least you have some synthetic data to work with for v1.9.0.

The NaN values are expected right now due to issue #1691. Since the suggested workaround* is not guaranteed, you would have to wait until we resolve this issue. Rest assured that we are actively looking into the root cause and hope to have a resolution in a future release.

*The suggested workaround is to use 'truncnorm' (or 'norm'). You may want to try using 'truncnorm' in addition to converting the columns to floats. This, too, is a temporary workaround that is not 100% guaranteed at the moment.

npatki commented 8 months ago

Hi @leeyuntien -- good news! We have released an updated version of SDV (v1.10.0) that should resolve this issue.

You should no longer have to apply any workarounds. The HMASynthesizer should now be able to run by default without running into any Errors and without creating any unnecessary NaN/NaT values.

Please upgrade to the latest version and give it a try. If you continue to run into this problem, feel free to reply and we can always re-open the issue to continue the investigation. (For any other problems unrelated to NaNs, please feel free to file a new issue.) Thanks.

leeyuntien-milli commented 7 months ago

sdv has been updated to 1.10.0 but there are still NaNs and NaTs in the synthesized datasets even if there are none of them in the source datasets, can you advise other ways to deal with it?

[screenshot of the synthesized data showing NaNs and NaTs]
npatki commented 7 months ago

Hi @leeyuntien-milli, sorry to hear that. I'm reopening the issue for discussion.

Just to confirm: upgrading to SDV 1.10.0 means that you'd have to create and train a new synthesizer on 1.10.0 (it is not sufficient to load a pre-existing synthesizer into 1.10.0). Can you confirm that this is what you've done?

Since our bug fix went out in 1.10.0, I'm wondering if something else is going on now. (I can confirm that our HSA algorithm works OK, but perhaps something is still wrong with the public HMA.) Could you provide more information?

That will help us narrow down what's going wrong.

leeyuntien-milli commented 7 months ago

print(metadata.visualize())

digraph Metadata {
  node [fillcolor=lightgoldenrod1 shape=Mrecord style=filled]
  mem [label="{mem|Member_ID : id\lDOB : datetime\lGender : categorical\lExposure_Months : numerical\l|Primary key: Member_ID\l}"]
  med [label="{med|Member_ID : id\lClaimID : unknown\lFromDate : datetime\lToDate : datetime\lPaidDate : datetime\lICDDiag01 : categorical\lICDDiag02 : categorical\lICDDiag03 : categorical\lICDDiag04 : categorical\lICDDiag05 : categorical\lICDDiag06 : categorical\lICDDiag07 : categorical\lICDDiag08 : categorical\lICDDiag09 : categorical\lICDDiag10 : categorical\lICDDiag11 : categorical\lICDDiag12 : categorical\lICDDiag13 : categorical\lICDDiag14 : categorical\lICDDiag15 : categorical\lICDDiag16 : categorical\lICDDiag17 : categorical\lICDDiag18 : categorical\lICDDiag19 : categorical\lICDDiag20 : categorical\lICDDiag21 : categorical\lICDDiag22 : categorical\lICDDiag23 : categorical\lICDDiag24 : categorical\lICDDiag25 : categorical\lICDDiag26 : categorical\lICDDiag27 : categorical\lICDDiag28 : categorical\lICDDiag29 : categorical\lICDDiag30 : categorical\lProcCode : categorical\lPOS : categorical\lMR_Allowed : numerical\lMR_Paid : numerical\l|Primary key: None\lForeign key (mem): Member_ID\l}"]
  pharm [label="{pharm|Member_ID : id\lNDC : categorical\lClaimID : unknown\lFillDate : datetime\lProviderID : categorical\lMR_Allowed : numerical\lMR_Paid : numerical\lDays_Supplied : numerical\lQty_Dispensed : numerical\l|Primary key: None\lForeign key (mem): Member_ID\l}"]
  mem -> med [label=" Member_ID → Member_ID" arrowhead=oinv]
  mem -> pharm [label=" Member_ID → Member_ID" arrowhead=oinv]
}

leeyuntien-milli commented 7 months ago

There are no NA's in synthetic_data['mem'].
synthetic_data['med'] shows NaT's only in columns ['FromDate', 'ToDate', 'PaidDate'].
synthetic_data['pharm'] shows NaT's in column ['FillDate'] and NaN's in columns ['MR_Allowed', 'MR_Paid', 'Days_Supplied', 'Qty_Dispensed'].

npatki commented 7 months ago

Hi @leeyuntien-milli, could you copy-paste the visualization of the metadata when you run metadata.visualize()? Similar to what we have in the demo notebook, this command should render an actual image. Visuals are more helpful for us to understand your metadata.

Or if it's easier, please share your metadata JSON (accessible by print(metadata) or metadata.save_to_json()). Thanks.

Example:

[example metadata visualization]
leeyuntien-milli commented 7 months ago

metadata.pdf

leeyuntien-milli commented 7 months ago

print(metadata)

{
  "tables": {
    "mem": {
      "primary_key": "Member_ID",
      "columns": {
        "Member_ID": {"sdtype": "id"},
        "DOB": {"sdtype": "datetime"},
        "Gender": {"sdtype": "categorical"},
        "Exposure_Months": {"sdtype": "numerical"}
      }
    },
    "med": {
      "columns": {
        "Member_ID": {"sdtype": "id"},
        "ClaimID": {"sdtype": "unknown", "pii": true},
        "FromDate": {"sdtype": "datetime"},
        "ToDate": {"sdtype": "datetime"},
        "PaidDate": {"sdtype": "datetime"},
        "ICDDiag01": {"sdtype": "categorical"},
        "ICDDiag02": {"sdtype": "categorical"},
        "ICDDiag03": {"sdtype": "categorical"},
        "ICDDiag04": {"sdtype": "categorical"},
        "ICDDiag05": {"sdtype": "categorical"},
        "ICDDiag06": {"sdtype": "categorical"},
        "ICDDiag07": {"sdtype": "categorical"},
        "ICDDiag08": {"sdtype": "categorical"},
        "ICDDiag09": {"sdtype": "categorical"},
        "ICDDiag10": {"sdtype": "categorical"},
        "ICDDiag11": {"sdtype": "categorical"},
        "ICDDiag12": {"sdtype": "categorical"},
        "ICDDiag13": {"sdtype": "categorical"},
        "ICDDiag14": {"sdtype": "categorical"},
        "ICDDiag15": {"sdtype": "categorical"},
        "ICDDiag16": {"sdtype": "categorical"},
        "ICDDiag17": {"sdtype": "categorical"},
        "ICDDiag18": {"sdtype": "categorical"},
        "ICDDiag19": {"sdtype": "categorical"},
        "ICDDiag20": {"sdtype": "categorical"},
        "ICDDiag21": {"sdtype": "categorical"},
        "ICDDiag22": {"sdtype": "categorical"},
        "ICDDiag23": {"sdtype": "categorical"},
        "ICDDiag24": {"sdtype": "categorical"},
        "ICDDiag25": {"sdtype": "categorical"},
        "ICDDiag26": {"sdtype": "categorical"},
        "ICDDiag27": {"sdtype": "categorical"},
        "ICDDiag28": {"sdtype": "categorical"},
        "ICDDiag29": {"sdtype": "categorical"},
        "ICDDiag30": {"sdtype": "categorical"},
        "ProcCode": {"sdtype": "categorical"},
        "POS": {"sdtype": "categorical"},
        "MR_Allowed": {"sdtype": "numerical"},
        "MR_Paid": {"sdtype": "numerical"}
      }
    },
    "pharm": {
      "columns": {
        "Member_ID": {"sdtype": "id"},
        "NDC": {"sdtype": "categorical"},
        "ClaimID": {"sdtype": "unknown", "pii": true},
        "FillDate": {"sdtype": "datetime"},
        "ProviderID": {"sdtype": "categorical"},
        "MR_Allowed": {"sdtype": "numerical"},
        "MR_Paid": {"sdtype": "numerical"},
        "Days_Supplied": {"sdtype": "numerical"},
        "Qty_Dispensed": {"sdtype": "numerical"}
      }
    }
  },
  "relationships": [
    {"parent_table_name": "mem", "child_table_name": "med", "parent_primary_key": "Member_ID", "child_foreign_key": "Member_ID"},
    {"parent_table_name": "mem", "child_table_name": "pharm", "parent_primary_key": "Member_ID", "child_foreign_key": "Member_ID"}
  ],
  "METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}

npatki commented 7 months ago

Hi @leeyuntien-milli, thank you. I realize you had already sent the metadata before so apologies for the confusion.

Unfortunately, I am not able to reproduce this issue. I am providing some next steps to unblock you asap.

Running Diagnostics

The SDV is designed to only generate NaN/NaT values if it recognizes that NaN/NaT are possible in the real data.

I would strongly recommend running the diagnostic report to see what's happening. We expect the score to be 100% (see the docs for more info). What is the score for you?

from sdv.evaluation.multi_table import run_diagnostic

diagnostic_report = run_diagnostic(
    real_data=data,
    synthetic_data=synthetic_data,
    metadata=metadata)
Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████| 52/52 [00:00<00:00, 366.64it/s]
(2/3) Evaluating Data Structure: : 100%|██████████| 3/3 [00:00<00:00, 137.16it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████| 2/2 [00:00<00:00, 36.78it/s]

Overall Score: 100.0%

Properties:
- Data Validity: 100.0%
- Data Structure: 100.0%
- Relationship Validity: 100.0%

If it is 100%, it indicates that the SDV is working as intended. The problem may be in how the data is loaded into Python. Python may be reading in some values as NaN or NaT. Let me know what the score is and we can discuss next steps.
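One common way NaNs can sneak in at load time: pandas treats certain literal strings (such as "NA" or "NULL") as missing values by default when reading CSVs. A small pandas-only illustration (the column name and values here are made up):

```python
import io
import pandas as pd

csv_text = "Member_ID,ProcCode\nm1,NA\nm2,A123\n"

# default behavior: the literal string 'NA' is silently parsed as missing
default = pd.read_csv(io.StringIO(csv_text))
print(default['ProcCode'].isna().sum())    # 1 missing value appears

# keep_default_na=False keeps 'NA' as an ordinary string code
preserved = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(preserved['ProcCode'].isna().sum())  # 0 missing values
```

If the real data contains codes that collide with pandas' default NA strings, the synthesizer will legitimately learn to produce missing values.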

Running Test Data

Using your metadata, I created some random test data. Modeling and sampling using HMA, I did not observe any NaN or NaT values. I have attached it here. Could you try it out?

test_data.zip

leeyuntien-milli commented 7 months ago

Generating report ...
(1/3) Evaluating Data Validity: : 100%|████████████████████████████████████████████████| 52/52 [00:10<00:00, 5.10it/s]
(2/3) Evaluating Data Structure: : 100%|████████████████████████████████████████████████| 3/3 [00:00<00:00, 192.03it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████████████████████████████████████| 2/2 [00:01<00:00, 1.83it/s]

Overall Score: 97.56%

Properties:

leeyuntien-milli commented 7 months ago

Using the test data there are still NaT's and NaN's, so maybe there are some settings that are not set properly here.

[screenshots of the synthesized test data]
npatki commented 7 months ago

Hi @leeyuntien-milli, thanks for confirming.

Right, if the test data is also producing NaN/NaT values, I wonder if this is related to your Python environment or the way you're loading the data into Python. Could you please share the code you are using to read the data into Python, along with anything you may be doing to modify that data once it's loaded?

The recommended approach is to use the load_csvs function, as specified in our docs:

from sdv.datasets.local import load_csvs
from sdv.multi_table import HMASynthesizer

# assume you have unzipped test_data.zip
data = load_csvs(folder_name='test_data/')

# should you need to inspect it, the data is available under each file name
med_table = data['med']
pharm_table = data['pharm']
mem_table = data['mem']

# NO further modification of the data is necessary
# you can directly use it with SDV
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
leeyuntien-milli commented 7 months ago

Please refer to the code below, which uses your suggested load_csvs function, but the results are similar. The three tables from test_data were put under the folder data/.

from sdv.multi_table import HMASynthesizer
from sdv.metadata import MultiTableMetadata
from sdv.evaluation.multi_table import run_diagnostic
from sdv.datasets.local import load_csvs

all_data = load_csvs(folder_name='data/')
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data = all_data)
synthesizer = HMASynthesizer(metadata)

for table_name in all_data.keys():
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'truncnorm'})

synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
    real_data=all_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████████████████████████████████████████| 52/52 [00:00<00:00, 1683.87it/s]
(2/3) Evaluating Data Structure: : 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
(3/3) Evaluating Relationship Validity: : 100%|█████████████████████████████████████████| 2/2 [00:00<00:00, 128.01it/s]

Overall Score: 94.88%

Properties:

npatki commented 7 months ago

@leeyuntien-milli so using the exact same dataset and SDV version, your results are different from what we're seeing. Very interesting. This possibly indicates an issue with the versions of other libraries, or with the platform.

Could you provide more information about your setup? This includes:

leeyuntien-milli commented 7 months ago
npatki commented 7 months ago

Hi @leeyuntien-milli, thanks for the info. We realized that there is a key difference between my previous comment and the code you provided: in your code, you are using the set_table_parameters command to update the distribution to 'truncnorm'. Is this intentional?

For SDV 1.10.0, you no longer need to update the distribution. It works for me if you remove this and just directly fit the synthesizer.

all_data = load_csvs(folder_name='data/')
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data = all_data)
synthesizer = HMASynthesizer(metadata)

# directly fit the data
# no need to update the synthesizer
synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()

diagnostic_report = run_diagnostic(
    real_data=all_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

Let me know if that works. In the meantime, we will investigate why truncnorm was causing it to create NaN values.

leeyuntien-milli commented 7 months ago

Thanks, test_data passed, so I am going to see if the original datasets work.

Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████████████████████████████████████████| 52/52 [00:00<00:00, 1662.86it/s]
(2/3) Evaluating Data Structure: : 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████████████████████████████████████████████| 2/2 [00:00<?, ?it/s]

Overall Score: 100.0%

Properties:

leeyuntien-milli commented 7 months ago

Fitting on the original datasets shows good results in terms of validity as well.

[screenshot of the diagnostic results]

However, we observe some issues which we hope can be resolved with some adjustments to the package settings. In med, FromDate, ToDate, PaidDate and the ICD codes would usually vary across different claims for the same member (Member_ID), but this does not seem to be the case in the synthesized data.

[screenshot of the synthesized med table]

In pharm, FillDate and the NDC codes would usually vary across different claims for the same member (Member_ID), but this does not seem to be the case in the synthesized data.

[screenshot of the synthesized pharm table]
npatki commented 7 months ago

Hi @leeyuntien-milli, thanks for the detailed response. In the interest of keeping our space clean, we usually keep 1 GitHub issue open per technical problem. Since we were able to resolve the problem of NaNs (and this issue is getting pretty long), let me close this one. Let's use #1848 for the remaining topic.