Closed: npatki closed this issue 7 months ago
Hello,
I just wanted to confirm my understanding of the problem:
1. For each member (row in MemInput_COM_2019), you are looking at the associated drugs (rows in PharmInput_COM_2019) as well as the associated medical diagnoses (rows in MedInput_COM_2019).
2. Some of these associations are not realistic. For example, you may be seeing a specific drug (Acetaminophen) that is not useful for a diagnosis (Diabetes).
Could you confirm if this is accurate?
It would also be useful if you could provide a bit more information about how the three tables are connected and what they represent.
In MedInput_COM_2019, I see that there are 25 columns for ICD Diagnoses. Does this mean each row can contain up to 25 diagnoses?
1. Are there members (rows in MemInput_COM_2019) that do not have any associated diagnoses or drugs?
2. Is it possible for a member (row in MemInput_COM_2019) to correspond to 2 or more rows in MedInput_COM_2019? Or is it at most 1 row?
3. Is it possible for a member (row in MemInput_COM_2019) to correspond to 2 or more rows in PharmInput_COM_2019? Or is it at most 1 row?
Yes, for points 1 and 2 mentioned above, they are accurate. Our initial questions would be why there are out-of-range date values and N/A's, given that there are no N/A's in columns like NDC, FillDate or MR_Allowed in the original datasets.
For the questions on MedInput_COM_2019: yes, there are up to 25 diagnoses possible per person, and if there are fewer than 25 diagnoses the remaining columns are left blank. For the questions on restrictions on the number of connections between the tables: there are no restrictions, i.e. there could be members without any Med or Pharm rows, and there could be other members with more than one Med or Pharm row, or both.
Thanks for the information. Very helpful. We can focus on this:
Our initial questions would be why there are out-of-range date values and N/A's given no N/A's for columns like NDC, FillDate or MR_Allowed etc in the original datasets.
You are saying that the real data does not have any missing values (all values are filled in), but the synthetic data does have missing values.
In this case, I believe the root cause is issue #1691 -- there is currently a bug in the HMASynthesizer that we hope to fix soon. I have included 2 possible workarounds in that issue.
By default, the HMASynthesizer should note down the min/max value of each column in the original data. It should ensure that the synthetic data does not go out-of-bounds. Is this not the case for your data?
Would you be able to provide more details as to which particular column(s) this is happening for?
Better yet -- I would recommend running the Diagnostic Report on the real vs. synthetic data. This report is designed to capture and provide more insights into the exact problems you're mentioning (inventing new values like NaN, and going out-of-bounds). If the score is not 1.0 here, it means there is a bug. You can share with us any detailed breakdowns where you are noticing that the score is <1.0.
Sure, will see if a diagnostic report can be generated.
Just updated to sdv 1.9.0, and the learning process of HMASynthesizer.fit finished with the same set of tables, i.e. the parent MemInput_COM_2019 table linked to the two child tables PharmInput_COM_2019 and MedInput_COM_2019 by Member_ID. However, the following error messages were generated when calling HMASynthesizer.sample. Do you know why?
C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\multi_table\hma.py:444: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set pd.set_option('future.no_silent_downcasting', True)
flat_parameters = parent_row[keys].fillna(0)
[the FutureWarning above is repeated many times]
C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\scipy\stats\_continuous_distns.py:700: RuntimeWarning: Error in function boost::math::tgamma
Traceback (most recent call last):
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\sdv\data_processing\data_processor.py", line 906, in reverse_transform
reversed_data[column_name] = reversed_data[column_name].astype(dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\generic.py", line 6637, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\managers.py", line 431, in astype
return self.apply(
^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\managers.py", line 364, in apply
applied = getattr(b, f)(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\internals\blocks.py", line 758, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 237, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 182, in astype_array
values = _astype_nansafe(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 101, in _astype_nansafe
return _astype_float_to_int_nansafe(arr, dtype, copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\YunTien.Lee\AppData\Local\miniconda3\Lib\site-packages\pandas\core\dtypes\astype.py", line 145, in _astype_float_to_int_nansafe
raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "
Also in sdv 1.9.0 there is no HSASynthesizer?
Traceback (most recent call last):
File "
Hi @leeyuntien, thanks for getting back. Were you able to resolve the original problem at the beginning of this issue? Or are you retrying everything with the newest SDV version now?
the following error messages were generated when calling HMASynthesizer.sample. Do you know why?
This is strange indeed, because the actual line of code that is causing the issue is not supposed to crash. We are actually catching the ValueError and allowing the sampling to proceed. The fact that yours crashes anyway (with a ValueError) probably means the newest version of SDV (1.9.0) is not being used for some reason.
In the past, I've noticed that there are sometimes caching issues if you are using a notebook-type environment. To sanity check, could you run the following and verify that it prints '1.9.0'?
import sdv
print(sdv.__version__)
Also in sdv 1.9.0 there is no HSASynthesizer?
The HSASynthesizer is available in the SDV Enterprise SDK, not the public SDV. To get access to the SDV Enterprise SDK, you'd need to purchase a license with us.
More resources:
sdv version
Hi @leeyuntien, thanks for confirming. We were able to dig in a little further, and it looks like it is actually happening due to the same cause as issue #1691 (linked above). Have you tried the workarounds listed in that issue (using 'norm' or 'truncnorm')?
Something else that might help as a workaround: if any columns are stored as integers in memory (in Python), I would recommend casting them to float for the sake of running them through SDV. To see which column(s) are represented as ints, you can run the following for each of the table names:
print(data[TABLE_NAME].dtypes)
Then you can convert any columns that are listed as int or int64 into floats:
data[TABLE_NAME][COLUMN_NAME] = data[TABLE_NAME][COLUMN_NAME].astype('float')
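The snippet above handles one column at a time. As a minimal sketch (the table and values below are placeholders, not the real claims data), the same cast can be applied across every integer column of every table in one loop:

```python
import pandas as pd

# Placeholder data dict standing in for the real tables loaded for SDV.
data = {
    'mem': pd.DataFrame({
        'Member_ID': [1, 2, 3],
        'Exposure_Months': [12, 6, 9],  # stored as int64
        'Gender': ['F', 'M', 'F'],
    }),
}

# Cast every integer column to float before fitting the synthesizer.
for table_name, table in data.items():
    for column_name in table.select_dtypes(include='int').columns:
        table[column_name] = table[column_name].astype('float')
```

Note that this also casts key columns such as Member_ID; depending on how the metadata defines the keys, you may want to exclude them from the loop.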
The good news is that we are actively working on the underlying issue and hope to have a fix up in the near future. Thanks for bearing with us.
Just tried the workarounds listed in the issue but still got this message. Will change int to float to test.
Traceback (most recent call last):
  File "C:\Users\YunTien.Lee\AppData\Roaming\Python\Python38\site-packages\sdv\data_processing\data_processor.py", line 906, in reverse_transform
    reversed_data[column_name] = reversed_data[column_name].astype(dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5546, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 595, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 406, in apply
    applied = getattr(b, f)(**kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 595, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 966, in astype_nansafe
    raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "
Sounds good. The change to 'truncnorm' or 'norm' generally makes it less likely to run into the problem, but it is not guaranteed. I hope the workaround from int to float is able to resolve the crash.
All datasets were put into the fitting process, but only a 0.01 portion was used for sampling.
mem table seems normal
However, there are still NaN's and NaT's in med and pharm tables.
Great to hear that it's no longer crashing! This was the immediate goal so at least you have some synthetic data to work with for v1.9.0.
The NaN values are expected right now due to issue #1691. Since the suggested workaround* is not guaranteed, you would have to wait until we resolve this issue. Rest assured that we are actively looking into the root cause and hope to have a resolution in a future release.
*The suggested workaround is to use 'truncnorm' (or 'norm'). You may want to try using 'truncnorm' in addition to converting the columns to floats. This, too, would be a temporary workaround that is not 100% guaranteed at the moment.
Hi @leeyuntien -- good news! We have released an updated version of SDV (v1.10.0) that should resolve this issue.
You should no longer have to apply any workarounds. The HMASynthesizer should now be able to run by default without running into any Errors and without creating any unnecessary NaN/NaT values.
Please upgrade to the latest version and give it a try. If you continue to run into this problem, feel free to reply and we can always re-open the issue to continue the investigation. (For any other problems unrelated to NaNs, please feel free to file a new issue.) Thanks.
sdv has been updated to 1.10.0 but there are still NaNs and NaTs in the synthesized datasets even if there are none of them in the source datasets, can you advise other ways to deal with it?
Hi @leeyuntien-milli, sorry to hear that. I'm reopening the issue for discussion.
Just to confirm: upgrading to SDV 1.10.0 means that you'd have to create and train a new synthesizer on 1.10.0 (it is not sufficient to load a pre-existing synthesizer on 1.10.0). Can you confirm that this is what you've done?
Since our bug fix went out to 1.10.0, I'm wondering if something else is going on now. (I can confirm that our HSA algorithm works ok, but it seems maybe something is still wrong with the public HMA.) I am wondering if you could provide more information?
Sharing the output of metadata.visualize() will be insightful. That will help us narrow down what's going wrong.
print(metadata.visualize())
digraph Metadata {
  node [fillcolor=lightgoldenrod1 shape=Mrecord style=filled]
  mem [label="{mem|Member_ID : id\lDOB : datetime\lGender : categorical\lExposure_Months : numerical\l|Primary key: Member_ID\l}"]
  med [label="{med|Member_ID : id\lClaimID : unknown\lFromDate : datetime\lToDate : datetime\lPaidDate : datetime\lICDDiag01 : categorical\lICDDiag02 : categorical\lICDDiag03 : categorical\lICDDiag04 : categorical\lICDDiag05 : categorical\lICDDiag06 : categorical\lICDDiag07 : categorical\lICDDiag08 : categorical\lICDDiag09 : categorical\lICDDiag10 : categorical\lICDDiag11 : categorical\lICDDiag12 : categorical\lICDDiag13 : categorical\lICDDiag14 : categorical\lICDDiag15 : categorical\lICDDiag16 : categorical\lICDDiag17 : categorical\lICDDiag18 : categorical\lICDDiag19 : categorical\lICDDiag20 : categorical\lICDDiag21 : categorical\lICDDiag22 : categorical\lICDDiag23 : categorical\lICDDiag24 : categorical\lICDDiag25 : categorical\lICDDiag26 : categorical\lICDDiag27 : categorical\lICDDiag28 : categorical\lICDDiag29 : categorical\lICDDiag30 : categorical\lProcCode : categorical\lPOS : categorical\lMR_Allowed : numerical\lMR_Paid : numerical\l|Primary key: None\lForeign key (mem): Member_ID\l}"]
  pharm [label="{pharm|Member_ID : id\lNDC : categorical\lClaimID : unknown\lFillDate : datetime\lProviderID : categorical\lMR_Allowed : numerical\lMR_Paid : numerical\lDays_Supplied : numerical\lQty_Dispensed : numerical\l|Primary key: None\lForeign key (mem): Member_ID\l}"]
  mem -> med [label=" Member_ID → Member_ID" arrowhead=oinv]
  mem -> pharm [label=" Member_ID → Member_ID" arrowhead=oinv]
}
There are no NA's in synthetic_data['mem']. synthetic_data['med'] shows NaT's only in columns ['FromDate', 'ToDate', 'PaidDate']. synthetic_data['pharm'] shows NaT's in column ['FillDate'] and NaN's in columns ['MR_Allowed', 'MR_Paid', 'Days_Supplied', 'Qty_Dispensed'].
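To pin down exactly which columns are affected, a quick pandas check over the sampled tables (sketched here with a fabricated stand-in for the output of synthesizer.sample()) can list the missing-value counts per column:

```python
import pandas as pd

# Fabricated stand-in for the output of synthesizer.sample().
synthetic_data = {
    'pharm': pd.DataFrame({
        'FillDate': pd.to_datetime(['2019-03-01', None]),
        'MR_Allowed': [10.0, float('nan')],
        'NDC': ['12345', '67890'],
    }),
}

# For each table, report only the columns that contain NaN/NaT values.
missing = {
    table_name: {col: int(n) for col, n in table.isna().sum().items() if n > 0}
    for table_name, table in synthetic_data.items()
}
print(missing)  # → {'pharm': {'FillDate': 1, 'MR_Allowed': 1}}
```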
Hi @leeyuntien-milli, could you copy-paste the visualization of the metadata produced by metadata.visualize()? Similar to what we have in the demo notebook, this command should render an actual image. Visuals are more helpful for us to understand your metadata.
Or, if it's easier, please share your metadata JSON (accessible via print(metadata) or metadata.save_to_json()). Thanks.
Example:
print(metadata)
{
  "tables": {
    "mem": {
      "primary_key": "Member_ID",
      "columns": {
        "Member_ID": { "sdtype": "id" }, "DOB": { "sdtype": "datetime" },
        "Gender": { "sdtype": "categorical" }, "Exposure_Months": { "sdtype": "numerical" }
      }
    },
    "med": {
      "columns": {
        "Member_ID": { "sdtype": "id" }, "ClaimID": { "sdtype": "unknown", "pii": true },
        "FromDate": { "sdtype": "datetime" }, "ToDate": { "sdtype": "datetime" }, "PaidDate": { "sdtype": "datetime" },
        "ICDDiag01": { "sdtype": "categorical" }, "ICDDiag02": { "sdtype": "categorical" }, "ICDDiag03": { "sdtype": "categorical" },
        "ICDDiag04": { "sdtype": "categorical" }, "ICDDiag05": { "sdtype": "categorical" }, "ICDDiag06": { "sdtype": "categorical" },
        "ICDDiag07": { "sdtype": "categorical" }, "ICDDiag08": { "sdtype": "categorical" }, "ICDDiag09": { "sdtype": "categorical" },
        "ICDDiag10": { "sdtype": "categorical" }, "ICDDiag11": { "sdtype": "categorical" }, "ICDDiag12": { "sdtype": "categorical" },
        "ICDDiag13": { "sdtype": "categorical" }, "ICDDiag14": { "sdtype": "categorical" }, "ICDDiag15": { "sdtype": "categorical" },
        "ICDDiag16": { "sdtype": "categorical" }, "ICDDiag17": { "sdtype": "categorical" }, "ICDDiag18": { "sdtype": "categorical" },
        "ICDDiag19": { "sdtype": "categorical" }, "ICDDiag20": { "sdtype": "categorical" }, "ICDDiag21": { "sdtype": "categorical" },
        "ICDDiag22": { "sdtype": "categorical" }, "ICDDiag23": { "sdtype": "categorical" }, "ICDDiag24": { "sdtype": "categorical" },
        "ICDDiag25": { "sdtype": "categorical" }, "ICDDiag26": { "sdtype": "categorical" }, "ICDDiag27": { "sdtype": "categorical" },
        "ICDDiag28": { "sdtype": "categorical" }, "ICDDiag29": { "sdtype": "categorical" }, "ICDDiag30": { "sdtype": "categorical" },
        "ProcCode": { "sdtype": "categorical" }, "POS": { "sdtype": "categorical" },
        "MR_Allowed": { "sdtype": "numerical" }, "MR_Paid": { "sdtype": "numerical" }
      }
    },
    "pharm": {
      "columns": {
        "Member_ID": { "sdtype": "id" }, "NDC": { "sdtype": "categorical" },
        "ClaimID": { "sdtype": "unknown", "pii": true }, "FillDate": { "sdtype": "datetime" },
        "ProviderID": { "sdtype": "categorical" }, "MR_Allowed": { "sdtype": "numerical" },
        "MR_Paid": { "sdtype": "numerical" }, "Days_Supplied": { "sdtype": "numerical" },
        "Qty_Dispensed": { "sdtype": "numerical" }
      }
    }
  },
  "relationships": [
    { "parent_table_name": "mem", "child_table_name": "med", "parent_primary_key": "Member_ID", "child_foreign_key": "Member_ID" },
    { "parent_table_name": "mem", "child_table_name": "pharm", "parent_primary_key": "Member_ID", "child_foreign_key": "Member_ID" }
  ],
  "METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}
Hi @leeyuntien-milli, thank you. I realize you had already sent the metadata before so apologies for the confusion.
Unfortunately, I am not able to reproduce this issue. I am providing some next steps to unblock you asap.
The SDV is designed to only generate NaN/NaT values if it recognizes that NaN/NaT are possible in the real data.
I would strongly recommend running the Diagnostic Report to see what's happening. We expect the score to be 100% (for more info, see the docs). What is the score for you?
from sdv.evaluation.multi_table import run_diagnostic
diagnostic_report = run_diagnostic(
real_data=data,
synthetic_data=synthetic_data,
metadata=metadata)
Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████| 52/52 [00:00<00:00, 366.64it/s]
(2/3) Evaluating Data Structure: : 100%|██████████| 3/3 [00:00<00:00, 137.16it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████| 2/2 [00:00<00:00, 36.78it/s]
Overall Score: 100.0%
Properties:
- Data Validity: 100.0%
- Data Structure: 100.0%
- Relationship Validity: 100.0%
If it is 100%, it indicates that the SDV is working as intended. The problem may be in how the data is loaded into Python. Python may be reading in some values as NaN or NaT. Let me know what the score is and we can discuss next steps.
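One way Python can silently introduce missing values at load time: pandas' CSV reader converts literal strings such as "NA" or "N/A" into NaN by default. A small sketch with made-up inline data shows the difference the keep_default_na flag makes:

```python
import io
import pandas as pd

# Made-up CSV snippet where the string "NA" is a real code, not a missing value.
csv_text = "NDC,MR_Allowed\nNA,10.5\n12345,7.0\n"

# Default parsing silently turns the string "NA" into NaN...
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df['NDC'].isna().sum())  # → 1

# ...while keep_default_na=False preserves it verbatim.
strict_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(strict_df['NDC'].tolist())  # → ['NA', '12345']
```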
Using your metadata, I created some random test data. Modeling and sampling using HMA, I did not observe any NaN or NaT values. I have attached it here. Could you try it out?
Generating report ...
(1/3) Evaluating Data Validity: : 100%|████████████████████████████████████████████████| 52/52 [00:10<00:00, 5.10it/s]
(2/3) Evaluating Data Structure: : 100%|████████████████████████████████████████████████| 3/3 [00:00<00:00, 192.03it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████████████████████████████████████| 2/2 [00:01<00:00, 1.83it/s]
Overall Score: 97.56%
Properties:
Using the test data there are still NaT's and NaN's, so maybe there are some settings that are not set properly here.
Hi @leeyuntien-milli, thanks for confirming.
Right, if the test data is also producing NaN/NaT values, I wonder if this is related to your Python environment or the way you're loading the data into Python. Could you please share the code you are using to read the data into Python? Along with anything you may be doing to modify that data once it's loaded into Python?
The recommended approach is to use the load_csvs function, as specified in our docs:
from sdv.datasets.local import load_csvs
from sdv.multi_table import HMASynthesizer
# assume you have unzipped tests_data.zip
data = load_csvs(folder_name='test_data/')
# should you need to inspect it, the data is available under each file name
med_table = data['med']
pharm_table = data['pharm']
mem_table = data['mem']
# NO further modification of the data is necessary
# you can directly use it with SDV
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
Please refer to the code below, which uses your suggested load_csvs function, but the results are similar. The three tables in test_data are put under the folder data/.
from sdv.multi_table import HMASynthesizer
from sdv.metadata import MultiTableMetadata
from sdv.evaluation.multi_table import run_diagnostic
from sdv.datasets.local import load_csvs
all_data = load_csvs(folder_name='data/')
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data = all_data)
synthesizer = HMASynthesizer(metadata)
for table_name in all_data.keys():
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'truncnorm'})
synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()
diagnostic_report = run_diagnostic(
real_data=all_data,
synthetic_data=synthetic_data,
metadata=metadata)
Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████████████████████████████████████████| 52/52 [00:00<00:00, 1683.87it/s]
(2/3) Evaluating Data Structure: : 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
(3/3) Evaluating Relationship Validity: : 100%|█████████████████████████████████████████| 2/2 [00:00<00:00, 128.01it/s]
Overall Score: 94.88%
Properties:
@leeyuntien-milli so using the same exact dataset and SDV version, your results are different than what we're seeing. Very interesting. This possibly indicates some issue with the version of other libraries or platform.
Could you provide more information about your setup? This includes the versions of the libraries in your environment (e.g. via pip freeze > requirements.txt).
Hi @leeyuntien-milli, thanks for the info. We realized that there is a key difference between my previous comment and the code you provided: in your code, you are using the set_table_parameters command to update the distribution to 'truncnorm'. Is this intentional?
For SDV 1.10.0, you no longer need to update the distribution. It works for me if you remove this and just directly fit the synthesizer.
all_data = load_csvs(folder_name='data/')
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data = all_data)
synthesizer = HMASynthesizer(metadata)
# directly fit the data
# no need to update the synthesizer
synthesizer.fit(all_data)
synthetic_data = synthesizer.sample()
diagnostic_report = run_diagnostic(
real_data=all_data,
synthetic_data=synthetic_data,
metadata=metadata)
Let me know if that works. In the meantime, we will investigate why truncnorm was causing it to create NaN values.
Thanks, test_data passed, so going to see if the original datasets work.
Generating report ...
(1/3) Evaluating Data Validity: : 100%|██████████████████████████████████████████████| 52/52 [00:00<00:00, 1662.86it/s]
(2/3) Evaluating Data Structure: : 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
(3/3) Evaluating Relationship Validity: : 100%|██████████████████████████████████████████████████| 2/2 [00:00<?, ?it/s]
Overall Score: 100.0%
Properties:
Fitting through the original datasets shows good results in terms of validity as well.
However we observe some issues which we hope can be resolved with some adjustments in package settings.
In med, FromDate, ToDate, PaidDate and the ICD codes would usually vary across different claims for the same member (Member_ID), but this does not seem to be the case in the synthesized data.
In pharm, FillDate and the NDC codes would usually vary across different claims for the same member (Member_ID), but this does not seem to be the case in the synthesized data.
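That observation can be made concrete with a groupby check (a sketch with fabricated rows; the column names mirror this thread): counting distinct values per member shows whether a column actually varies within a Member_ID.

```python
import pandas as pd

# Fabricated child table where member 1's claims all share one FillDate/NDC,
# i.e. the symptom described above, while member 2's claims vary.
pharm = pd.DataFrame({
    'Member_ID': [1, 1, 1, 2, 2],
    'FillDate': ['2019-01-05', '2019-01-05', '2019-01-05', '2019-02-01', '2019-03-15'],
    'NDC': ['111', '111', '111', '222', '333'],
})

# Distinct FillDate/NDC values per member; rows of all 1s indicate
# no within-member variation.
variation = pharm.groupby('Member_ID')[['FillDate', 'NDC']].nunique()
print(variation)
```

Running the same check on both the real and synthetic pharm tables and comparing the results would quantify how much within-member variation is lost.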
Hi @leeyuntien-milli, thanks for the detailed response. In the interest of keeping our space clean, we usually keep 1 GitHub issue open per technical problem. Since we were able to resolve the problem of NaNs (and this issue is getting pretty long), let me close this one. Let's use #1848 for this.
I'm filing this issue on behalf of a user.
Environment details
Problem description
We tried to fit an HMA synthesizer on three tables.
The tables are linked by one key, Member_ID. However, when we generated synthesized data at a 1% portion, the relationships between dates and the NDC and ICD codes do not seem to show up properly, as seen in the screenshots for the synthesized datasets. Can you advise how we might be able to improve it? Thanks.