sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

PARSynthesizer trying to allocate an absurd amount of memory for a small dataset #2012

Closed JonathanBhimani-Burrows closed 3 months ago

JonathanBhimani-Burrows commented 4 months ago


Error Description

I'm trying to use the PARSynthesizer to create synthetic data to improve model performance. However, it seems to have some serious issues when fitting the data. I might be missing something, but here is the setup:

{ "columns": { "User_Lookup_Id": { "sdtype": "id" }, "Revenue_Date": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d" }, "Revenue_Amount": { "sdtype": "numerical" }, "User_First_Name": { "pii": false, "sdtype": "first_name" }, "Gender": { "sdtype": "categorical" }, "Address_City": { "sdtype": "categorical" }, "Primary_Address": { "sdtype": "categorical" }, "Average_Income": { "sdtype": "numerical" }, "Social_Group_Name": { "sdtype": "categorical" }, "Spouse": { "sdtype": "categorical" }, "Active_Email": { "pii": false, "sdtype": "email" }, "dummy_income": { "sdtype": "numerical", "computer_representation": "Float" } },

The context columns are ['User_First_Name', 'Address_City', 'Gender', 'Primary_Address', 'Social_Group_Name', 'Spouse_Is_Active', 'Active_Email', 'dummy_income']

synthesizer = PARSynthesizer(metadata, verbose=True, context_columns=context_cols, enforce_min_max_values=True)

When I try to fit the model, it tries to allocate 451 GB of GPU memory for 168k rows, which is absurd. Setting segment_size seems to alleviate this, but it limits the model to only producing segments of length segment_size, which is problematic if you have sequences longer than segment_size. (I can't upload the data for confidentiality reasons, unfortunately.)
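For reference, the segment_size workaround looks roughly like this (a minimal sketch; segment_size=60 is a placeholder value, not my actual setting):

```python
from sdv.sequential import PARSynthesizer

# Sketch only: segment_size caps the sequence length used during training,
# which reduces memory, but it also caps the length of generated sequences.
synthesizer = PARSynthesizer(
    metadata,
    context_columns=context_cols,
    enforce_min_max_values=True,
    verbose=True,
    segment_size=60,  # placeholder value, trades memory for max sequence length
)
synthesizer.fit(real_data)  # real_data: the 168k-row DataFrame (not shareable)
```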

Is there something I'm missing? Is this expected behavior? Because if so, some optimization is necessary, as 168k rows is not very much data to train the model on (the full dataset is 28 million rows). Is there not a batch size parameter that could be configured for this model?

Thanks for your help

npatki commented 4 months ago

Hi @JonathanBhimani-Burrows, thanks for reaching out and for sharing your metadata. One thing I notice is that some of the columns are marked as categorical or have pii set to False. For SDV to run well, I think the metadata should be updated.

Brief description of how SDV works:

Changes I would make to your metadata: the following columns should probably be anonymized (a sketch of these updates follows the list) --

  1. First name and email should have pii set to True to anonymize them instead of trying to learn patterns within them
  2. Primary address should be sdtype 'address' or 'street_address' with pii also set to True.
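As a rough sketch, assuming you are using the SingleTableMetadata API, those updates would look something like this (the 'address' sdtype is a suggestion; adjust to your schema):

```python
# Sketch of the suggested metadata changes (suggestions only, not requirements)
metadata.update_column(column_name='User_First_Name', sdtype='first_name', pii=True)
metadata.update_column(column_name='Active_Email', sdtype='email', pii=True)
metadata.update_column(column_name='Primary_Address', sdtype='address', pii=True)
```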

Resources:

JonathanBhimani-Burrows commented 4 months ago

Thanks for the reply, but this didn't really answer the question. Setting both the first name and primary address columns to pii = False was a design decision: I want both of those to be used to help determine the output. Having said that, back to the original discussion: is this expected behavior? Does the model genuinely use up an enormous amount of VRAM to instantiate? Is there no option for batch sizes?

npatki commented 4 months ago

Hi @JonathanBhimani-Burrows you are welcome.

> is this expected behavior?

I cannot give you an answer without learning more. Metadata is intrinsically related to performance, and I’ve seen multiple cases where a PII/categorical mixup has led to issues. I appreciate you sharing that these columns are meant to be categorical, something I’m curious to know more about.

I suspect that the columns you’ve marked as categorical may be high cardinality, which is known to cause issues (expected). If you are willing to entertain an experiment, updating them to PII like I previously mentioned will help verify (or rule out) this guess — regardless of what your intended usage may be.

> Does the model genuinely use up an enormous amount of VRAM to instantiate?

My experience is that PAR is usable with a local machine’s RAM in many cases, especially when it's set up with PII columns and a sequence_index.
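For example, a minimal sketch of that kind of setup, assuming User_Lookup_Id identifies each sequence and Revenue_Date provides the ordering within a sequence:

```python
# Sketch only: tell the metadata which column identifies a sequence and which
# column orders the rows within it, so PAR models ordered sequences rather
# than one flat table.
metadata.set_sequence_key(column_name='User_Lookup_Id')
metadata.set_sequence_index(column_name='Revenue_Date')
```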

But we are aware of some performance issues popping up over time in #1965, and we are working to uncover root cause(s). I see you’ve also replied there.

> Is there no option for batch sizes?

All available parameters are documented on our website. We keep our docs up-to-date, so if you are already referencing those, then you are in the right place!

Just based on the algorithm definition of PAR, batching within a sequence is not trivial. For more info, see Preprint.
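To illustrate the point conceptually (this is a generic autoregressive loop, not SDV code): each step's hidden state depends on the previous one, so the steps of a single sequence cannot be split into independent mini-batches without breaking that dependency.

```python
# Conceptual sketch (not the PAR implementation): a generic autoregressive
# recurrence. h_t depends on h_{t-1}, so the steps of one sequence must be
# processed in order; splitting a sequence into independent chunks loses the
# state that earlier steps would normally carry forward.
def run_sequence(steps, step_fn, h0=0.0):
    h = h0
    outputs = []
    for x in steps:
        h = step_fn(h, x)  # the next hidden state needs the previous one
        outputs.append(h)
    return outputs

# Toy example: a "hidden state" that accumulates the inputs it has seen.
print(run_sequence([1, 2, 3], lambda h, x: 0.5 * h + x))
```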

npatki commented 3 months ago

Hello, are you still working on this project? This issue hasn't been updated in a month, and since it is an offshoot of #1965, I will close it in favor of that one. Please feel free to reply if there is more to discuss, or if you were able to experiment with marking those columns as PII for the sake of measuring performance.