sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Install dependencies only for the synthesizers I want to use #1621

Open vamsinilla opened 1 year ago

vamsinilla commented 1 year ago

Error Description

I'm trying out a sample application that uses SDV and deploying it on an Azure free trial. The artifact size grows to 5 GB when all dependencies are installed. How can I reduce the artifact size, given that I only use the Gaussian Copula single-table synthesizer?

npatki commented 1 year ago

Hi @vamsinilla, nice to meet you and very good question. The SDV package itself is only a few KB so I expect that most of the bloat is coming from our external dependencies. I'm not 100% sure, but it seems to me that torch is probably what's causing the biggest bloat.

Unfortunately, there isn't a good way to isolate these dependencies within the SDV right now. For example, if you uninstall torch, then importing GaussianCopulaSynthesizer will crash even though Gaussian Copula does not strictly depend on torch. (The import path checks that all of SDV's dependencies are present, not just those needed for Gaussian Copula.)
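For context on what selective dependencies could look like, here is a minimal sketch of the general lazy-import pattern. This is not how SDV is implemented; `optional_import`, `CopulaLikeSynthesizer`, and `NeuralLikeSynthesizer` are hypothetical names used only to illustrate the idea that a heavy package is checked when a feature is used rather than when the library is imported.

```python
import importlib
import importlib.util


def optional_import(module_name, feature):
    """Import a top-level module only at the point where a feature needs it."""
    if importlib.util.find_spec(module_name) is None:
        raise ModuleNotFoundError(
            f"'{module_name}' is required for {feature}. "
            f"Install it to use this feature."
        )
    return importlib.import_module(module_name)


class CopulaLikeSynthesizer:
    """Hypothetical synthesizer with no deep learning dependency."""

    def fit(self, rows):
        self.n_rows = len(rows)
        return self


class NeuralLikeSynthesizer:
    """Hypothetical synthesizer that needs a heavy backend."""

    backend = "torch"

    def fit(self, rows):
        # The heavy dependency is resolved here, not at module import time,
        # so users of CopulaLikeSynthesizer never need it installed.
        self._backend = optional_import(self.backend, type(self).__name__)
        self.n_rows = len(rows)
        return self
```

With this layout, importing the module and fitting `CopulaLikeSynthesizer` works without torch installed, and the missing-dependency error only appears if `NeuralLikeSynthesizer.fit` is called.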

Next Steps

I propose we turn this issue into a feature request for the ability to selectively install libraries for specific synthesizers only. To help us prioritize, I would love to hear more about what you are working on. Why are you running into the 5 GB limit?

Technical Details

I uninstalled torch and then tried to import GaussianCopulaSynthesizer.

%pip uninstall torch -y
from sdv.single_table import GaussianCopulaSynthesizer

This led to a ModuleNotFoundError, traceback found below.

stack_trace.txt

npatki commented 1 year ago

Changing this to a feature request and updating the title to clarify.