AryanSaeedi opened 10 months ago
Never mind, the process starts after a minute of showing those warnings. However, I can't seem to sample; I get an error. Could you please let me know what the problem could be?
If I set the timeout to False, it starts generating new samples, but it takes a lot of time. I want to generate around 41,000 instances, and it shows that it will take more than two days.
Hi,
I haven't worked on this for a couple of years, but I will try to help you.
Firstly, were you able to run the examples? If not, then there is already an issue with the setup of the library. If you could make them work, then the problem might come from the data and/or the definition of the DAG for the variables.
Secondly, I'd advise you to check how you're using the `sample` function. In the end, I think that the model is discarding too many samples during the sampling phase. You can set the parameter `randomize` to `False` to avoid discarding, but it will most likely give you bad results.
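If that is the bottleneck, here is a minimal sketch of the call, assuming a fitted model named `datgan` and that `randomize` is a keyword of `sample` as discussed above (the exact signature may differ between versions):

```python
# Hedged sketch: `datgan` is assumed to be an already-fitted DATGAN model.
# Passing randomize=False (the parameter discussed above) should keep all
# generated rows instead of discarding the ones the model rejects.
samples = datgan.sample(41000, randomize=False)
samples.to_csv('synthetic_samples.csv', index=False)
```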
Last piece of advice: I saw that you're training the model on your CPU for 100 epochs. I would strongly advise using a GPU to speed up the training process, and training for more than 100 epochs.
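For instance, a quick way to confirm that TensorFlow actually sees a GPU before launching a long run (TF 2.x, as used by the library):

```python
import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means training
# will silently fall back to the CPU.
print(tf.config.list_physical_devices('GPU'))
```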
Hi, thank you for the response. I didn't run the examples at first, but after your response I tried running them and got an error with the NumPy version while trying to sample. I tried downgrading NumPy and changing the `inv(yin)` bit of the `pynverse` package, but I couldn't solve the problem and rather created more. The first error I get is with the `OneHotEncoder` in `synthesizer.py`, line 113: `self.onehot = OneHotEncoder(categories=[np.array(self.varorder)], sparse=False)`. It says that `sparse=False` has been changed to `sparse_output` in newer versions of scikit-learn. After fixing this, I get another error when running the sample command of the example file; the issue is below. I am not really sure which NumPy version you were using at the time. I made some deductions based on when you created the repo and downgraded NumPy and other related libraries for compatibility, but my problems only got worse.
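For what it's worth, here is a version-tolerant sketch of that line, meant as a drop-in replacement inside the synthesizer class (`sparse` was renamed to `sparse_output` in scikit-learn 1.2 and later removed; `self.onehot` and `self.varorder` come from the library's own code):

```python
import numpy as np
import sklearn
from packaging import version
from sklearn.preprocessing import OneHotEncoder

# scikit-learn 1.2 renamed `sparse` to `sparse_output` (and 1.4 removed
# `sparse` entirely), so pick the keyword based on the installed version.
if version.parse(sklearn.__version__) >= version.parse("1.2"):
    kwargs = {"sparse_output": False}
else:
    kwargs = {"sparse": False}

self.onehot = OneHotEncoder(categories=[np.array(self.varorder)], **kwargs)
```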
To answer your questions: I didn't define the DAG, since as far as I can see I don't really have one. For the continuous variables, yes, I did define them correctly and compared the definition with the example. I am now working with a GPU, and the example file only takes 16 minutes to fit.
I haven't worked on this project for more than two years now, so it's a bit out of date. Sorry about that. Anyway, I managed to run the example using the following procedure:
- Installed the `datgan` and `jupyter` modules via pip
- Downgraded the `protobuf` package via `pip install protobuf==3.20.0`
- Deleted the file `encoded_data.pkl` in the folder `example/data/encoded_data`
After all these steps, I was able to run the full notebook `training.ipynb`. I haven't checked the quality of the results, since I ran it on my laptop and I don't have a GPU (I just trained 2 epochs for each model).
Here is the output of `pip list` in that environment:

```
Package                      Version
---------------------------- -------------------
absl-py                      2.1.0
anyio                        3.7.1
argon2-cffi                  23.1.0
argon2-cffi-bindings         21.2.0
astunparse                   1.6.3
attrs                        23.2.0
backcall                     0.2.0
beautifulsoup4               4.12.3
bleach                       6.0.0
cachetools                   5.3.2
certifi                      2023.11.17
cffi                         1.15.1
charset-normalizer           3.3.2
comm                         0.1.4
cycler                       0.11.0
datgan                       2.1.10
debugpy                      1.7.0
decorator                    5.1.1
defusedxml                   0.7.1
dill                         0.3.7
entrypoints                  0.4
exceptiongroup               1.2.0
fastjsonschema               2.19.1
flatbuffers                  23.5.26
fonttools                    4.38.0
gast                         0.5.4
google-auth                  2.27.0
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.60.0
h5py                         3.8.0
idna                         3.6
importlib-metadata           6.7.0
importlib-resources          5.12.0
ipykernel                    6.16.2
ipython                      7.33.0
ipython-genutils             0.2.0
ipywidgets                   8.1.1
jedi                         0.19.1
Jinja2                       3.1.3
joblib                       1.3.2
jsonschema                   4.17.3
jupyter                      1.0.0
jupyter_client               7.4.9
jupyter-console              6.6.3
jupyter_core                 4.12.0
jupyter-server               1.24.0
jupyterlab-pygments          0.2.2
jupyterlab-widgets           3.0.9
keras                        2.8.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.5
libclang                     16.0.6
lightgbm                     4.3.0
Markdown                     3.4.4
MarkupSafe                   2.1.4
matplotlib                   3.5.3
matplotlib-inline            0.1.6
mistune                      3.0.2
nbclassic                    1.0.0
nbclient                     0.7.4
nbconvert                    7.6.0
nbformat                     5.8.0
nest-asyncio                 1.6.0
networkx                     2.6.3
notebook                     6.5.6
notebook_shim                0.2.3
numpy                        1.21.6
oauthlib                     3.2.2
opt-einsum                   3.3.0
packaging                    23.2
pandas                       1.3.5
pandocfilters                1.5.1
parso                        0.8.3
pexpect                      4.9.0
pickleshare                  0.7.5
Pillow                       9.5.0
pip                          22.3.1
pkgutil_resolve_name         1.3.10
prometheus-client            0.17.1
prompt-toolkit               3.0.42
protobuf                     3.20.0
psutil                       5.9.8
ptyprocess                   0.7.0
pyasn1                       0.5.1
pyasn1-modules               0.3.0
pycparser                    2.21
Pygments                     2.17.2
pynverse                     0.1.4.6
pyparsing                    3.1.1
pyrsistent                   0.19.3
python-dateutil              2.8.2
pytz                         2023.4
pyzmq                        24.0.1
qtconsole                    5.4.4
QtPy                         2.4.1
requests                     2.31.0
requests-oauthlib            1.3.1
rsa                          4.9
scikit-learn                 1.0.2
scipy                        1.7.3
Send2Trash                   1.8.2
setuptools                   65.6.3
six                          1.16.0
sniffio                      1.3.0
soupsieve                    2.4.1
tensorboard                  2.8.0
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.8.0
tensorflow-io-gcs-filesystem 0.34.0
termcolor                    2.3.0
terminado                    0.17.1
tf-estimator-nightly         2.8.0.dev2021122109
threadpoolctl                3.1.0
tinycss2                     1.2.1
tornado                      6.2
tqdm                         4.66.1
traitlets                    5.9.0
typing_extensions            4.7.1
urllib3                      2.0.7
wcwidth                      0.1.9
webencodings                 0.5.1
websocket-client             1.6.1
Werkzeug                     2.2.3
wheel                        0.38.4
widgetsnbextension           4.0.9
wrapt                        1.16.0
zipp                         3.15.0
```
Thank you very much, the example is now working. However, when I try to run my own dataset I get an error. Do you think it has anything to do with not creating the DAG? I am not initializing one.
Thank you for the previous responses. :)
Update: I got it working; I think not initializing the DAG causes problems. The model now sort of starts training, but it doesn't show how much time it will take. I am using a 20 GB Nvidia GPU, and it consumes the whole thing. As for the DAG, I created it so that all other features depend on a single one, sort of like a hedgehog, where the hedgehog itself is one feature and the spikes are the rest of the features.
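For reference, a minimal sketch of that star-shaped DAG (DATGAN's examples build the DAG as a `networkx.DiGraph`; the column names here are hypothetical placeholders, not the actual ones I use):

```python
import networkx as nx

# Star-shaped ("hedgehog") DAG: every other column depends on one
# central feature. 'Protocol' is a placeholder choice of center;
# `columns` would hold the 80 feature names of the dataset.
central = 'Protocol'
columns = [central, 'Flow Duration', 'Tot Fwd Pkts']  # ...and so on

dag = nx.DiGraph()
dag.add_edges_from((central, col) for col in columns if col != central)
```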
I don't know how many variables you have in your dataset, but it seems a bit odd that it takes so much time for so few epochs. I think you should compare with the example to understand where the discrepancy comes from. It's difficult to tell what's going on since I don't know anything about your data.
Thank you very much for the response; I have been busy with my thesis. I somehow managed to get it to run, but I still stumbled upon the first issue I had.
Putting the `randomize` parameter to `False` leads to a `KeyError: 'index'`. I am not sure whether the `sample` method generates an index or whether the original data should already have one.
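One thing that might be worth ruling out (an assumption on my part, not something from the library's docs): a DataFrame whose index survived a CSV round-trip as a stray `index` column, or that has a non-default index, could plausibly trigger a `KeyError: 'index'`. Resetting it before fitting is cheap to try:

```python
import pandas as pd

# Hypothetical workaround: ensure the training data has a plain default
# RangeIndex and no leftover 'index' column from an earlier CSV export.
df = pd.read_csv('CSE-CIC-IDS2018_subset.csv')  # placeholder path
df = df.drop(columns=['index'], errors='ignore').reset_index(drop=True)
```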
I totally forgot to mention: I am using the CSE-CIC-IDS2018 dataset. It is a network traffic flow dataset with 80 features, 79 of them numeric and one discrete. The number of instances differs depending on the task, but in the run above it is around 41,000.
Hi, Thank you for sharing the repo. I am writing my master's thesis and I am trying to use your model to generate synthetic network traffic. However, I am facing a problem with the TensorFlow placeholder while trying to fit the model. The error message keeps repeating itself over and over again. Do you have any suggestions on how to maybe fix it? While it does seem like a warning, it keeps on repeating. Thank you, Aryan
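If the repeated placeholder messages turn out to be TF1-compat deprecation warnings rather than real errors (an assumption, since the exact message isn't quoted here), they can at least be silenced while debugging:

```python
import tensorflow as tf

# Raise TensorFlow's log threshold so repeated deprecation warnings
# (e.g. about tf.compat.v1.placeholder) stop flooding the output.
tf.get_logger().setLevel('ERROR')
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
```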