reproduce environment

pengzhangzhi commented 1 year ago

Hi. Do you have an env.yaml file we could use to reproduce your environment? I tried to install the required packages from the requirements.txt provided in this repo, but ran into many errors. A yaml file for creating a new conda env would be much more straightforward for those who want to run your code :) Any other alternative would also be great, as long as it makes reproducing the environment easier!

gnobitab commented 1 year ago

Could you tell me what the errors are? The environment should be the same as score_sde.

pengzhangzhi commented 1 year ago

I created a new conda env and used the requirements.txt to install the packages. The installation went well.

I followed your README and ran

    python ./main.py --config ./configs/rectified_flow/cifar10_rf_gaussian_ddpmpp.py --eval_folder eval --mode eval --workdir ./logs/1_rectified_flow --config.eval.enable_sampling --config.eval.batch_size 1024 --config.eval.num_samples 50000 --config.eval.begin_ckpt 2

and then got the following error:

    raise ImportError(
ImportError: This version of TensorFlow Probability requires TensorFlow version >= 2.11; Detected an installation of version 2.4.0. Please upgrade TensorFlow to proceed.

I then installed a newer version of TensorFlow with pip install tensorflow==2.11 and re-ran the command.

That produced another error:

File "/user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/numpy/__init__.py", line 320, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'typeDict'

Following https://stackoverflow.com/questions/74852225/attributeerror-module-numpy-has-no-attribute-typedict, I ran pip install numpy==1.21 to resolve this error.

Then I got another error:

  warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Traceback (most recent call last):
  File "/user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/user/taosheng/anaconda3/envs/flow/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

RuntimeError: Error building extension 'fused': [1/3] c++ -MMD -MF fused_bias_act.o.d -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/include -isystem /user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/include/TH -isystem /user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /user/taosheng/anaconda3/envs/flow/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/op/fused_bias_act.cpp -o fused_bias_act.o 
FAILED: fused_bias_act.o 

The above exception was the direct cause of the following exception:

It seems that compiling the torch extension is failing. This error might be caused by the torch or CUDA version, but since you have pinned the versions, I am hesitant to change them.

pengzhangzhi commented 1 year ago

Even though I can't run your code, I can use torch in the terminal. It's weird!


>>> torch.cuda.is_available()
True
>>> torch.tensor([1])
tensor([1])
>>>
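A possible explanation, offered as an assumption rather than a diagnosis: torch.cuda.is_available() only exercises the prebuilt PyTorch binaries, while the 'fused' op is compiled with ninja, c++ and nvcc when the code is first imported, so the local compiler versions matter as well. The sketch below prints the version line of each build tool as seen from the active environment. The helper name tool_version is hypothetical and is not part of this repo or of PyTorch.

```python
import shutil
import subprocess


def tool_version(tool, flag="--version"):
    """Return the first line of `tool --version`, or None if the tool is not on PATH."""
    path = shutil.which(tool)
    if path is None:
        return None
    proc = subprocess.run(
        [path, flag],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some tools print their version to stderr
        universal_newlines=True,
    )
    lines = proc.stdout.strip().splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    # These are the tools that appear in the build log above.
    for tool in ("ninja", "c++", "nvcc"):
        print(f"{tool:6s} -> {tool_version(tool) or 'NOT FOUND on PATH'}")
```

Comparing the reported c++ and nvcc versions with the CUDA version that the installed torch wheel was built for is a reasonable first step before changing any pinned package.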

gnobitab commented 1 year ago

Oh yes, that's the same issue I ran into when I tried the score_sde code... I managed to fix it myself, but I will upload my yaml to the repo for your convenience. Thank you for posting your issue.

gnobitab commented 1 year ago

Please check if the yaml works.

pengzhangzhi commented 1 year ago

Hi. I tried the yaml and it produces many package errors, for example:

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement python-graphviz==0.20.1 (from versions: none)
ERROR: No matching distribution found for python-graphviz==0.20.1

Some of the directly exported packages cannot be installed. Could you provide a minimal yml file that contains only what this repo needs?

gnobitab commented 1 year ago

I cannot maintain multiple conda environments because of space limits on my university server... Maybe refer to the tensorflow version, numpy version, etc. in the yml file and adjust your environment accordingly?

pengzhangzhi commented 1 year ago

OK. Thank you for the yaml file. I will contribute a clean env file once I resolve all the problems :)

pengzhangzhi commented 1 year ago

I ran into the same problems using the env.yaml file... Oh gosh!

gnobitab commented 1 year ago

Is it working now? FYI, my conda environment is: tensorflow==2.9.0, tensorflow-probability==0.12.2, torch==1.11.0, torchvision==0.8.0a0, numpy==1.21.6. Try to install torch first and check if cuda works with torch. Then install tensorflow. Then tensorflow-probability.
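To compare an existing environment against the versions listed above, something like the following sketch may help. The pins are copied from this comment. The helper name find_mismatches is hypothetical, and on Python 3.7 the importlib_metadata backport is assumed to be installed.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Versions quoted in the comment above.
PINS = {
    "tensorflow": "2.9.0",
    "tensorflow-probability": "0.12.2",
    "torch": "1.11.0",
    "torchvision": "0.8.0a0",
    "numpy": "1.21.6",
}


def find_mismatches(pins):
    """Return {package: (wanted, installed)} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Compare only the part before a local tag such as "+cu113".
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

An empty result means every listed package is installed at the listed version. It does not check that the CUDA build of torch matches the local toolkit.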

pengzhangzhi commented 1 year ago

Nope. Same problem with torch, even though I followed your guidance and installed the same version of torch.

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused': [1/3] c++ -MMD -MF fused_bias_act.o.d -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/include -isystem /user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/include/TH -isystem /user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /user/taosheng/anaconda3/envs/sde/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /user

However, I can use torch in the terminal. It only fails when I run your training command, which produces the same error as before.

>>> import torch
>>> torch.cuda.is_available()
True
>>>

pengzhangzhi commented 1 year ago

I tried to fix the torch version problem. That seems to work, but now I get another error.

Traceback (most recent call last):
  File "./main.py", line 18, in <module>
    import run_lib
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/run_lib.py", line 29, in <module>
    from models import ddpm, ncsnv2, ncsnpp
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/ncsnpp.py", line 18, in <module>
    from . import utils, layers, layerspp, normalization
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/layerspp.py", line 20, in <module>
    from . import up_or_down_sampling
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/up_or_down_sampling.py", line 10, in <module>
    from op import upfirdn2d
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/op/__init__.py", line 1, in <module>
    from .fused_act import FusedLeakyReLU, fused_leaky_relu
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/op/fused_act.py", line 15, in <module>
    os.path.join(module_path, "fused_bias_act_kernel.cu"),
  File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1156, in load
    keep_intermediates=keep_intermediates)
  File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1382, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1775, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
ImportError: /user/taosheng/.cache/torch_extensions/py37_cu102/fused/fused.so: cannot open shared object file: No such file or directory

I notice the repo has some custom ops and they are not compiled. Maybe I need to compile them to move on?
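A note on the ImportError above: the traceback shows that the 'fused' op is built on import and cached under ~/.cache/torch_extensions. A build directory left behind by an earlier failed build is one possible cause of a missing fused.so (an assumption, not something confirmed in this thread), and deleting it forces a fresh build on the next run. A minimal sketch: clear_extension_cache is a hypothetical helper, and TORCH_EXTENSIONS_DIR is the environment variable PyTorch reads to override the cache root.

```python
import os
import shutil


def clear_extension_cache(name, cache_root=None):
    """Delete every cached build directory called `name` under the cache root.

    Returns the list of directories that were removed.
    """
    if cache_root is None:
        cache_root = os.environ.get(
            "TORCH_EXTENSIONS_DIR",
            os.path.join(os.path.expanduser("~"), ".cache", "torch_extensions"),
        )
    removed = []
    if not os.path.isdir(cache_root):
        return removed
    for dirpath, dirnames, _ in os.walk(cache_root):
        if name in dirnames:
            target = os.path.join(dirpath, name)
            shutil.rmtree(target)
            dirnames.remove(name)  # do not descend into the deleted directory
            removed.append(target)
    return removed


if __name__ == "__main__":
    # "fused" is the extension name shown in the traceback.
    for path in clear_extension_cache("fused"):
        print("removed", path)
```

If the build then fails again, the full compiler output printed before the RuntimeError is usually more informative than the final ImportError.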

gnobitab commented 1 year ago

Yes, they should be compiled. Maybe delete the __pycache__ folders in ./op and ./models first? I forgot to remove these leftover folders. Then you can try to compile again?
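For reference, the cleanup can be scripted roughly as follows. This is only a sketch: remove_pycache is a hypothetical helper, and the folder names are taken from the traceback earlier in this thread.

```python
import shutil
from pathlib import Path


def remove_pycache(*roots):
    """Delete every __pycache__ directory found under the given root folders."""
    removed = []
    for root in roots:
        # Collect the matches first so that deleting does not disturb the walk.
        for cache_dir in sorted(Path(root).rglob("__pycache__")):
            if cache_dir.is_dir():
                shutil.rmtree(cache_dir)
                removed.append(cache_dir)
    return removed


if __name__ == "__main__":
    # Run from the ImageGeneration folder of the repo.
    for path in remove_pycache("./op", "./models"):
        print("removed", path)
```

Python recreates these folders on the next import, so deleting them is safe.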

pengzhangzhi commented 1 year ago

How do I compile them?

gnobitab commented 1 year ago

If you run the code, they should be compiled automatically.

pengzhangzhi commented 1 year ago

Same problem.

Traceback (most recent call last):
  File "./main.py", line 18, in <module>
    import run_lib
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/run_lib.py", line 29, in <module>
    from models import ddpm, ncsnv2, ncsnpp
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/ncsnpp.py", line 18, in <module>
    from . import utils, layers, layerspp, normalization
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/layerspp.py", line 20, in <module>
    from . import up_or_down_sampling
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/up_or_down_sampling.py", line 10, in <module>
    from op import upfirdn2d
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/op/__init__.py", line 1, in <module>
    from .fused_act import FusedLeakyReLU, fused_leaky_relu
  File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/op/fused_act.py", line 15, in <module>
    os.path.join(module_path, "fused_bias_act_kernel.cu"),
  File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1156, in load
    keep_intermediates=keep_intermediates)
  File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1382, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1775, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
ImportError: /user/taosheng/.cache/torch_extensions/py37_cu102/fused/fused.so: cannot open shared object file: No such file or directory

gnobitab commented 1 year ago

Could you try deleting the current repo and removing your current conda environment, then re-cloning the repo to your server and setting up the conda environment again? Also, there are related issues elsewhere on the Internet, e.g., https://github.com/rosinality/stylegan2-pytorch/issues/5

pengzhangzhi commented 1 year ago

OK, but why would that help? I tried it and it didn't work. The linked issue also has no good solution.

pengzhangzhi commented 1 year ago

I don't see any code for compiling these ops. Usually that lives in setup.py.

gnobitab commented 1 year ago

I mean, maybe you can search for solutions online, because I did not run into this problem... I saw people saying that it is related to the CUDA version? This op folder is inherited from https://github.com/yang-song/score_sde_pytorch, which has no setup.py either. When I set up that environment, I did have some problems with the tensorflow version, but I did not have compilation issues...

pengzhangzhi commented 1 year ago

Thanks. I give up. Let me know if someone else can successfully reproduce your code.

gnobitab commented 1 year ago

We have updated the dependencies section and tested it. Could you remove all the previous caches, re-clone the repo, and try again? Thanks!

pengzhangzhi commented 1 year ago

Thank you! It works! I now have a problem with tensorflow. I tried to run your evaluation, but because of a network problem I can't download the Inception module.

I0214 02:48:18.012538 140223850459840 resolver.py:416] Downloading TF-Hub Module 'https://tfhub.dev/tensorflow/tfgan/eval/inception/1'.
Traceback (most recent call last):
  File "/opt/anaconda3/envs/sde/lib/python3.7/urllib/request.py", line 1350, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1443, in connect
    super().connect()
  File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 948, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/opt/anaconda3/envs/sde/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/opt/anaconda3/envs/sde/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

I manually downloaded the archive, extracted it, and saved it to /tmp/tfhub_modules/:

ls /tmp/tfhub_modules/
saved_model.pb  tfgan_eval_inception_1.tar.gz  variables

However, I still have this problem...
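One possible reason the manual copy is not picked up, stated as an assumption about the tensorflow_hub cache layout rather than something confirmed in this thread: the library looks for each module in a subdirectory of the cache named after the SHA-1 hash of the module URL, not in the cache root itself. The sketch below computes that expected directory. The helper name expected_module_dir is hypothetical, and the cache root is assumed to be the tfhub_modules folder in the system temp directory unless TFHUB_CACHE_DIR is set.

```python
import hashlib
import os
import tempfile

# URL taken from the log line above.
MODULE_URL = "https://tfhub.dev/tensorflow/tfgan/eval/inception/1"


def expected_module_dir(handle, cache_root=None):
    """Return the cache subdirectory in which a module with this URL is expected."""
    if cache_root is None:
        cache_root = os.environ.get(
            "TFHUB_CACHE_DIR", os.path.join(tempfile.gettempdir(), "tfhub_modules")
        )
    digest = hashlib.sha1(handle.encode("utf8")).hexdigest()
    return os.path.join(cache_root, digest)


if __name__ == "__main__":
    # Extract saved_model.pb and variables/ into this directory, not the cache root.
    print(expected_module_dir(MODULE_URL))
```

If that is the cause, moving saved_model.pb and the variables folder into the printed directory should let the evaluation find the module without a network connection.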

gnobitab commented 1 year ago

This is very tricky.... The only solution I know is to use a server outside the Great Firewall...

gnobitab / RectifiedFlow

reproduce environment #1