Closed pengzhangzhi closed 1 year ago
Could you tell me what the bugs are? The environment should be the same as for score_sde.
I created a new conda env and installed the packages from requirements.txt. The installation went well.
Following your README, I ran python ./main.py --config ./configs/rectified_flow/cifar10_rf_gaussian_ddpmpp.py --eval_folder eval --mode eval --workdir ./logs/1_rectified_flow --config.eval.enable_sampling --config.eval.batch_size 1024 --config.eval.num_samples 50000 --config.eval.begin_ckpt 2
and then got the following error:
raise ImportError(
ImportError: This version of TensorFlow Probability requires TensorFlow version >= 2.11; Detected an installation of version 2.4.0. Please upgrade TensorFlow to proceed.
I installed a newer version of TensorFlow with pip install tensorflow==2.11
and re-ran the command.
That produced another error:
File "/user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/numpy/__init__.py", line 320, in __getattr__
raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'typeDict'
Following https://stackoverflow.com/questions/74852225/attributeerror-module-numpy-has-no-attribute-typedict, I ran pip install numpy==1.21
to resolve this error.
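Both errors so far are version mismatches (TensorFlow Probability asking for TensorFlow >= 2.11, and a NumPy release that no longer has `typeDict`), so it can help to print what is actually installed before launching main.py. A minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the helper names and the package list are made up for illustration and only cover the packages mentioned in this thread:

```python
import re
from importlib import metadata

# Packages named in this thread; adjust to your own environment.
PACKAGES = ["tensorflow", "tensorflow-probability", "torch", "numpy"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Keep only the leading integer of each dotted part, e.g. '0.8.0a0' -> (0, 8, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Rough numeric comparison; pre-release suffixes are ignored (a simplification)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report(packages=PACKAGES):
    """Print one line per package with its installed version."""
    for name in packages:
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Calling `report()` lists the installed versions, and `meets_minimum("2.4.0", "2.11")` returns False, which is the situation the first ImportError complains about.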
That led to yet another error:
warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Traceback (most recent call last):
File "/user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/user/taosheng/anaconda3/envs/flow/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
RuntimeError: Error building extension 'fused': [1/3] c++ -MMD -MF fused_bias_act.o.d -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/include -isystem /user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/include/TH -isystem /user/taosheng/anaconda3/envs/flow/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /user/taosheng/anaconda3/envs/flow/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/op/fused_bias_act.cpp -o fused_bias_act.o
FAILED: fused_bias_act.o
The above exception was the direct cause of the following exception:
It seems the torch extension compilation is failing. The error might be caused by the torch or CUDA version, but since you have pinned the versions, I am reluctant to change them.
Even though I can't run your code, I can use torch in the terminal. It's weird!
>>> torch.cuda.is_available()
True
>>> torch.tensor([1])
tensor([1])
>>>
Oh yes, that's the same issue I ran into when I tried the score_sde code... I managed to fix it myself, but I will upload my yaml to the repo for your convenience. Thank you for posting your issue.
Please check if the yaml works.
Hi. I tried the yaml and it produces many package errors, for example:
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement python-graphviz==0.20.1 (from versions: none)
ERROR: No matching distribution found for python-graphviz==0.20.1
The directly exported packages somehow cannot be installed. Could you provide a minimal yml file that covers only what this repo needs?
I cannot maintain multiple conda environments because of space limits on my university server... Maybe you could refer to the tensorflow, numpy, etc. versions in the yml file and adjust your environment accordingly?
OK. Thank you for the yaml file. I will contribute a clean, minimal env file once I resolve all the problems :)
I ran into the same problems using the env.yaml file... Oh gosh!
Is it working now? FYI, my conda environment is: tensorflow==2.9.0, tensorflow-probability==0.12.2, torch==1.11.0, torchvision==0.8.0a0, numpy==1.21.6. Try installing torch first and check that CUDA works with torch. Then install tensorflow, and then tensorflow-probability.
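The install order suggested above can be mirrored as an import check that stops at the first package that fails to load, which shows which step of the setup to revisit. This is only a sketch: `first_failing_import` and `main` are hypothetical helper names, and the module list simply follows the order given in this thread.

```python
import importlib

# Order taken from the advice above: torch, then tensorflow, then tensorflow-probability.
IMPORT_ORDER = ["torch", "tensorflow", "tensorflow_probability"]


def first_failing_import(module_names):
    """Import modules in order; return (name, error text) for the first failure, else None."""
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # ImportError is typical, but broken installs can raise others
            return name, f"{type(exc).__name__}: {exc}"
    return None


def main():
    failure = first_failing_import(IMPORT_ORDER)
    if failure is not None:
        print("first failing import:", failure[0])
        print(failure[1])
        return
    import torch  # safe here: the import above already succeeded

    print("all imports succeeded; CUDA visible to torch:", torch.cuda.is_available())
```

Run `main()` inside the conda environment after each install step. Note that a passing check only says the packages import; it does not prove the custom ops will compile.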
Nope, same problem with torch, even though I followed your guidance and installed the same version of torch.
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused': [1/3] c++ -MMD -MF fused_bias_act.o.d -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/include -isystem /user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/include/TH -isystem /user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /user/taosheng/anaconda3/envs/sde/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /user
However, I can use torch in the terminal. It only fails, with the same error as before, when I run your training command.
>>> import torch
>>> torch.cuda.is_available()
True
>>>
I tried fixing the torch version problem. That seems to work, but now I get another error.
Traceback (most recent call last):
File "./main.py", line 18, in <module>
import run_lib
File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/run_lib.py", line 29, in <module>
from models import ddpm, ncsnv2, ncsnpp
File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/ncsnpp.py", line 18, in <module>
from . import utils, layers, layerspp, normalization
File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/layerspp.py", line 20, in <module>
from . import up_or_down_sampling
File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/models/up_or_down_sampling.py", line 10, in <module>
from op import upfirdn2d
File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/op/__init__.py", line 1, in <module>
from .fused_act import FusedLeakyReLU, fused_leaky_relu
File "/user/taosheng/pzz/github/RectifiedFlow/ImageGeneration/op/fused_act.py", line 15, in <module>
os.path.join(module_path, "fused_bias_act_kernel.cu"),
File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1156, in load
keep_intermediates=keep_intermediates)
File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1382, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/user/taosheng/anaconda3/envs/sde/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1775, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
ImportError: /user/taosheng/.cache/torch_extensions/py37_cu102/fused/fused.so: cannot open shared object file: No such file or directory
I notice the repo has some custom ops that are not compiled.
Maybe I need to compile them before I can move on?
Yes, they should be compiled. Maybe delete the __pycache__ folders in ./op and ./models first? I forgot to remove these leftover folders. Then you can try to compile again.
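The cleanup suggested above can be scripted. The sketch below removes `__pycache__` folders under a given directory and locates PyTorch's JIT build cache so that a stale or half-built `fused` extension can be deleted too. The helper names are hypothetical, and the cache location is an assumption based on the `TORCH_EXTENSIONS_DIR` environment variable and the `~/.cache/torch_extensions` path that appears in the traceback; verify it on your machine before deleting anything.

```python
import os
import shutil
from pathlib import Path


def remove_pycache(root):
    """Delete every __pycache__ directory under root and return the removed paths."""
    removed = []
    # sorted() materializes the matches first, so deleting while looping is safe.
    for path in sorted(Path(root).rglob("__pycache__")):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed


def torch_extensions_root():
    """Best-guess location of PyTorch's JIT build cache (an assumption; check locally)."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"
```

For example, call `remove_pycache("./op")` and `remove_pycache("./models")`, then remove the `fused` build folder under `torch_extensions_root()` (in the traceback it sits under `py37_cu102/fused`) so that the next run rebuilds it from scratch.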
How do I compile them?
If you run the code, they should be compiled automatically.
Same problem.
Traceback (most recent call last):
File "./main.py", line 18, in <module>
Could you try deleting the current repo and your current conda environment, then re-cloning the repo on your server and setting up the conda environment again? Also, there are related issues elsewhere online, e.g., https://github.com/rosinality/stylegan2-pytorch/issues/5
OK, but why would that help? I tried it and it doesn't work. The linked issue has no good solution either.
I don't see any code for compiling these ops. Usually that lives in setup.py.
I mean, maybe you can search for solutions online, because I did not run into this problem... I saw people saying that it is related to the CUDA version. This op folder is inherited from https://github.com/yang-song/score_sde_pytorch, which has no setup.py either. When I set up that environment, I did have some problems with the tensorflow version, but I did not have compilation issues...
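Since there is no setup.py and the ops are built on the fly, the build depends on tools found on PATH at run time. The build log earlier in this thread shows `ninja -v` and `c++` being invoked; `nvcc` is assumed here to be needed for the `.cu` kernel. The sketch below only checks that these tools can be found. It is a first sanity check, not a diagnosis, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# ninja and c++ appear in the build log above; nvcc is an assumption for the .cu kernel.
REQUIRED_TOOLS = ["ninja", "c++", "nvcc"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means the tools are at least present. A mismatch between the compiler, the CUDA toolkit, and the installed torch build can still break the compilation, as suggested above.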
Thanks. I give up. Let me know if someone else can successfully reproduce your code.
We have updated the dependencies section and tested it. Could you remove all the previous caches, re-clone the repo, and try again? Thanks!
Thank you! It works! I now have a problem with TensorFlow. I tried to run your evaluation, but due to a network problem I can't download the Inception model.
I0214 02:48:18.012538 140223850459840 resolver.py:416] Downloading TF-Hub Module 'https://tfhub.dev/tensorflow/tfgan/eval/inception/1'.
Traceback (most recent call last):
File "/opt/anaconda3/envs/sde/lib/python3.7/urllib/request.py", line 1350, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 1443, in connect
super().connect()
File "/opt/anaconda3/envs/sde/lib/python3.7/http/client.py", line 948, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/opt/anaconda3/envs/sde/lib/python3.7/socket.py", line 728, in create_connection
raise err
File "/opt/anaconda3/envs/sde/lib/python3.7/socket.py", line 716, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
I manually downloaded it, extracted it, and saved it to /tmp/tfhub_modules/:
ls /tmp/tfhub_modules/
saved_model.pb tfgan_eval_inception_1.tar.gz variables
However, I still have this problem...
This is very tricky.... The only solution I know is to use a server outside the Great Firewall...
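One possible reason the manual download is not picked up: as far as I know, tensorflow_hub does not read module files directly from the cache root. It looks for a subdirectory named after the SHA-1 hash of the handle string, and the cache root can be overridden with the `TFHUB_CACHE_DIR` environment variable. Treat this as an assumption about tensorflow_hub internals and verify it against the installed version. The sketch below computes the directory where the extracted files would be expected; the helper names are hypothetical, and the handle is the URL from the log above.

```python
import hashlib
import os
import tempfile

# Handle exactly as printed in the log above.
HANDLE = "https://tfhub.dev/tensorflow/tfgan/eval/inception/1"


def expected_module_dir(handle, cache_root=None):
    """Directory where the extracted module is assumed to be looked up."""
    if cache_root is None:
        cache_root = os.environ.get("TFHUB_CACHE_DIR") or os.path.join(
            tempfile.gettempdir(), "tfhub_modules"
        )
    digest = hashlib.sha1(handle.encode("utf8")).hexdigest()
    return os.path.join(cache_root, digest)


def looks_extracted(module_dir):
    """True if the directory contains a saved_model.pb file, as in the listing above."""
    return os.path.isfile(os.path.join(module_dir, "saved_model.pb"))
```

Under this assumption, `saved_model.pb` and `variables` would go inside `expected_module_dir(HANDLE)` rather than directly in /tmp/tfhub_modules/.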
Hi. Do you have an env.yaml file we could use to reproduce your environment? I tried installing the required packages from the requirements.txt provided in this repo, but ran into many errors. A yaml file for creating a new conda env would be much more straightforward for those who want to run your code :) Any other alternative would be great too, as long as it makes reproducing the environment easier!