simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

Run: python3 train_retrieval.py -c config/retrieval/paper2020/anet_coot.yaml #14

Closed haoshuai714 closed 3 years ago

haoshuai714 commented 3 years ago

Run: python3 train_retrieval.py -c config/retrieval/paper2020/anet_coot.yaml

Error:

Traceback (most recent call last):
  File "train_retrieval.py", line 6, in <module>
    from coot.configs_retrieval import ExperimentTypesConst, RetrievalConfig as Config
  File "/data2/haoxiaoshuai/new_coot/coot/configs_retrieval.py", line 11, in <module>
    from nntrainer import data as nn_data, lr_scheduler, models, optimization, trainer_configs, typext, utils
  File "/data2/haoxiaoshuai/new_coot/nntrainer/models/__init__.py", line 4, in <module>
    from nntrainer.initialization import init_network, init_weight_
  File "/data2/haoxiaoshuai/new_coot/nntrainer/initialization.py", line 7, in <module>
    from nntrainer import utils_torch, typext, utils
  File "<fstring>", line 1
    (cudnn.benchmark=)
                     ^
SyntaxError: invalid syntax

simon-ging commented 3 years ago

Hello,

I cannot reproduce your problem, and since you did not fill out the issue template I have no idea about your setup (OS, Python version, etc.).

I suggest: install miniconda, set up a conda environment with python=3.8, install PyTorch, install the requirements, then try again.
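A hedged sketch of those steps (the environment name is illustrative, and the PyTorch install command depends on your CUDA version; pick the right one from pytorch.org):

```shell
# Illustrative setup, assuming miniconda is already installed and on PATH.
conda create -n coot python=3.8 -y
conda activate coot
# Install PyTorch (choose the command matching your CUDA version on pytorch.org).
pip install torch
# Install the repository's requirements from the repo root.
pip install -r requirements.txt
```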

simon-ging commented 3 years ago

Going to close this for now since it's most likely not an issue on our end.

carlamao commented 3 years ago

Hi, I also have this problem, and I have miniconda and a conda env with Python 3.8.

simon-ging commented 3 years ago

Hi,

Please run python -V to make sure your env is active.

Fill out all fields in the following issue template and I will take a look at your problem:


Describe the bug
A clear and concise description of what the bug is. (Include the full exception stack.)

To Reproduce
Steps to reproduce the behavior.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

System Info:

OS: [e.g. Ubuntu 18.04]
Python version: [e.g. 3.8.6]
PyTorch version: [e.g. 1.7.0+cu11]

Additional context
Add any other context about the problem here.

carlamao commented 3 years ago

Describe the bug

The line python3 train_retrieval.py -c config/retrieval/paper2020/anet_coot.yaml fails with the following error:

Traceback (most recent call last):
  File "train_retrieval.py", line 6, in <module>
    from coot.configs_retrieval import ExperimentTypesConst, RetrievalConfig as Config
  File "/home/MSAI/carlairi001/carla/coot-videotext/coot/configs_retrieval.py", line 11, in <module>
    from nntrainer import data as nn_data, lr_scheduler, models, optimization, trainer_configs, typext, utils
  File "/home/MSAI/carlairi001/carla/coot-videotext/nntrainer/models/__init__.py", line 4, in <module>
    from nntrainer.initialization import init_network, init_weight_
  File "/home/MSAI/carlairi001/carla/coot-videotext/nntrainer/initialization.py", line 7, in <module>
    from nntrainer import utils_torch, typext, utils
  File "<fstring>", line 1
    (cudnn.benchmark=)
                     ^
SyntaxError: invalid syntax

To Reproduce

I connect to the GPU. I use a miniconda3 environment with Python 3.8.5 and have PyTorch and all required libraries installed, as well as the requirements. I go to the coot-videotext directory and then use a SLURM job script to run the following command:

python3 train_retrieval.py -c config/retrieval/paper2020/anet_coot.yaml

Expected behavior

The program starts the training of the model

Screenshots

This is the exception stack:

[Screenshot 2021-02-17 at 11:08:57]

This is my Python version:

[Screenshot 2021-02-17 at 11:10:13]

System Info:

OS: CentOS Linux
Python version: 3.8.5
PyTorch version: 1.7.1

simon-ging commented 3 years ago

Hi, it looks like you are accidentally using an old Python version inside your job that doesn't understand f-strings.
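For context, a small sketch of why this fails (the variable name is illustrative): the "=" debug specifier inside f-strings, which produces output like (cudnn.benchmark=...), was only added in Python 3.8, so older interpreters raise exactly this SyntaxError the moment the module is imported:

```python
# Sketch: the "=" specifier in f-strings, e.g. f"{expr=}", is new in Python 3.8.
# On older interpreters, merely importing a module that contains one raises
# "SyntaxError: invalid syntax", pointing at "(expr=)".
import sys

snippet = 'flag = True\nprint(f"{flag=}")'
try:
    compile(snippet, "<demo>", "exec")
    print("f-string '=' specifier supported on Python", sys.version.split()[0])
except SyntaxError:
    print("SyntaxError: this interpreter is older than Python 3.8")
```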

Your slurm job runs "python3..." while your other command is "python -V"... without the 3.

You probably have to set up your miniconda environment inside the slurm job.

To test this, add "python -V" and "python3 -V" to your slurm job script, run it, and check the logs; you should see the wrong version.

Adding something like "conda activate base" to your jobscript may solve the problem.
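A hedged sketch of what such a job script could look like (the SBATCH directives, environment name, and conda.sh path are placeholders; adjust them to your cluster and miniconda install location):

```shell
#!/bin/bash
#SBATCH --job-name=coot_retrieval   # illustrative SLURM header, adapt to your cluster
#SBATCH --gres=gpu:1

# Batch jobs do not inherit your interactive shell setup, so make conda
# available and activate the environment explicitly inside the job.
source ~/miniconda3/etc/profile.d/conda.sh
conda activate base                  # or the env where you installed the requirements

# Sanity checks: both should report the same Python >= 3.8 in the job log.
python -V
python3 -V

python3 train_retrieval.py -c config/retrieval/paper2020/anet_coot.yaml
```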

carlamao commented 3 years ago

Hi, thank you, it was indeed an environment problem in job.sh.

However, I am encountering this issue when trying to run the following command on CPU:

python train_retrieval.py -c config/retrieval/paper2020/anet_coot.yaml --load_model provided_models/anet_coot.pth --validate

Traceback (most recent call last):
  File "train_retrieval.py", line 95, in <module>
    main()
  File "train_retrieval.py", line 76, in main
    trainer = Trainer(
  File "/Users/carlairismao/Desktop/coot-videotext/coot/trainer_retrieval.py", line 119, in __init__
    self.hook_post_init()
  File "/Users/carlairismao/Desktop/coot-videotext/nntrainer/trainer_base.py", line 368, in hook_post_init
    model_state = th.load(str(self.load_model))
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 774, in _legacy_load
    result = unpickler.load()
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 730, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

simon-ging commented 3 years ago

Hi,

CPU-only is untested. Try the following:

Add --no_cuda

If that doesn't work, go to nntrainer/trainer_base.py, function hook_post_init, and change the line model_state = th.load(str(self.load_model)) to model_state = th.load(str(self.load_model), map_location=th.device('cpu')) (torch is imported as th in that file), as the error message suggests.
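As a sketch, the CPU-safe load can be wrapped in a small helper (the function name is illustrative, not from the repo):

```python
import torch

def load_checkpoint_cpu_safe(path: str):
    """Load a checkpoint that may have been saved on a CUDA device onto the CPU.

    map_location remaps every storage in the pickled checkpoint to the CPU,
    which is what allows a GPU-trained checkpoint to load on a CPU-only machine.
    """
    return torch.load(path, map_location=torch.device("cpu"))
```

torch.load's map_location also accepts a plain string like "cpu" or a dict remapping specific devices, so the same idea works for moving checkpoints between GPUs.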