Open selfcontrol7 opened 3 years ago
Judging by the error message, it seems that fp16 is on but the GPU is not in use.
Hi, thank you for your reply.
Here is the code I am running after installing all the requirements.
!CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
--task Structured/Beer \
--batch_size 64 \
--max_len 64 \
--lr 3e-5 \
--n_epochs 40 \
--finetuning \
--lm distilbert \
--fp16 \
--da del \
--dk product \
--summarize
I am running the notebook on Colab. The runtime type is set to None, so I don't think a GPU is being used. Does that mean I must switch to the GPU runtime instead?
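One way to confirm whether the runtime actually exposes a GPU before passing --fp16 is a quick check in a notebook cell. The sketch below is only illustrative: it relies on PyTorch's `torch.cuda.is_available()`, and the helper names `gpu_available` and `training_flags` are hypothetical and not part of the Ditto code.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also runs where torch is absent
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def training_flags(want_fp16: bool = True) -> list:
    """Hypothetical helper: keep --fp16 only when a GPU is visible."""
    flags = ["--finetuning"]
    if want_fp16 and gpu_available():
        flags.append("--fp16")
    return flags


if __name__ == "__main__":
    print("CUDA available:", gpu_available())
    print("Suggested flags:", " ".join(training_flags()))
```

If this reports that CUDA is not available on Colab, changing the runtime type to GPU and re-running the cell should change the result.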
I just tried the code again with the GPU runtime and added the following two lines before the training part, which solved the issue:
import nltk
nltk.download('stopwords')
Thank you @oi02lyl .
If needed I can share the
Now that the training part is done, I tried to run the matching code as follows:
!CUDA_VISIBLE_DEVICES=0 python matcher.py \
--task wdc_all_small \
--input_path input/input_small.jsonl \
--output_path output/output_small.jsonl \
--lm distilbert \
--use_gpu \
--fp16 \
--checkpoint_path checkpoints/
but it seems that the model cannot be found:
2020-11-03 05:51:50.890227: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
File "matcher.py", line 212, in <module>
hp.lm, hp.use_gpu, hp.fp16)
File "matcher.py", line 170, in load_model
raise ModelNotFoundError(checkpoint)
ditto.exceptions.ModelNotFoundError: Model checkpoints/wdc_all_small.pt was not found
I also tried the notebook mentioned in #9, but the same error appears.
Please, any help?
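A possible way to narrow this down is to check which checkpoint files actually exist before calling matcher.py. Note that the training command earlier in this thread used --task Structured/Beer, while the matcher is called with --task wdc_all_small, so a checkpoint named after the latter may simply never have been written. The sketch below assumes the `<checkpoint_path>/<task>.pt` layout inferred from the error message; the function names are hypothetical.

```python
import os


def expected_checkpoint(checkpoint_path: str, task: str) -> str:
    """Path layout inferred from the error message: <checkpoint_path>/<task>.pt"""
    return os.path.join(checkpoint_path, task + ".pt")


def available_checkpoints(checkpoint_path: str) -> list:
    """List .pt files under checkpoint_path (relative paths, subfolders included)."""
    found = []
    for root, _dirs, files in os.walk(checkpoint_path):
        for name in files:
            if name.endswith(".pt"):
                full = os.path.join(root, name)
                found.append(os.path.relpath(full, checkpoint_path))
    return sorted(found)


if __name__ == "__main__":
    path = expected_checkpoint("checkpoints/", "wdc_all_small")
    if os.path.isfile(path):
        print("Found", path)
    else:
        print("Missing", path)
        print("Available:", available_checkpoints("checkpoints/"))
```

If the listing shows a checkpoint for a different task, either train on the task passed to the matcher or pass the task that was actually trained.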
You can try the script at the bottom of the updated notebook: https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH
Hi, thank you again for your reply. I will try the updated notebook and come back to you ASAP.
Thanks.
Hello,
I tried the given updated notebook but the warning below is shown when running the matcher. Please, can you guide me in solving it?
Thank you.
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
0it [00:00, ?it/s]/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py:25: UserWarning: An input tensor was not cuda.
warnings.warn("An input tensor was not cuda.")
4398it [00:07, 573.10it/s]
I see. This is because we install only the Python-only build of apex, without the CUDA and C++ extensions. More details here: https://github.com/NVIDIA/apex#linux. I think the warning is safe to ignore in this case. You can also install the version with the CUDA and C++ extensions by following their instructions.
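To see which apex build is present, one option is to test whether the `amp_C` extension module named in the warning can be located. This is a minimal sketch using only the standard library; `has_module` is a hypothetical helper, and treating the presence of `amp_C` as a sign of the extension build is an assumption based on the warning text.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module with this name can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if not has_module("apex"):
        print("apex is not installed")
    elif has_module("amp_C"):
        print("apex CUDA/C++ extensions found (amp_C can be located)")
    else:
        print("apex is Python-only; fused kernels fall back to Python")
```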
@oi02lyl I am having a similar issue with the checkpoint not being found. I tried the link you posted, but it says I do not have the correct credentials to see the file. Would you mind making the notebook publicly viewable? Thanks.
This link doesn't work for me! I get this: "There was an error loading this notebook. Ensure that the file is accessible and try again. Invalid Credentials" (https://drive.google.com/drive/?action=locate&id=1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2&authuser=3). Please help me.
Hello,
I am trying to run the training code, but I run into this error:
Please, can you guide me to solve this issue?