Code implementation - Training

selfcontrol7 commented 3 years ago

Hello,

I am trying to run the training code but I get this error:

2020-11-02 07:36:08.658676: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 232k/232k [00:00<00:00, 318kB/s]
Downloading: 100% 442/442 [00:00<00:00, 284kB/s]
Downloading: 100% 268M/268M [00:04<00:00, 62.7MB/s]
Traceback (most recent call last):
  File "train_ditto.py", line 103, in <module>
    run_tag)
  File "Snippext_public/snippext/mixda.py", line 253, in initialize_and_train
    alpha_aug=hp.alpha_aug)
  File "Snippext_public/snippext/mixda.py", line 152, in train
    with amp.scale_loss(loss, optimizer) as scaled_loss:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/handle.py", line 82, in scale_loss
    raise RuntimeError("Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized.  "
RuntimeError: Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized.  model, optimizer = amp.initialize(model, optimizer, opt_level=...) must be called before `with amp.scale_loss`.

Please, can you guide me to solve this issue?

oi02lyl commented 3 years ago

Can you share the command that you used to run the code?
Are you using GPU or CPU?

Judging by the error message, it seems to me that fp16 is on but the GPU is not in use.
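A minimal sketch of the check described above, assuming the fix is simply to turn fp16 off when no CUDA device is visible. The helper names resolve_fp16 and cuda_is_available are hypothetical and are not part of the ditto or Snippext code:

```python
def cuda_is_available() -> bool:
    """Report whether PyTorch can see a CUDA device (False if torch is missing)."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def resolve_fp16(requested_fp16: bool, cuda_available: bool) -> bool:
    """Keep fp16 only when it was requested AND a GPU is present.

    Hypothetical helper: apex's amp.scale_loss needs amp.initialize to have
    run first, so on a CPU-only runtime it is safer to disable fp16 up front.
    """
    if requested_fp16 and not cuda_available:
        print("fp16 was requested but no GPU was detected; falling back to fp32")
        return False
    return requested_fp16


if __name__ == "__main__":
    use_fp16 = resolve_fp16(requested_fp16=True, cuda_available=cuda_is_available())
    print("fp16 enabled:", use_fp16)
```

On a Colab runtime with the hardware accelerator set to None, this would report that fp16 is disabled, which matches the diagnosis above.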

selfcontrol7 commented 3 years ago

Hi, thank you for your reply.

Here is the code I am running after installing all the requirements.

!CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
--task Structured/Beer \
--batch_size 64 \
--max_len 64 \
--lr 3e-5 \
--n_epochs 40 \
--finetuning \
--lm distilbert \
--fp16 \
--da del \
--dk product \
--summarize

I am running the notebook on Colab. The runtime type is set to None, so I don't think the GPU is used. Does it mean I must use the GPU setting instead?

I just tried the code again using the GPU setting, with the lines import nltk and nltk.download('stopwords') added before the training part, and it solved the issue.
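For reference, the two nltk lines mentioned above can be wrapped so that the download only happens when the corpus is missing. This is a sketch: the function name ensure_stopwords is hypothetical, and the thread does not say which of the two changes (the GPU runtime or the stopwords download) actually resolved the error:

```python
def ensure_stopwords() -> bool:
    """Make sure the NLTK stopwords corpus is available locally.

    Returns True if the corpus is present (or was downloaded successfully),
    and False if nltk is not installed or the download failed.
    """
    try:
        import nltk
    except ImportError:
        print("nltk is not installed; install the project requirements first")
        return False
    try:
        # Raises LookupError when the corpus has not been downloaded yet.
        nltk.data.find("corpora/stopwords")
        return True
    except LookupError:
        # Same call as in the comment above; needs network access.
        return bool(nltk.download("stopwords"))


if __name__ == "__main__":
    print("stopwords available:", ensure_stopwords())
```

Run this in a notebook cell before the training command.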

Thank you @oi02lyl .

If needed I can share the

selfcontrol7 commented 3 years ago

Now that the training part is done, I tried to run the matching code as follows:

!CUDA_VISIBLE_DEVICES=0 python matcher.py \
  --task wdc_all_small \
  --input_path input/input_small.jsonl \
  --output_path output/output_small.jsonl \
  --lm distilbert \
  --use_gpu \
  --fp16 \
  --checkpoint_path checkpoints/

but it seems that the model cannot be found:

2020-11-03 05:51:50.890227: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "matcher.py", line 212, in <module>
    hp.lm, hp.use_gpu, hp.fp16)
  File "matcher.py", line 170, in load_model
    raise ModelNotFoundError(checkpoint)
ditto.exceptions.ModelNotFoundError: Model checkpoints/wdc_all_small.pt was not found

I also tried the notebook mentioned in #9, but the same error appears.

Please, any help?
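Judging only from the error message above, the matcher appears to look for a file named after the task inside the checkpoint directory (here checkpoints/wdc_all_small.pt). The sketch below checks for that file before launching matcher.py. The path rule is inferred from this one traceback and the helper names are hypothetical, so treat it as an assumption rather than the project's documented behavior:

```python
import os


def expected_checkpoint(checkpoint_path: str, task: str) -> str:
    """Build the checkpoint path the error message suggests: <checkpoint_path>/<task>.pt."""
    return os.path.join(checkpoint_path, task + ".pt")


def list_checkpoints(checkpoint_path: str) -> list:
    """List the .pt files that actually exist directly under the checkpoint directory."""
    if not os.path.isdir(checkpoint_path):
        return []
    return sorted(f for f in os.listdir(checkpoint_path) if f.endswith(".pt"))


if __name__ == "__main__":
    path = expected_checkpoint("checkpoints/", "wdc_all_small")
    if os.path.exists(path):
        print("found", path)
    else:
        print("missing", path)
        print("available checkpoints:", list_checkpoints("checkpoints/"))
```

Note that the training command earlier in this thread used --task Structured/Beer, while the matcher was run with --task wdc_all_small, so a checkpoint for wdc_all_small may never have been written.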

oi02lyl commented 3 years ago

You can try the script at the bottom of the updated notebook: https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH

selfcontrol7 commented 3 years ago

Hi, thank you again for your reply. I will try the updated notebook and come back to you ASAP.

Thanks.

selfcontrol7 commented 3 years ago

You can try the script at the bottom of the updated notebook: https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH

Hello,

I tried the given updated notebook but the warning below is shown when running the matcher. Please, can you guide me in solving it?

Thank you.

Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
0it [00:00, ?it/s]/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py:25: UserWarning: An input tensor was not cuda.
  warnings.warn("An input tensor was not cuda.")
4398it [00:07, 573.10it/s]

oi02lyl commented 3 years ago

I see. This is because we install only the Python version of apex, without the CUDA and C++ extensions. More details here: https://github.com/NVIDIA/apex#linux. I think the warning is safe to ignore in this case. You can also install the version with CUDA and C++ extensions following their instructions.
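A quick way to tell which apex build is installed is to check whether the amp_C module named in the warning can be imported. This is a sketch with a hypothetical helper name; it only tests for the module's presence and does not check that the extension was built for the right CUDA version:

```python
import importlib.util


def has_apex_cuda_ext() -> bool:
    """Return True if the compiled amp_C extension (named in the warning) is importable."""
    return importlib.util.find_spec("amp_C") is not None


if __name__ == "__main__":
    if has_apex_cuda_ext():
        print("apex was built with its CUDA/C++ extensions")
    else:
        print("amp_C not found: apex will use its Python fallback")
```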

braswent commented 3 years ago

You can try the script at the bottom of the updated notebook: https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH

@oi02lyl I am having similar issues with the checkpoint not being found. I tried to use the link you posted, but it says I do not have the correct credentials to see the file. Would you mind opening the notebook for public viewing? Thanks.

saharyi commented 3 years ago

https://colab.research.google.com/drive/1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2?usp=sharing&authuser=3#scrollTo=9qxLFPNvcGgH

This link doesn't work for me! I get this: There was an error loading this notebook. Ensure that the file is accessible and try again. Invalid Credentials https://drive.google.com/drive/?action=locate&id=1zCg6BeCWVj62uYqoxR5rfyEG6dfGXu_2&authuser=3 Please help me.

megagonlabs / ditto

Code implementation - Training #10