jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

ValueError while training corrector #14

Closed siebeniris closed 8 months ago

siebeniris commented 9 months ago

Hi,

I am training a corrector following the steps in the README. However, I get this error:

File "/home/xxx/vec2text/vec2text/experiments.py", line 759, in load_trainer
    ) = vec2text.aliases.load_experiment_and_trainer_from_alias(
  File "/home/xxx/vec2text/vec2text/aliases.py", line 68, in load_experiment_and_trainer_from_alias
    experiment, trainer = vec2text.analyze_utils.load_experiment_and_trainer(
  File "/home/xxx/vec2text/vec2text/analyze_utils.py", line 111, in load_experiment_and_trainer
    trainer._load_from_checkpoint(checkpoint)
  File "/home/xxx/vec2text/vec2text/trainers/base.py", line 539, in _load_from_checkpoint
    raise ValueError(
ValueError: Can't find a valid checkpoint at /home/xxx/vec2text/saves/inverters/bert/checkpoint-58000

This might be a problem with transformers. Which version of transformers are you using?

Thank you!

siebeniris commented 9 months ago

I have tried out different versions of transformers, all of them have errors as follows:

- 4.28.0: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. (Process rank=-1)
- 4.28.1: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. (Process rank=-1)
- 4.29.0: ValueError: Can't find a valid checkpoint at …
- 4.29.1: ValueError: Can't find a valid checkpoint at …
- 4.33.1 (vec2text): ValueError: Can't find a valid checkpoint at …
- 4.35.2: ValueError: Can't find a valid checkpoint at …

Any idea how to tackle this?

The checkpoint itself must be fine, because 4.28.0 and 4.28.1 produce the following output:

> checkpoint: /home/xxx/vec2text/saves/inverters/bert/checkpoint-58000
Experiment output_dir = ./saves/bert
pytorch_model.bin: 100%|██████████| 892M/892M [00:09<00:00, 98.6MB/s] 
Ariya12138 commented 9 months ago

Excuse me, may I ask how long it took you to train your inversion model? Did you use 100 epochs?

siebeniris commented 9 months ago

> Excuse me, may I ask how long it took you to train your inversion model? Did you use 100 epochs?

Hi, I haven't finished 100 epochs; it is only at epoch 37 after almost 20 hours. But I am training on a single GPU, since there are problems with DDP.

Ariya12138 commented 9 months ago

> > Excuse me, may I ask how long it took you to train your inversion model? Did you use 100 epochs?
>
> Hi, I haven't finished 100 epochs; it is only at epoch 37 after almost 20 hours. But I am training on a single GPU, since there are problems with DDP.

Oh, thanks very much.

jxmorris12 commented 9 months ago

Is this a real directory? /home/xxx/vec2text/saves/inverters/bert/checkpoint-58000

Seems pretty straightforward to me. You are loading a corrector with a path defined in the file aliases.py, which points to /home/xxx/vec2text/saves/inverters/bert/checkpoint-58000. Try running `ls /home/xxx/vec2text/saves/inverters/bert/checkpoint-58000` from the command line; I doubt there will be a valid model checkpoint at that path. At least, that's what transformers is claiming.
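If it's easier to check from Python, something like this (just a sketch, using the path from your traceback) tells you whether the files the Trainer looks for are actually there:

```python
import os

# Path copied from the traceback above ("xxx" stands for your username).
ckpt = "/home/xxx/vec2text/saves/inverters/bert/checkpoint-58000"

if os.path.isdir(ckpt):
    # A loadable checkpoint normally contains config.json plus a weights file
    # (pytorch_model.bin or model.safetensors).
    print(sorted(os.listdir(ckpt)))
else:
    print("directory does not exist:", ckpt)
```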

jxmorris12 commented 9 months ago

By the way, let me know if there are any documentation improvements that you'd recommend.

siebeniris commented 9 months ago

Hi @jxmorris12, thanks very much for the suggestion! Yes, it is a real directory. On my local computer the checkpoint can be loaded, but it does not work on our servers (something must be incompatible there). So the directories are fine, the checkpoint can be loaded, and "precomputing train hypothesis" completed when I tried to train the corrector. But now I have run into the problem below (I just wanted to try it out on my local computer, macOS), in case you have any idea what's going on there... I think I have the right PyTorch version for my OS.

File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/transformers/trainer.py", line 481, in __init__
    self._move_model_to_device(model, args.device)
  File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/transformers/trainer.py", line 716, in _move_model_to_device
    model = model.to(device)
  File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2271, in to
    return super().to(*args, **kwargs)
  File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/Users/xxx/anaconda3/envs/v2t/lib/python3.10/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

I have removed certain `assert torch.cuda.is_available()` checks from experiments.py and also changed the "device" handling a bit to accommodate the OS.
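Roughly, what I mean by changing the device handling is a fallback like this (a minimal sketch, not the exact code in experiments.py):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
```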

Regarding documentation, thanks very much for the great work! It has been working for me so far. But I might make recommendations in the coming days if that's alright.

Thank you :)

siebeniris commented 9 months ago

Some things I noticed on the way:

jxmorris12 commented 9 months ago

Thanks @siebeniris! These improvements all sound useful, please submit a pull request so other users can benefit from your work! :)

Regarding your error: it looks like the Hugging Face Trainer is still trying to move the model to a GPU even though you don't have one available. If I'm understanding correctly, you've trained an inversion model with CUDA and are now trying to use it to train a corrector model without CUDA available. Sorry, I've never gotten the code into exactly this state before, so I can't provide more guidance.
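One untested idea: the Trainer takes its device from TrainingArguments, so forcing CPU there should keep the _move_model_to_device call in your traceback away from CUDA. A minimal sketch (output_dir is just a placeholder):

```python
from transformers import TrainingArguments

# no_cuda=True should make args.device resolve to CPU; newer transformers
# versions spell this flag use_cpu. output_dir here is only a placeholder.
args = TrainingArguments(output_dir="./saves/corrector", no_cuda=True)
print(args.device)  # expect: cpu
```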

siebeniris commented 9 months ago

Hi @jxmorris12 , thanks for the insights!