Open bhargav25dave1996 opened 6 months ago
As it notes, `Checkpoint path should be absolute`. This is for compatibility with cloud storage.
@luyug I am using a Google Cloud TPU v4-8 as suggested. Can you help with the issue?
Change `--checkpoint_dir retriever-mistral-jax` to something like `--checkpoint_dir /home/<your user name>/retriever-mistral-jax` and it should work.
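If you prefer to build the absolute path programmatically, a minimal sketch (using only the relative directory name from the command above) is to expand it with `os.path` before passing it to the trainer:

```python
import os

# Relative checkpoint directory, as passed on the command line above.
checkpoint_dir = "retriever-mistral-jax"

# Orbax requires an absolute path, so expand it against the current
# working directory (equivalent to prefixing /home/<user>/ by hand).
checkpoint_dir = os.path.abspath(os.path.expanduser(checkpoint_dir))

print(checkpoint_dir)
assert os.path.isabs(checkpoint_dir)
```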
@MXueguang Thanks, that resolved the issue.
I am now facing an issue with encoding. I am using the command below to encode the MS MARCO corpus:
    python -m tevatron.tevax.experimental.mp.encode \
      --model_type mistral \
      --model_name_or_path mistralai/Mistral-7B-v0.1 \
      --model_config_name_or_path mistralai/Mistral-7B-v0.1 \
      --tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \
      --dataset_name_or_path Tevatron/msmarco-passage-corpus \
      --output_dir /mnt/disk/corpus-embedding \
      --batch_size 32 \
      --input_type passage \
      --max_seq_length 128 \
      --mesh_shape 1 -1 \
      --lora /mnt/disk/retriever-mistral-jax/lora \
      --scale_by_dim
But it does not save any embeddings at the output path. Please find a screenshot of the same below. @MXueguang @luyug can you help me with this?
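As a quick sanity check while debugging, you can list what actually landed in the output directory. A small sketch, assuming the `--output_dir` from the command above (adjust the path to your machine):

```python
from pathlib import Path

# --output_dir from the encode command above; adjust to your setup.
out = Path("/mnt/disk/corpus-embedding")

if not out.exists():
    print(f"{out} does not exist yet")
else:
    # Recursively list files so nested shard directories are counted too.
    files = sorted(p for p in out.rglob("*") if p.is_file())
    print(f"{len(files)} file(s) under {out}")
    for p in files[:10]:
        print(" ", p, p.stat().st_size, "bytes")
```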
At the time of training I am getting this error:
Command:

    python -m tevatron.tevax.experimental.mp.train_lora \
      --checkpoint_dir retriever-mistral-jax \
      --train_file Tevatron/msmarco-passage-aug \
      --model_name mistralai/Mistral-7B-v0.1 \
      --model_type mistral \
      --batch_size 128 \
      --num_target_passages 16 \
      --learning_rate 1e-4 \
      --seed 12345 \
      --mesh_shape 1 -1 \
      --weight_decay 0.00001 \
      --num_epochs 1 \
      --max_query_length 64 \
      --max_passage_length 128 \
      --pooling eos \
      --scale_by_dim True \
      --grad_cache \
      --passage_num_chunks 32 \
      --query_num_chunks 4
Error:
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/home/irlab/tevatron/src/tevatron/tevax/experimental/mp/train_lora.py", line 394, in <module>
        main()
      File "/home/irlab/tevatron/src/tevatron/tevax/experimental/mp/train_lora.py", line 375, in main
        checkpoint_manager.save(
      File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 515, in save
        self._checkpointers[k].save(item_dir, item, **kwargs)
      File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/async_checkpointer.py", line 281, in save
        commit_ops = asyncio.run(self._handler.async_save(tmpdir, args=ckpt_args))
      File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
        return future.result()
      File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/pytree_checkpoint_handler.py", line 835, in async_save
        commit_futures = await asyncio.gather(*serialize_ops)
      File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1376, in serialize
        tspec = self._get_json_tspec_write(
      File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1273, in _get_json_tspec_write
        tspec = self._get_json_tspec(
      File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1253, in _get_json_tspec
        tspec: Dict[str, Any] = get_tensorstore_spec(
      File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 821, in get_tensorstore_spec
        raise ValueError(f'Checkpoint path should be absolute. Got {directory}')
    ValueError: Checkpoint path should be absolute. Got retriever-mistral-jax/0.orbax-checkpoint-tmp-1711610071493337/lora.orbax-checkpoint-tmp-1711610103114668
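For reference, the check that raises here boils down to a simple absolute-path test. A minimal sketch mirroring the error (not the actual orbax source): any relative `--checkpoint_dir` trips it, while an expanded absolute path passes.

```python
import os

def require_absolute(directory: str) -> str:
    # Same condition the traceback ends on: checkpoint storage needs an
    # unambiguous root so the path also resolves on cloud storage.
    if not os.path.isabs(directory):
        raise ValueError(f"Checkpoint path should be absolute. Got {directory}")
    return directory

try:
    require_absolute("retriever-mistral-jax")  # relative -> raises
except ValueError as e:
    print(e)

# Expanding first makes the same directory acceptable.
print(require_absolute(os.path.abspath("retriever-mistral-jax")))
```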