texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0
494 stars 94 forks

Checkpoint path should be absolute #111

Open bhargav25dave1996 opened 6 months ago

bhargav25dave1996 commented 6 months ago

At training time I am getting this error:

Command:

```
python -m tevatron.tevax.experimental.mp.train_lora \
  --checkpoint_dir retriever-mistral-jax \
  --train_file Tevatron/msmarco-passage-aug \
  --model_name mistralai/Mistral-7B-v0.1 \
  --model_type mistral \
  --batch_size 128 \
  --num_target_passages 16 \
  --learning_rate 1e-4 \
  --seed 12345 \
  --mesh_shape 1 -1 \
  --weight_decay 0.00001 \
  --num_epochs 1 \
  --max_query_length 64 \
  --max_passage_length 128 \
  --pooling eos \
  --scale_by_dim True \
  --grad_cache \
  --passage_num_chunks 32 \
  --query_num_chunks 4
```

Error:

```
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/irlab/tevatron/src/tevatron/tevax/experimental/mp/train_lora.py", line 394, in <module>
    main()
  File "/home/irlab/tevatron/src/tevatron/tevax/experimental/mp/train_lora.py", line 375, in main
    checkpoint_manager.save(
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/checkpoint_manager.py", line 515, in save
    self._checkpointers[k].save(item_dir, item, **kwargs)
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/async_checkpointer.py", line 281, in save
    commit_ops = asyncio.run(self._handler.async_save(tmpdir, args=ckpt_args))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/pytree_checkpoint_handler.py", line 835, in async_save
    commit_futures = await asyncio.gather(*serialize_ops)
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1376, in serialize
    tspec = self._get_json_tspec_write(
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1273, in _get_json_tspec_write
    tspec = self._get_json_tspec(
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 1253, in _get_json_tspec
    tspec: Dict[str, Any] = get_tensorstore_spec(
  File "/home/irlab/.local/lib/python3.10/site-packages/orbax/checkpoint/type_handlers.py", line 821, in get_tensorstore_spec
    raise ValueError(f'Checkpoint path should be absolute. Got {directory}')
ValueError: Checkpoint path should be absolute. Got retriever-mistral-jax/0.orbax-checkpoint-tmp-1711610071493337/lora.orbax-checkpoint-tmp-1711610103114668
```

luyug commented 6 months ago

As the error notes:

> Checkpoint path should be absolute.

This is for compatibility with cloud storage.
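The check that raises this error can be sketched roughly as follows. This is a simplification of what Orbax's `get_tensorstore_spec` does, not its actual code, and the `gs://` handling shown here is an assumption for illustration:

```python
import os

def require_absolute(directory: str) -> str:
    # Rough sketch of the validation Orbax performs before building a
    # TensorStore spec: relative local paths are rejected so the same
    # spec logic can also address cloud storage URIs (treating "gs://"
    # paths as acceptable is an assumption of this sketch).
    if not directory.startswith("gs://") and not os.path.isabs(directory):
        raise ValueError(f"Checkpoint path should be absolute. Got {directory}")
    return directory
```

Under this logic, `retriever-mistral-jax` is rejected, while `/home/irlab/retriever-mistral-jax` or a `gs://` bucket path would pass.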

bhargav25dave1996 commented 6 months ago

@luyug I am using Google Cloud TPU V4-8 as suggested. Can you help with the issue?

MXueguang commented 6 months ago

> --checkpoint_dir retriever-mistral-jax

Change it to an absolute path such as `/home/<your user name>/retriever-mistral-jax` and it should work.
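If you want the launch command to work regardless of the current working directory, you can expand the path up front. This is a generic helper for illustration, not a Tevatron feature:

```python
from pathlib import Path

def to_absolute(checkpoint_dir: str) -> str:
    # Expand "~" and resolve a relative path against the current
    # working directory, so Orbax always receives an absolute path.
    return str(Path(checkpoint_dir).expanduser().resolve())
```

For example, `to_absolute("retriever-mistral-jax")` yields the directory anchored at wherever the script was launched.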

bhargav25dave1996 commented 6 months ago

@MXueguang Thanks, that resolved the issue.

I am now facing an issue with encoding. I am using the command below to encode the MS MARCO corpus.

```
python -m tevatron.tevax.experimental.mp.encode \
  --model_type mistral \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --model_config_name_or_path mistralai/Mistral-7B-v0.1 \
  --tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \
  --dataset_name_or_path Tevatron/msmarco-passage-corpus \
  --output_dir /mnt/disk/corpus-embedding \
  --batch_size 32 \
  --input_type passage \
  --max_seq_length 128 \
  --mesh_shape 1 -1 \
  --lora /mnt/disk/retriever-mistral-jax/lora \
  --scale_by_dim
```

But it does not save the embeddings at the output path. Please find a screenshot below. @MXueguang @luyug can you help me with this?

[Screenshot 2024-04-02 072835]
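As a quick sanity check (a hypothetical helper, not part of Tevatron), you can list what, if anything, was written under the `--output_dir` after encoding finishes:

```python
from pathlib import Path

def list_outputs(output_dir: str) -> list[str]:
    # Return the names of files under the encode --output_dir, to
    # confirm whether any embedding shards were actually written.
    out = Path(output_dir)
    if not out.exists():
        return []
    return sorted(p.name for p in out.iterdir() if p.is_file())
```

An empty list for `/mnt/disk/corpus-embedding` would confirm that nothing was saved, which helps distinguish a wrong or relative path from a silent failure during encoding itself.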