songmzhang / DSKD

Repo for Paper "Dual-Space Knowledge Distillation for Large Language Models".

Using Mistral as teacher (from #14)

Closed — survivebycoding closed this issue 2 weeks ago

survivebycoding commented 3 weeks ago

I am getting this error when I hardcode the tokenizer path and use this Mistral model as the teacher:

Traceback (most recent call last):
  File "/DSKD/llm_kd/lib/python3.10/site-packages/transformers/utils/hub.py", line 402, in cached_file
    resolved_file = hf_hub_download(
  File "/DSKD/llm_kd/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/DSKD/llm_kd/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
  File "/DSKD/llm_kd/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated/tokenizer.model.v3'. Use repo_type argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/DSKD/code/distillation.py", line 618, in <module>
  File "/DSKD/code/distillation.py", line 564, in main
    distiller = Distiller(args, device)
  File "/DSKD/code/distiller.py", line 25, in __init__
    self.student_model, self.student_tokenizer = self.load_student_model()
  File "/DSKD/code/distiller.py", line 171, in load_student_model
    tokenizer = self.load_tokenizer(self.args.model_type, self.args.model_path)
  File "/DSKD/code/distiller.py", line 90, in load_tokenizer
    tokenizer = AutoTokenizer.from_pretrained('/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated/tokenizer.model.v3', trust_remote_code=True)
  File "/DSKD/llm_kd/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 833, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/DSKD/llm_kd/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 665, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/DSKD/llm_kd/lib/python3.10/site-packages/transformers/utils/hub.py", line 466, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: '/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated/tokenizer.model.v3'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

E0821 08:56:38.202000 139925583940480 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 81439) of binary: /data3/esenris_rishika/DSKD/llm_kd/bin/python3.10

Any idea how to resolve this issue?

songmzhang commented 3 weeks ago

It seems that you provided an incorrect model_path to AutoTokenizer. Please make sure the model path is the folder containing your local model, not a file inside it.
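Not part of the repo, but a minimal defensive sketch of the point above: AutoTokenizer.from_pretrained expects a local directory (or a Hub repo id), so a path that accidentally points at a file such as tokenizer.model.v3 can be normalized to its containing folder first. The helper name to_model_dir is made up for illustration:

```python
import os

def to_model_dir(path: str) -> str:
    """If `path` points at a file (e.g. .../tokenizer.model.v3), return its
    containing folder, since AutoTokenizer.from_pretrained expects a
    directory or a Hub repo id, never a file path."""
    return os.path.dirname(path) if os.path.isfile(path) else path
```

One could then call, for example, AutoTokenizer.from_pretrained(to_model_dir(model_path), trust_remote_code=True).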

survivebycoding commented 3 weeks ago

Our model folder looks like this: [screenshot]

The model path is:
/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated

The path provided is:
tokenizer = AutoTokenizer.from_pretrained('/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated/tokenizer.model.v3', trust_remote_code=True)

The paths we have provided for the teacher are:
TEACHER_MODEL_PATH="/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated"
TEACHER_PEFT_PATH="/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated"

Where are we going wrong?

survivebycoding commented 3 weeks ago

I changed the model path to the folder:

tokenizer = AutoTokenizer.from_pretrained('/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated/', trust_remote_code=True)

new error:

OSError: Can't load tokenizer for '/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/data2/mistral-finetune/telequad_gaia_nopretrained_finetuned_mistral/checkpoints/checkpoint_000300/consolidated/' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

E0821 10:38:37.928000 140023310101376 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 84790) of binary: /DSKD/llm_kd/bin/python3.10

songmzhang commented 3 weeks ago


This path seems OK. Please try renaming tokenizer.model.v3 to tokenizer.model and rerun. If it still does not work, you can open an issue in the repo where you got the model.
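The rename step above can be sketched as a small script; this is not part of the repo, and the helper name fix_tokenizer_name is made up for illustration. It assumes the Hugging Face loader looks for a file literally named tokenizer.model inside the checkpoint folder:

```python
from pathlib import Path

def fix_tokenizer_name(ckpt_dir: str) -> bool:
    """Rename tokenizer.model.v3 -> tokenizer.model inside the checkpoint
    folder so the HF tokenizer loader can find it.
    Returns True if a rename actually happened."""
    src = Path(ckpt_dir) / "tokenizer.model.v3"
    dst = Path(ckpt_dir) / "tokenizer.model"
    if src.is_file() and not dst.exists():
        src.rename(dst)
        return True
    return False
```

The guard against an existing tokenizer.model makes the operation idempotent, so rerunning the script is harmless.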

survivebycoding commented 3 weeks ago

Tried that, and got this error:

[rank0]:     raise ValueError(
[rank0]: ValueError: Couldn't instantiate the backend tokenizer from one of:
[rank0]: (1) a `tokenizers` library serialization file,
[rank0]: (2) a slow tokenizer instance to convert or
[rank0]: (3) an equivalent slow tokenizer class to instantiate and convert.
[rank0]: You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

E0821 11:56:09.375000 140524088064896 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 89387) of binary: a/DSKD/llm_kd/bin/python3.10

songmzhang commented 3 weeks ago

The message itself suggests the fix: install sentencepiece. You can install it with pip install sentencepiece and retry.
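For anyone hitting the same wall, a quick sanity check (not part of the repo; the helper name missing_deps is made up for illustration) can report which of the tokenizer-conversion dependencies are importable before launching a long distributed run:

```python
import importlib.util

def missing_deps(needed=("sentencepiece", "tokenizers")) -> list:
    """Return the names in `needed` that cannot be imported.
    An empty list means the slow->fast tokenizer conversion
    has everything it needs."""
    return [m for m in needed if importlib.util.find_spec(m) is None]
```

Running missing_deps() before distillation starts turns a mid-run crash into an immediate, readable report of what to pip install.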