salesforce / CodeTF

CodeTF: One-stop Transformer Library for State-of-the-art Code LLM
Apache License 2.0

Unable to run example humaneval code #27

Open yaoyanglee opened 1 year ago

yaoyanglee commented 1 year ago

```python
!pip install sentencepiece

from codetf.models import load_model_pipeline
from codetf.data_utility.human_eval_dataset import HumanEvalDataset
from codetf.performance.model_evaluator import ModelEvaluator
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

model_class = load_model_pipeline(model_name="causallm", task="pretrained",
                                  model_type="codegen-350M-mono", is_eval=True,
                                  load_in_8bit=True, weight_sharding=False)

dataset = HumanEvalDataset(tokenizer=model_class.get_tokenizer())
prompt_token_ids, prompt_attention_masks, references = dataset.load()

problems = TensorDataset(prompt_token_ids, prompt_attention_masks)

evaluator = ModelEvaluator(model_class)
avg_pass_at_k = evaluator.evaluate_pass_k(problems=problems, unit_tests=references)
print("Pass@k: ", avg_pass_at_k)
```

Above is the code that was used. During execution in Google Colab, I received the following error:

    in <cell line: 15>:15

    /usr/local/lib/python3.10/dist-packages/codetf/data_utility/human_eval_dataset.py:29 in load

       26             unit_test = re.sub(r'METADATA = {[^}]*}', '', unit_test, flags=re.MULTILINE)
       27             references.append(unit_test)
       28
     ❱ 29         prompt_token_ids, prompt_attention_masks = self.process_data(prompts, use_max_le
       30
       31         return prompt_token_ids, prompt_attention_masks, references
       32

    TypeError: BaseDataset.process_data() got an unexpected keyword argument 'use_max_length'

After looking through the source code, I can't find this keyword argument anywhere; the only similar one I see is `max_length`. Could anyone shed some light on the issue?

Luxios22 commented 1 year ago

Same issue here.

Luxios22 commented 1 year ago

After I removed the keyword, I got a second error:

`NameError: name 'TensorDataset' is not defined`

I think an import is missing from the example. After I fixed both of the problems mentioned above, it began to work.

I also looked into the package (1.0.1.1) installed on my local server and found that the code in this version is not in sync with the main branch of the repo. The latest main branch seems to have fixed this issue, so I think we can fix it by reinstalling the package from the repo rather than from pip.
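To confirm this kind of mismatch before editing anything, it can help to ask Python directly which version is installed and whether a method accepts a given keyword. The following is only a minimal sketch using the standard library; `accepts_kwarg`, `installed_version`, and `process_data_stub` are hypothetical names introduced here for illustration, and the stub merely mimics the shape of the failing call rather than reproducing CodeTF's code.

```python
import inspect
from importlib import metadata


def accepts_kwarg(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical stand-in: a method that knows `max_length` but not `use_max_length`.
def process_data_stub(prompts, max_length=512):
    return prompts


print(accepts_kwarg(process_data_stub, "use_max_length"))
print(accepts_kwarg(process_data_stub, "max_length"))
```

In a real session you would pass the actual `process_data` method from your installed copy instead of the stub, and call `installed_version` with whatever distribution name you gave to pip. If the keyword is rejected, the installed package is older than the calling code, which matches the symptom in this thread.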

yaoyanglee commented 1 year ago

For the `TensorDataset` NameError, I found that adding this line solves the issue: `from torch.utils.data import TensorDataset`

yaoyanglee commented 1 year ago

I would recommend upgrading numpy as well.