nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Batch inference #140

Open RyanChen1997 opened 1 year ago

RyanChen1997 commented 1 year ago

Sorry, I am new to this. Following the code in inference_wizardcoder.py, I created a service and ran a benchmark test. The result: at a concurrency of 5, each request takes about 35s on average. I want to reduce the latency and increase concurrency. Right now each request is processed individually (the model is called with a single input each time). Is there a way to process multiple requests at once?

for num, line in enumerate(input_data):
    one_data = line
    id = one_data["idx"]
    instruction = one_data["Instruction"]
    print(instruction)
    _output = evaluate(instruction, tokenizer, model)  # call model with one input every time
    final_output = _output[0].split("### Response:")[1].strip()
    new_data = {
        "id": id,
        "instruction": instruction,
        "wizardcoder": final_output
    }
    output_data.write(new_data)

I would like to change this logic so that it can do batch inference. Thanks a lot!

anmolagarwal999 commented 1 year ago

@ChiYeungLaw @nlpxucan I am trying to run batch inference by making a small change on this line. However, since different inputs may not be the same length, left-side padding needs to be applied to the shorter inputs.

My question is: which padding token should be used? The default padding token (i.e. tokenizer.pad_token) is '[PAD]'. However, I have seen some examples online (such as this and this) that explicitly set the padding token to tokenizer.eos_token, i.e. '<|endoftext|>'.

What is the correct padding token to use? Thanks.
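For what it's worth, a common pattern for decoder-only models (this is only a sketch, not an answer confirmed by the maintainers; the model id and settings below are assumptions) is to pad on the left and reuse the EOS token as the pad token when no dedicated pad token exists:

```python
from transformers import AutoTokenizer

# Sketch only: the checkpoint name here is an assumption, not taken from this repo's code.
tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardCoder-15B-V1.0")

# Decoder-only models continue generating from the right end of the input,
# so left padding keeps the real prompt tokens adjacent to the generated ones.
tokenizer.padding_side = "left"

# Reusing EOS as PAD is a common workaround when there is no dedicated pad token;
# the padded positions should be masked out via attention_mask anyway, so the
# main requirement is that the choice is applied consistently.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```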

RyanChen1997 commented 1 year ago
def generate(self, batch_data):
    # Accept either a single instruction or a list of instructions.
    if isinstance(batch_data, list):
        prompts = []
        for data in batch_data:
            prompts.append(self._generate_prompt(data))
    else:
        prompts = self._generate_prompt(batch_data)
    # Tokenize the whole batch at once; padding=True pads to the longest prompt.
    inputs = self.tokenizer(
        prompts, return_tensors="pt", max_length=256, truncation=True, padding=True
    )
    input_ids = inputs["input_ids"].to(self.device)
    with torch.no_grad():
        generation_output = self.model.generate(
            input_ids=input_ids,
            generation_config=self.generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=self.max_new_tokens,
        )
    s = generation_output.sequences
    output = self.tokenizer.batch_decode(s, skip_special_tokens=True)
    return output

It works.
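A minimal usage sketch for anyone copying this (the wrapper object name `coder` and the batch size are assumptions; only the generate method above is from this thread):

```python
# `coder` is assumed to be an instance of the wrapper class that holds the
# tokenizer, model, generation_config, and the generate() method shown above.
batch = [d["Instruction"] for d in input_data[:8]]  # e.g. 8 prompts per batch
raw_outputs = coder.generate(batch)

# Same post-processing as the single-input loop in inference_wizardcoder.py.
responses = [o.split("### Response:")[1].strip() for o in raw_outputs]
```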

jaideep11061982 commented 1 year ago

@RyanChen1997 can you also provide the definition of self._generate_prompt?

RyanChen1997 commented 1 year ago

@jaideep11061982 Just copy the function named generate_prompt from https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/src/inference_wizardcoder.py
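For reference, the function in that file looks roughly like the sketch below (reproduced from memory, so please double-check against the linked source):

```python
def generate_prompt(instruction, input=None):
    # Alpaca-style prompt template used by WizardCoder; verify against
    # WizardCoder/src/inference_wizardcoder.py before relying on it.
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""
```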

jaideep11061982 commented 1 year ago

@RyanChen1997 thank you. How can WizardLM be loaded across multiple GPUs? Will simple DDP work?

prabhatp251 commented 9 months ago
> (quoting @RyanChen1997's batch generate() method and "It works" reply above)

@RyanChen1997: Shouldn't you also pass inputs["attention_mask"] to the generate function when doing batch inference? Otherwise the default attention_mask will be all 1s, i.e. the model will attend even to the pad tokens (cf. https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1572).
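A sketch of what that change could look like, building on the generate() method posted above (only the attention_mask lines are new; everything else is unchanged and this is not a fix confirmed by the maintainers):

```python
def generate(self, batch_data):
    if isinstance(batch_data, list):
        prompts = [self._generate_prompt(data) for data in batch_data]
    else:
        prompts = self._generate_prompt(batch_data)
    inputs = self.tokenizer(
        prompts, return_tensors="pt", max_length=256, truncation=True, padding=True
    )
    input_ids = inputs["input_ids"].to(self.device)
    # New: move the tokenizer's attention mask to the same device as the inputs.
    attention_mask = inputs["attention_mask"].to(self.device)
    with torch.no_grad():
        generation_output = self.model.generate(
            input_ids=input_ids,
            # New: mask out the padding positions instead of attending to them.
            attention_mask=attention_mask,
            generation_config=self.generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=self.max_new_tokens,
        )
    s = generation_output.sequences
    return self.tokenizer.batch_decode(s, skip_special_tokens=True)
```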