noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Performance issue #36

Open tom-doerr opened 10 months ago

tom-doerr commented 10 months ago

Using lm-format-enforcer makes generation take about 3x longer for me; am I doing something wrong?

With enforcer: took 67.46 seconds
Without: took 22.94 seconds

import time

from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

# model, tokenizer and get_prompt_whole are defined elsewhere in the script.

class TweetGenFormat(BaseModel):
    tweet: str
    tweet_quality_reasoning: str
    rating_1_to_9: float
    sounds_awkward: bool
    is_about_ai: bool
    is_about_climate_change: bool
    list_of_topics: list
    # when_to_post: datetime
    # tweet_url: HttpUrl
    # tweet_url: str

def generate_tweets():
    prompt = 'Generate a great tweet. Respond with a json object.'
    prompt_whole = get_prompt_whole(prompt)
    parser = JsonSchemaParser(TweetGenFormat.model_json_schema())
    prefix_function = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)
    batch_size = 16
    tokens = tokenizer(
        # prompt_template,
        [prompt_whole] * batch_size,
        return_tensors='pt'
    ).input_ids.cuda()

    start = time.time()
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**10,
        prefix_allowed_tokens_fn=prefix_function,
    )
    print(f'Took {time.time() - start:.2f} seconds')
    start = time.time()
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**10,
        # prefix_allowed_tokens_fn=prefix_function,
    )
    print(f'Took {time.time() - start:.2f} seconds')

tom-doerr commented 10 months ago

$ nvidia-smi
Fri Dec  8 17:41:43 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:00:10.0 Off |                    0 |
| N/A   23C    P0              34W / 250W |  39322MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:00:11.0 Off |                    0 |
| N/A   24C    P0              39W / 250W |  37567MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

mhillebrand commented 9 months ago

Have you tried Guidance, Guardrails, or Outlines? I'm just curious how those compare in terms of time.

noamgat commented 9 months ago

There are a few reasons for this type of behavior. One thing to make sure of: is the length of the result the same with and without the enforcer? Tokens per second is the right apples-to-apples comparison metric. Can you print that as well?
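
Something like this would give a rough tokens-per-second figure (a sketch against the snippet above; it assumes the default transformers behaviour where generate() returns the prompt tokens followed by the new ones, and it counts any padding after an early EOS as generated, so it is only approximate for batches):

start = time.time()
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=2**10,
    prefix_allowed_tokens_fn=prefix_function,
)
elapsed = time.time() - start
# New tokens = total output positions minus the prompt positions echoed back by generate().
new_tokens = generation_output.numel() - tokens.numel()
print(f'{new_tokens} new tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)')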

tom-doerr commented 9 months ago

@mhillebrand All the other libraries I tried didn't support batch processing. Using the HF pipeline seems pretty slow for some reason; its time scales roughly linearly with batch size.

Batch size | Pipeline | Generate (schema) | Generate (no schema)
         1 |    6.89s |             6.28s |                6.01s
         2 |   13.03s |             7.00s |                6.82s
         4 |   25.84s |             8.33s |                9.24s
         8 |   51.42s |            12.22s |               14.00s
        16 |        - |            12.87s |               15.06s
        32 |        - |            15.15s |               17.66s
        64 |        - |            18.58s |               17.24s
       128 |        - |            25.47s |               23.68s
       256 |        - |            35.83s |               31.43s

Code:

for i in range(100):
    batch_size = 2 ** i
    prompts = [prompt] * batch_size
    if batch_size <= 8:
        start = time.time()
        output_dict = hf_pipeline(prompts, prefix_allowed_tokens_fn=prefix_function,
                max_new_tokens=2**6,
                )
        # print("output_dict:", output_dict)
        print(f'Pipeline - Batch size: {batch_size}, time: {time.time() - start:.2f}s')

    start = time.time()
    tokens = tokenizer(
        # prompt_template,
        [prompt] * batch_size,
        return_tensors='pt'
    ).input_ids.cuda()

    # # Generate output
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**6,
        prefix_allowed_tokens_fn=prefix_function,
        # prefix_allowed_tokens_fn=custom_prefix_function,
    )
    # print("generation_output:", generation_output)
    output_shape = generation_output.shape
    # print("output_shape:", output_shape)
    decoded_output = tokenizer.decode(generation_output[0], skip_special_tokens=False)
    # print("decoded_output:", decoded_output)
    print(f'Generate (schema) - Batch size: {batch_size}, time: {time.time() - start:.2f}s')
    start = time.time()
    tokens = tokenizer(
        # prompt_template,
        [prompt] * batch_size,
        return_tensors='pt'
    ).input_ids.cuda()

    # # Generate output
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**6,
        # prefix_allowed_tokens_fn=prefix_function,
        # prefix_allowed_tokens_fn=custom_prefix_function,
    )
    # print("generation_output:", generation_output)
    print(f'Generate (no schema) - Batch size: {batch_size}, time: {time.time() - start:.2f}s')
    print('-' * 80)

noamgat commented 9 months ago

Hi, from the last performance table, it looks like schema and no-schema take roughly the same amount of time. So it doesn't look like the format enforcer is a major performance hurdle (again, I would print time-per-token, not just total time, as result length may differ). Do you interpret the results differently?

tom-doerr commented 9 months ago

Generation in the benchmark above is cut off by the max_new_tokens cap, so the different generation modes should produce roughly the same number of tokens. You are right that there doesn't seem to be a difference between the model.generate calls, but the pipeline does not appear to batch efficiently: its runtime grows roughly linearly with batch size. For the last example with batch size 256, that would likely mean around 50x slower generation through the pipeline. If that is expected or out of scope, I don't mind closing the issue.
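
(For reference, the variant I would expect to batch properly is passing batch_size to the pipeline call; a sketch of what I mean, not something I have re-benchmarked:)

start = time.time()
# Let the pipeline batch internally instead of iterating over the prompts one by one.
output_dict = hf_pipeline(
    prompts,
    batch_size=len(prompts),
    prefix_allowed_tokens_fn=prefix_function,
    max_new_tokens=2**6,
)
print(f'Pipeline (batch_size={len(prompts)}) - time: {time.time() - start:.2f}s')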

NJordan72 commented 3 months ago

@noamgat have you considered using Cython to enhance performance?

I've been experimenting with refactoring some of the utility parsers (like Sequence and Union) to create a parser for a DSL, and the performance improvement is significant, even with my basic C/C++ skills.

The main downside is managing the build script, but this has been relatively minor (a minimal sketch of the build script follows the example). Below is an example of how I structured it (implementation details omitted):

# C++ std::set exposed to Cython; used for the allowed-character set below.
from libcpp.set cimport set as cset

cdef class CharacterLevelParser:
    def __init__(self):
        pass

    def __cinit__(self):
        pass

    cpdef CharacterLevelParser _clone(self):
        """
        Clone the current parser.

        Returns:
            CharacterLevelParser: A clone of the current parser.
        """
        pass

    def add_character(self, new_character: str) -> CharacterLevelParser:
        new_char = ord(new_character[0])
        return self._add_character(new_char)

    cpdef CharacterLevelParser _add_character(self, char new_character):
        pass

    def get_allowed_characters(self) -> str:
        cdef list char_list = [chr(c) for c in self._get_allowed_characters()]
        return "".join(char_list)

    cpdef cset[char] _get_allowed_characters(self):
        pass

    cpdef bint can_end(self):
        """
        Check if the parser can end.

        Returns:
            bool: True if the parser can end, False otherwise.
        """
        pass

    cpdef object cache_key(self):
        """
        Get the cache key for the parser.

        Returns:
            object: The cache key.
        """
        pass
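
The build script stays small; a minimal sketch of what I use (module and file names here are placeholders, and the .pyx needs a "# distutils: language = c++" directive because of libcpp.set):

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="lmformatenforcer-cy",
    ext_modules=cythonize(
        ["lmformatenforcer_cy/character_level_parser.pyx"],
        compiler_directives={"language_level": "3"},
    ),
)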

Have you explored similar optimizations, or do you have any thoughts on integrating Cython into the project?

Dan-wanna-M commented 1 month ago

@tom-doerr I suspect this is more about Hugging Face's framework itself. Their logits-processor path, with the same workload, is roughly 10 times slower than vLLM or ExLlamaV2. One observation: if I force CUDA to synchronize at the beginning and end of a vLLM logits processor, it runs as slowly as the Hugging Face one, whereas doing the same in Hugging Face makes no measurable difference. I am not sure what this reveals, though, because logits masking inherently requires at least one CUDA sync (the mask is created on the CPU).
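
To make the experiment concrete, this is roughly the probe I mean (a sketch; the mask-building body is elided, and the signature follows vLLM's per-request logits-processor convention of taking the generated token ids and the logits tensor):

import torch

def sync_probe_logits_processor(token_ids, logits):
    # Force a device sync before any masking work.
    torch.cuda.synchronize()
    # ... build the allowed-token mask on the CPU and apply it to `logits` here ...
    # Force another sync after the (elided) masking.
    torch.cuda.synchronize()
    return logits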