neuralmagic / guidellm

Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
Apache License 2.0

[Dataset]: Iterate through benchmark dataset once #44

Closed: philschmid closed this issue 4 weeks ago

philschmid commented 2 months ago

Hello,

Really great tool! Thank you for releasing it. I am currently testing it with an HF dataset. Are you planning to support iterating through the dataset only once?

I have a dataset of 2,500 samples, and I would like to benchmark it once, stopping after all samples are done.

markurtz commented 2 months ago

Hi @philschmid,

Thanks for the feedback, and I'm glad you're finding the tool useful! Your request makes a lot of sense, and we can definitely prioritize adding this feature.

@parfeniukink, could you take the lead on this? Let's aim to have it ready by the end of the week. A rough outline that should keep the code changes minimal:

  1. Add a Constant Argument: Introduce a new option for the --max-requests argument, allowing users to pass in a string like "dataset".
  2. Update Main Script: Modify the main script to check for this new argument. If present, the script should automatically retrieve the dataset's original length from the request generator in use and set it as the value for --max-requests. File and Transformers generators should be supported; for emulated data we can raise an error saying it's not supported (see the sketch after this list).
  3. Implement and Test: Ensure the new pathway works as expected and passes all relevant tests.

This should provide a straightforward way for users to run benchmarks on their datasets without looping indefinitely.
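
A minimal sketch of how steps 1 and 2 could fit together, assuming a helper like `resolve_max_requests` and assuming the File- and Transformers-backed request generators implement `__len__` (both are assumptions for illustration, not guidellm's actual API):

```python
# Hypothetical sketch: resolve_max_requests and len() support on the
# request generators are assumptions, not guidellm's actual API.

def resolve_max_requests(max_requests: str, request_generator) -> int:
    """Resolve the --max-requests value, allowing the sentinel "dataset"."""
    if max_requests == "dataset":
        try:
            # File- and Transformers-backed generators know how many
            # samples they hold; emulated generators have no fixed length.
            return len(request_generator)
        except TypeError:
            raise ValueError(
                "--max-requests=dataset is not supported for emulated data"
            )
    return int(max_requests)
```

With a pathway like this, passing --max-requests dataset would make the benchmark run exactly one pass over the 2,500 samples and then stop.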

parfeniukink commented 2 months ago

> Hello,
>
> Really great tool! Thank you for releasing it. I am currently testing it with an HF dataset. Are you planning to support iterating through the dataset only once?
>
> I have a dataset of 2,500 samples, and I would like to benchmark it once, stopping after all samples are done.

Hey! Could you also provide the dataset that you've been using?

philschmid commented 2 months ago

Hey, I created this dataset: https://huggingface.co/datasets/philschmid/text-to-sql-dataset-medusa-test-chatml

parfeniukink commented 2 months ago

Hey @philschmid. Let's move to this PR, so we can fit the code better to what you want. I'll write you a couple of messages there.

philschmid commented 1 month ago

Thank you for working on this. While experimenting a bit more, I noticed that we might have a different problem to solve.

Currently, guidellm uses the "user" role to send requests to the backend. This means that if you have a dataset with a "conversation" (e.g., system + user + assistant) and you want to benchmark it, only the "user" content is used. That might be problematic if you want to benchmark speculative models or other inputs where the system message or previous turns are important, e.g. for encoding time.
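
To make this concrete, here is an illustrative chat-formatted sample (the field names are made up for this example, not the actual schema of the dataset linked above) and what effectively gets benchmarked today:

```python
# Illustrative chat-formatted sample; the field names are hypothetical.
sample = {
    "messages": [
        {"role": "system", "content": "You are a text-to-SQL assistant."},
        {"role": "user", "content": "List all customers from Berlin."},
        {"role": "assistant", "content": "SELECT * FROM customers WHERE city = 'Berlin';"},
    ]
}

# Only the "user" turn ends up as the prompt; the system message and any
# prior turns are dropped, so their prefill/encoding cost is never measured.
prompt = next(m["content"] for m in sample["messages"] if m["role"] == "user")
```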

parfeniukink commented 1 month ago

> Thank you for working on this. While experimenting a bit more, I noticed that we might have a different problem to solve.
>
> Currently, guidellm uses the "user" role to send requests to the backend. This means that if you have a dataset with a "conversation" (e.g., system + user + assistant) and you want to benchmark it, only the "user" content is used. That might be problematic if you want to benchmark speculative models or other inputs where the system message or previous turns are important, e.g. for encoding time.

Yeah, that sounds like a completely separate topic for discussion. @markurtz, flagging this one for you as well.