Closed Sunt-ing closed 2 months ago
My command:
python -m sarathi.benchmark.capacity_search.main --tbt-slo-value 0.5 --config-path ./osdi-experiments/figure-5-6/mistral7b_relaxed.yml --output-dir ./capacity_search_output_mistral7b_relaxed
Hi @Sunt-ing, unfortunately due to license issues, we are not able to release certain traces.
Each trace file, like sharegpt_8k_filtered_stats_llama2_tokenizer.csv, has two columns: num_prefill_tokens and num_decode_tokens.
The sharegpt_8k trace is obtained by processing openchat_sharegpt4_dataset, specifically the file sharegpt_gpt4.json. You can write a script to tokenize the text in this file using the llama2 tokenizer.
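A minimal sketch of the two-column trace layout described above. The helper name `write_trace` and the whitespace-based `count_tokens` are assumptions for illustration only; reproducing the real token counts would require swapping in the Llama 2 tokenizer, which is not shown here.

```python
import csv
import io
from typing import Callable, Iterable, TextIO, Tuple


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words.

    Replace this with a real Llama 2 tokenizer to get actual token counts.
    """
    return len(text.split())


def write_trace(
    pairs: Iterable[Tuple[str, str]],
    out: TextIO,
    tokenize: Callable[[str], int] = count_tokens,
) -> int:
    """Write one row per (prompt, response) pair and return the row count."""
    writer = csv.writer(out)
    # Column names taken from the thread.
    writer.writerow(["num_prefill_tokens", "num_decode_tokens"])
    rows = 0
    for prompt, response in pairs:
        writer.writerow([tokenize(prompt), tokenize(response)])
        rows += 1
    return rows


# Made-up example data, not taken from any real dataset.
example_pairs = [
    ("What is a KV cache?", "It stores attention keys and values."),
    ("Summarize this paper please", "Sure."),
]
buffer = io.StringIO()
num_rows = write_trace(example_pairs, buffer)
trace_lines = buffer.getvalue().splitlines()
```

Writing to an in-memory buffer keeps the sketch free of file paths; passing an open file handle instead would produce a CSV on disk.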
@nitinkedia7 Thanks! btw, how were the num_prefill_tokens and num_decode_tokens in multi-round conversation counted?
@Sunt-ing This is how the traces used in the paper are obtained:
splitwise_conv: It is taken from AzurePublicDataset/data/AzureLLMInferenceTrace_conv.csv at master · Azure/AzurePublicDataset (github.com). In some experiments, we take only the length data from above, not the arrival timestamps. For arrival times, we use Poisson arrivals in that case.
splitwise_code: Same as splitwise_conv. Source: AzurePublicDataset/data/AzureLLMInferenceTrace_code.csv at master · Azure/AzurePublicDataset (github.com)
arxiv_summarization: ccdv/arxiv-summarization · Datasets at Hugging Face. Here each entry is an arXiv paper with three fields: id, article (text containing the body of the paper) and abstract (text containing the abstract of the paper). We tokenize the article text to get num_prefill_tokens, and the abstract gives num_decode_tokens.
chat: The chat workload in the vidur paper refers to lmsys-1m. Here each row in the dataset is a conversation between the user and the assistant.
Each conversation is of the format:
[
{ "content": ..., "role": "user" },
{ "content": ..., "role": "assistant" },
{ "content": ..., "role": "user" },
{ "content": ..., "role": "assistant" },
]
Each conversation results in one or more requests in the trace. We keep a running sum of the tokens in each "content" field in the above list. Whenever we encounter an assistant turn, a new request is generated with num_prefill_tokens = sum_of_tokens_in_all_previous_turns and num_decode_tokens = num_of_tokens_in_the_current_turn (the assistant's turn). Request order is randomized; even the order of requests from the same conversation is not kept.
sharegpt: This was obtained from openchat/openchat_sharegpt4_dataset at main (huggingface.co) in the same way as the lmsys-1m dataset was processed. openchat_8192.train.text.json is most likely the file used to generate the trace, based on the number of requests in the trace and in the raw dataset (approximately 32k). Other candidates are sharegpt_clean.json and sharegpt_gpt4.json.
@nitinkedia7 Thanks! 5. For lmsys-1m, the running sum for num_prefill_tokens should only apply where turn > 1; when turn = 1 (i.e. a single-turn conversation), num_prefill_tokens should be llama2_tokenizer(user content). Computed that way, I get a mean num_prefill_tokens of 713.03 over all samples with the llama2 tokenizer.
But I observe that the mean of the num_prefill_tokens column in https://github.com/project-etalon/etalon/blob/main/data/processed_traces/lmsys_chat_1m_conversation_stats_llama2_tokenizer.csv (785.78) only matches if, even for turn = 1, num_prefill_tokens is calculated as llama2_tokenizer(user content) + llama2_tokenizer(assistant content). There is no issue with num_decode_tokens. Kindly clarify the num_prefill_tokens calculation!
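The running-sum rule described earlier in the thread, and the two counting conventions being compared here, can be sketched as follows. This is an illustration, not the project's actual preprocessing script: `conversation_to_requests`, the `include_current_turn` flag, and the whitespace-based `count_tokens` stand-in are all assumptions.

```python
from typing import Callable, Dict, List, Tuple


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def conversation_to_requests(
    conversation: List[Dict[str, str]],
    tokenize: Callable[[str], int] = count_tokens,
    include_current_turn: bool = False,
) -> List[Tuple[int, int]]:
    """Return (num_prefill_tokens, num_decode_tokens) per assistant turn.

    include_current_turn=False: prefill is the sum of all previous turns.
    include_current_turn=True: prefill also counts the current assistant
    turn, which is the overcounting discussed in this thread.
    """
    requests = []
    running_sum = 0
    for turn in conversation:
        n = tokenize(turn["content"])
        if turn["role"] == "assistant":
            prefill = running_sum + n if include_current_turn else running_sum
            requests.append((prefill, n))
        # Every turn, user or assistant, feeds the context of later turns.
        running_sum += n
    return requests


# Made-up conversation; token counts under the stand-in tokenizer: 3, 2, 4, 1.
conversation = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six seven eight nine"},
    {"role": "assistant", "content": "ten"},
]
intended = conversation_to_requests(conversation)
overcounted = conversation_to_requests(conversation, include_current_turn=True)
```

With the intended rule, the first request's prefill is just the first user message; with the overcounting variant, each prefill is larger by exactly that request's decode length. The thread also says request order is randomized, which could be done afterwards with `random.shuffle`.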
@nitinkedia7 6. sharegpt: The total number of rows in https://github.com/project-etalon/etalon/blob/main/data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv is 30731. I observed that any content, from either Human or Assistant, that does not end with <|end_of_turn|> is ignored, so I followed the same rule. Here, I see that the CSV keeps the same order as the chats in openchat_8192.train.text.json. For conversation 11, which has 11 turns, num_prefill_tokens should be [866, 1985, 3179, 3340, 3520, 3631, 4745, 5141, 5636, 6195, 6823], but in the CSV (rows 36 to 43) it is given as [866, 3340, 3520, 3631, 5141, 5636, 6195, 6823]. Please clarify why turns 2, 3 and 7 are not included and only 8 of the 11 turns (rows 36 to 43) are reported in the CSV. I think there may be many more such cases; for the same reason I get 31548 requests in total, whereas the trace CSV contains 30731 rows. Kindly clarify this! Please find my code and the corresponding CSV at https://github.com/subhendukhatuya/LLMInference, and check conversation 11 as an example.
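As a quick check of the numbers quoted above, the sketch below aligns the expected prefill counts for conversation 11 against the values found in the CSV and reports which turns are absent. The helper name `missing_turns` is hypothetical, and it assumes the CSV values appear in the same order as the expected ones.

```python
from typing import List

# Values copied from the comment above.
expected_prefill = [866, 1985, 3179, 3340, 3520, 3631, 4745, 5141, 5636, 6195, 6823]
csv_prefill = [866, 3340, 3520, 3631, 5141, 5636, 6195, 6823]


def missing_turns(expected: List[int], observed: List[int]) -> List[int]:
    """Return 1-based turn numbers of expected values absent from observed.

    Assumes observed is an in-order subsequence of expected.
    """
    missing = []
    j = 0  # index of the next unmatched observed value
    for turn, value in enumerate(expected, start=1):
        if j < len(observed) and observed[j] == value:
            j += 1
        else:
            missing.append(turn)
    return missing


missing = missing_turns(expected_prefill, csv_prefill)

# Share of requests absent from the released trace, from the two totals above.
reproduced_rows = 31548
released_rows = 30731
missing_fraction = (reproduced_rows - released_rows) / reproduced_rows
```

Running the same alignment over every conversation would show whether the dropped turns follow a pattern, for example a length filter, though the thread does not say which filter, if any, was applied.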
@nitinkedia7 4. arxiv_summarization: ccdv/arxiv-summarization · Datasets at Hugging Face. There are 216K rows in total, but I see only 186206 rows in the trace file at https://github.com/project-etalon/etalon/blob/main/data/processed_traces/arxiv_summarization_filtered_stats_llama2_tokenizer.csv
Kindly clarify!
Hi @subhendukhatuya,
lmsys-1m: You are right that num_prefill_tokens is overcounted. num_prefill_tokens should be the sum of all the previous messages, not including the current message. Thanks for pointing this out!
sharegpt: It looks like 2.6% of the requests are missing. In the experiments, a few thousand randomly selected requests are used, so these missing requests should not cause a significant drift in the nature of the trace.
arxiv_summarization: It is likely that only the train split was used. That means 9.1% of the requests are missing. This dataset has uniform-looking requests, as there are no multi-turn conversations where tokens add up, so these missing requests should not cause a significant drift in the nature of the trace.
Thank you for your interest in this project. I've provided the information I currently have available on this issue. Unfortunately, I won't be able to offer additional clarifications or assistance at this time due to other commitments. I encourage you to use your own judgement while parsing the raw datasets. Thanks for your understanding.
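The thread mentions two workload-construction steps: using a few thousand randomly selected requests, and pairing length-only data with Poisson arrivals. A rough sketch of both, under stated assumptions: the function name, the `qps` and `seed` parameters, and the example lengths are all hypothetical, and this is not the project's sampling code.

```python
import random
from typing import List, Tuple


def sample_with_poisson_arrivals(
    lengths: List[Tuple[int, int]],
    num_requests: int,
    qps: float,
    seed: int = 0,
) -> List[Tuple[float, int, int]]:
    """Sample requests without replacement and assign Poisson arrival times.

    Returns (arrival_time_seconds, num_prefill_tokens, num_decode_tokens).
    Inter-arrival gaps of a Poisson process with rate qps are exponentially
    distributed with mean 1 / qps.
    """
    rng = random.Random(seed)
    sampled = rng.sample(lengths, k=min(num_requests, len(lengths)))
    now = 0.0
    requests = []
    for prefill, decode in sampled:
        now += rng.expovariate(qps)
        requests.append((now, prefill, decode))
    return requests


# Made-up (num_prefill_tokens, num_decode_tokens) pairs.
trace_lengths = [(10, 5), (20, 7), (30, 9), (40, 11)]
requests = sample_with_poisson_arrivals(trace_lengths, num_requests=3, qps=2.0)
```

Fixing the seed makes the sampled subset and arrival times reproducible across runs, which helps when comparing results against someone else's copy of the trace.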
Hi, the data directory is currently missing, which causes the following error:
FileNotFoundError: [Errno 2] No such file or directory: './data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv'
Did I miss any preprocessing steps?