missing data files - Githubissues

Sunt-ing commented 2 months ago

Hi, the data directory is currently missing, which causes the following error:

FileNotFoundError: [Errno 2] No such file or directory: './data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv'

Did I miss any preprocessing steps?

Sunt-ing commented 2 months ago

My command:

python -m sarathi.benchmark.capacity_search.main --tbt-slo-value 0.5 --config-path ./osdi-experiments/figure-5-6/mistral7b_relaxed.yml --output-dir ./capacity_search_output_mistral7b_relaxed

nitinkedia7 commented 2 months ago

Hi @Sunt-ing, unfortunately, due to license issues, we are not able to release certain traces. Each trace file, like sharegpt_8k_filtered_stats_llama2_tokenizer.csv above, has two columns: num_prefill_tokens and num_decode_tokens. The sharegpt_8k trace is obtained by processing openchat_sharegpt4_dataset, specifically the file sharegpt_gpt4.json. You can write a script to tokenize the text in this file using the llama2 tokenizer.
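A minimal sketch of the output side of such a script, assuming you have already extracted (prompt, response) text pairs from the raw file. `write_trace` and `count_tokens` are hypothetical names, and the whitespace splitter is only a placeholder for the llama2 tokenizer:

```python
import csv
import io
from typing import Callable, Iterable, Tuple


def write_trace(
    pairs: Iterable[Tuple[str, str]],
    count_tokens: Callable[[str], int],
    out,
) -> int:
    """Write one row per (prompt, response) pair and return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    # The two column names come from the trace description above.
    writer.writerow(["num_prefill_tokens", "num_decode_tokens"])
    rows = 0
    for prompt, response in pairs:
        writer.writerow([count_tokens(prompt), count_tokens(response)])
        rows += 1
    return rows


def whitespace_count(text: str) -> int:
    # Placeholder only: swap in a real llama2 tokenizer for actual traces.
    return len(text.split())


buf = io.StringIO()
n_rows = write_trace([("hello there", "hi , how can I help ?")], whitespace_count, buf)
trace_csv = buf.getvalue()
```

To write a real file, pass an open file handle instead of the `io.StringIO` buffer.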

Sunt-ing commented 2 months ago

@nitinkedia7 Thanks! By the way, how were num_prefill_tokens and num_decode_tokens counted for multi-round conversations?

nitinkedia7 commented 1 month ago

@Sunt-ing This is how the traces used in the paper are obtained:

splitwise_conv: Taken from AzurePublicDataset/data/AzureLLMInferenceTrace_conv.csv at master · Azure/AzurePublicDataset (github.com). In some experiments, we take only the length data from this file, not the arrival timestamps. In that case, we use Poisson arrivals.
splitwise_code: Same as splitwise_conv. Source: AzurePublicDataset/data/AzureLLMInferenceTrace_code.csv at master · Azure/AzurePublicDataset (github.com).
BWB: EleanorJiang/BlonDe: Official implementations for (1) BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation and (2) Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus (github.com). This dataset contains many books. Each chapter of every book has two text files: one for English and the other for Chinese (see chs_re.txt and ref_re.txt). Line X in ref_re.txt corresponds to Line X in chs_re.txt. For each such line pair, we tokenize using the Llama2 tokenizer and get num_prefill_tokens from the English text and num_decode_tokens from the Chinese text. There is no ordering: when the simulation runs, requests from the trace are selected in random order.
arxiv_summarization: ccdv/arxiv-summarization · Datasets at Hugging Face. Each entry is an arXiv paper with three fields: id, article (text containing the body of the paper) and abstract (text containing the abstract of the paper). We tokenize the article text to get num_prefill_tokens, and the abstract gives num_decode_tokens.
chat: The chat workload in the Vidur paper refers to lmsys-1m. Each row in the dataset is a conversation between the user and the assistant. Each conversation has the format:
```
[
{ "content": ..., "role": "user" },
{ "content": ..., "role": "assistant" },
{ "content": ..., "role": "user" },
{ "content": ..., "role": "assistant" },
]
```
Each conversation results in one or more requests in the trace. We keep a running sum of the tokens in each "content" in the above list. Whenever we encounter an assistant turn, a new request is generated with num_prefill_tokens = sum_of_tokens_in_all_previous_turns and num_decode_tokens = num_of_tokens_in_the_current_turn (the assistant's turn). Request order is randomized; even the order of requests from the same conversation is not kept.
sharegpt: This was obtained from openchat/openchat_sharegpt4_dataset at main (huggingface.co) in the same way as the lmsys-1m dataset was processed. Based on the number of requests in the trace and in the raw dataset (approximately 32k), openchat_8192.train.text.json is most likely the file used to generate the trace. Other candidates are sharegpt_clean.json and sharegpt_gpt4.json.
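The running-sum procedure described for lmsys-1m (and reused for sharegpt) can be sketched roughly as below. This is an illustration of the description, not the script used for the paper; `conversation_to_requests` and `count_tokens` are hypothetical names, and the whitespace splitter stands in for the Llama2 tokenizer:

```python
from typing import Callable, Dict, List, Tuple


def conversation_to_requests(
    conversation: List[Dict[str, str]],
    count_tokens: Callable[[str], int],
) -> List[Tuple[int, int]]:
    """Return (num_prefill_tokens, num_decode_tokens) for each assistant turn."""
    requests = []
    running_sum = 0  # tokens in all turns seen so far
    for turn in conversation:
        n = count_tokens(turn["content"])
        if turn["role"] == "assistant":
            # Prefill is everything before this turn; decode is this turn.
            requests.append((running_sum, n))
        running_sum += n
    return requests


def whitespace_count(text: str) -> int:
    # Placeholder tokenizer for illustration, not the Llama2 tokenizer.
    return len(text.split())


example = [
    {"content": "a b c", "role": "user"},
    {"content": "d e", "role": "assistant"},
    {"content": "f", "role": "user"},
    {"content": "g h i j", "role": "assistant"},
]
requests = conversation_to_requests(example, whitespace_count)
```

With this toy conversation the two requests are (3, 2) and (6, 4). To mimic the randomized request order, the combined list from all conversations could be shuffled afterwards, for example with `random.shuffle`.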

subhendukhatuya commented 1 month ago

@nitinkedia7 Thanks! 5. For lmsys-1m, the running sum for num_prefill_tokens should only apply where turn > 1; when turn = 1 (i.e. a single-turn conversation), num_prefill_tokens should be llama2_tokenizer(user content). Computed that way, I get a mean num_prefill_tokens over all samples of 713.03 with the llama2 tokenizer.

But I observe that the mean value (785.78) of the num_prefill_tokens column from https://github.com/project-etalon/etalon/blob/main/data/processed_traces/lmsys_chat_1m_conversation_stats_llama2_tokenizer.csv only matches if, even for turn = 1, num_prefill_tokens is calculated as llama2_tokenizer(user content) + llama2_tokenizer(assistant content). There is no issue with num_decode_tokens. Kindly clarify the num_prefill_tokens calculation!
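The two counting schemes being compared can be sketched as follows. The token counts are made up for illustration, and `include_current_turn` is a hypothetical flag, not something taken from the sarathi-serve or etalon code:

```python
from typing import List


def prefill_counts(turn_tokens: List[int], include_current_turn: bool) -> List[int]:
    """turn_tokens alternates user, assistant, user, assistant, ...

    Returns one prefill count per assistant turn.
    """
    counts = []
    running_sum = 0
    for i, n in enumerate(turn_tokens):
        is_assistant = i % 2 == 1
        if is_assistant:
            # Either stop before the assistant's reply or count it as well.
            counts.append(running_sum + n if include_current_turn else running_sum)
        running_sum += n
    return counts


single_turn = [10, 5]  # hypothetical: 10 user tokens, 5 assistant tokens
expected = prefill_counts(single_turn, include_current_turn=False)
overcounted = prefill_counts(single_turn, include_current_turn=True)
```

For this single-turn example the first scheme gives a prefill of 10 tokens and the second gives 15, so the second inflates every request's prefill by the length of the assistant's own reply.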

subhendukhatuya commented 1 month ago

@nitinkedia7 6. sharegpt: The total number of rows in https://github.com/project-etalon/etalon/blob/main/data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv is 30731. I observed that any content, from either Human or Assistant, that does not end with <|end_of_turn|> is ignored, so I followed the same rule. I also see that the order of the csv rows matches the order of the chats in openchat_8192.train.text.json. For conversation 11, which has 11 turns, num_prefill_tokens should be [866, 1985, 3179, 3340, 3520, 3631, 4745, 5141, 5636, 6195, 6823], but in the csv (rows 36 to 43) it is given as [866, 3340, 3520, 3631, 5141, 5636, 6195, 6823]. Please clarify why turns 2, 3 and 7 are not included and only 8 of the 11 turns (rows 36 to 43) are reported in the csv. I think there may be many more such cases; for the same reason, I get 31548 requests in total whereas the trace csv contains 30731 rows. Kindly clarify this! Please find the code and the corresponding csv I got at https://github.com/subhendukhatuya/LLMInference. Please check conversation 11 as an example!
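The gap for conversation 11 can be checked directly from the two lists quoted above. This is only a small sketch; the variable names are illustrative:

```python
expected = [866, 1985, 3179, 3340, 3520, 3631, 4745, 5141, 5636, 6195, 6823]
in_csv = [866, 3340, 3520, 3631, 5141, 5636, 6195, 6823]

csv_values = set(in_csv)
# Turns are numbered from 1, matching the description above.
missing_turns = [i for i, v in enumerate(expected, start=1) if v not in csv_values]
missing_values = [expected[i - 1] for i in missing_turns]
```

This identifies turns 2, 3 and 7 (prefill values 1985, 3179 and 4745) as the ones absent from the csv.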

subhendukhatuya commented 1 month ago

@nitinkedia7 4. arxiv_summarization: ccdv/arxiv-summarization · Datasets at Hugging Face. There are 216K rows in total, but I see only 186206 rows in the trace file here: https://github.com/project-etalon/etalon/blob/main/data/processed_traces/arxiv_summarization_filtered_stats_llama2_tokenizer.csv

Kindly clarify!

nitinkedia7 commented 1 month ago

Hi @subhendukhatuya,

For lmsys-1m, you are right that num_prefill_tokens is being overcounted. num_prefill_tokens should be the sum over all the previous messages, not including the current message. Thanks for pointing this out!
For sharegpt, it looks like 2.6% of the requests are missing. In the experiments, a few thousand randomly selected requests are used, so these missing requests should not cause a significant drift in the nature of the trace.
For arxiv_summarization, it is likely that only the train split was used. That means 9.1% of the requests are missing. This dataset has uniform-looking requests, as there are no multi-turn conversations where tokens accumulate, so these missing requests should not cause a significant drift in the nature of the trace.
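The 2.6% figure for sharegpt is consistent with the row counts reported earlier in this thread, assuming the percentage is taken relative to the 31548 reproduced requests:

```python
reproduced_rows = 31548  # requests obtained when re-deriving the sharegpt trace
published_rows = 30731   # rows in the published sharegpt csv

missing = reproduced_rows - published_rows
missing_fraction = missing / reproduced_rows
missing_percent = round(100 * missing_fraction, 1)
```

That is 817 missing requests, or about 2.6% of the reproduced total.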

Thank you for your interest in this project. I've provided the information I currently have available on this issue. Unfortunately, I won't be able to offer additional clarifications or assistance at this time due to other commitments. I encourage you to use your own judgement while parsing the raw datasets. Thanks for your understanding.

microsoft / sarathi-serve

missing data files #38