krai / axs2mlperf

Automated KRAI X workflows for reproducing MLPerf Inference submissions
https://krai.ai
MIT License

Convert Llama2 pickle to Llama3 #55

Open · G4V opened 2 months ago

G4V commented 2 months ago
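# Re-tokenize an MLPerf OpenOrca pickle from Llama2 to Llama3.
# Usage for this draft (positional args; the packaged version invoked later in
# this thread takes --input_pkl_path/--output_pkl_path instead):
#   python convert_pickle_llama2_to_llama3.py <input.pkl> <output.pkl>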
import pickle
import sys
from functools import partial

import pandas as pd
from transformers import AutoTokenizer

llama_prompt_system = "<|start_header_id|>system<|end_header_id|>\n\n{}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
llama_prompt_no_system = "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def format_llama_input(row):
    if row['system_prompt']:
        return llama_prompt_system.format(row['system_prompt'], row['question'])
    else:
        return llama_prompt_no_system.format(row['question'])

def _tokenize_helper(x, llama_tokenizer=None):
    if not isinstance(x, str):
        return []

    return llama_tokenizer(x)["input_ids"]

input_pkl = sys.argv[1]  #"/local/mnt/workspace/gsimpson/work_collection_old/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
output_pkl = sys.argv[2] #"/local/mnt/workspace/gsimpson/work_collection/downloaded_openorca_mlperf_dataset_llama3_full/open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
with open(input_pkl, "rb") as f:
    df = pickle.load(f)

df["input"] = df.apply(format_llama_input, axis=1)

input_tokenizer = partial(_tokenize_helper, llama_tokenizer=tok)
output_tokenizer = partial(_tokenize_helper, llama_tokenizer=tok)
df['tok_input'] = df['input'].apply(input_tokenizer)
df['tok_output'] = df['output'].apply(output_tokenizer)
df['tok_input_length'] = df['tok_input'].apply(lambda x: len(x))
df['tok_output_length'] = df['tok_output'].apply(lambda x: len(x))

print(df["input"][0])
print(input_tokenizer(df["input"][0]))
print(df["tok_input"][0] == input_tokenizer(df["input"][0]))

with open(output_pkl, "wb") as f:
    pickle.dump(df, f)
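
For reference, a minimal sketch to sanity-check the converted pickle (column names as written by the script above; the leading ids match those shown later in this thread):

import pickle

with open("open_orca_gpt4_tokenized_llama.sampled_24576.pkl", "rb") as f:
    df = pickle.load(f)

print(len(df))                            # expected: 24576 rows
print(df["tok_input_length"].describe())  # token counts after re-tokenization
print(df["tok_input"].iloc[0][:4])        # Llama3 ids, e.g. [128000, 128006, 9125, 128007]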
maria-18-git commented 2 months ago

1. Download hf_tokeniser

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ time axs byquery downloaded,hf_tokeniser,model_family=llama2,variant=7b,hf_token=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
...
        "/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/bin/huggingface-cli" download "meta-llama/Llama-2-7b-chat-hf" --include "tokenizer*" --local-dir "/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser" --local-dir-use-symlinks False --token=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
/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/lib/python3.8/site-packages/huggingface_hub/commands/download.py:132: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
  warnings.warn(
Fetching 3 files:   0%| 0/3 [00:00<?, ?it/s]
Downloading 'tokenizer.json' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer.json.a6e931b92caff4c79c5c56282f1e89569a0ae558.incomplete'
Downloading 'tokenizer.model' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer.model.9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347.incomplete'
Downloading 'tokenizer_config.json' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer_config.json.a0024735c8dd7afe47fe72792b2c4edaff63bd3b.incomplete'
tokenizer_config.json: 100% 1.62k/1.62k [00:00<00:00, 363kB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer_config.json
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 5.36MB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer.json
tokenizer.model: 100% 500k/500k [00:00<00:00, 2.52MB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer.model
Fetching 3 files: 100% 3/3 [00:00<00:00, 3.29it/s]
/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.

['^', 'byname', 'downloaded_Llama-2-7b-chat-hf_tokeniser']

real    0m6.397s
user    0m2.993s
sys     0m0.231s

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery downloaded,hf_tokeniser,model_family=llama2 --- , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
total 2312
drwxr-xr-x  3 mmirkina users    4096 Aug 28 06:08 .
drwxr-xr-x 67 mmirkina users    4096 Aug 28 06:08 ..
drwxr-xr-x  3 mmirkina users    4096 Aug 28 06:08 .cache
-rw-r--r--  1 mmirkina users     969 Aug 28 06:08 data_axs.json
-rw-r--r--  1 mmirkina users    1618 Aug 28 06:08 tokenizer_config.json
-rw-r--r--  1 mmirkina users 1842767 Aug 28 06:08 tokenizer.json
-rw-r--r--  1 mmirkina users  499723 Aug 28 06:08 tokenizer.model
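
A quick check that the downloaded tokenizer loads from this directory (a minimal sketch using the path reported above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser"
)
print(tok("Hello, world!")["input_ids"])  # Llama2 ids; the sequence starts with BOS id 1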
maria-18-git commented 2 months ago

2. Download dataset

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ time axs byquery downloaded,dataset_name=openorca
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/mlperf_inference_git_master/language/llama2-70b/processorca.py --dataset_pq_path=/local/mnt/workspace/mmirkina/work_collection/downloaded_1M-GPT4-Augmented.parquet/1M-GPT4-Augmented.parquet --model_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser --seqlen_limit=1024 --export_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset --num_total_samples=24576
Tokenizing input
Loaded parquet and tokenized in 831.7050881385803 sec.
Unique sample origin datasets: ['flan' 't0' 'cot' 'niv']
Subset 'cot' has 69692 samples
Subset 'flan' has 371689 samples
Subset 'niv' has 25195 samples
Subset 't0' has 109271 samples
Sampling 6144 from cot
Sampling 6144 from flan
Sampling 6144 from niv

...
['^', 'byname', 'downloaded_openorca_mlperf_dataset']

real    15m36.378s
user    14m44.304s
sys     0m38.590s

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery downloaded,dataset_name=openorca , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset
total 3992124
drwxr-xr-x  2 mmirkina users       4096 Aug 28 06:34 .
drwxr-xr-x 70 mmirkina users       4096 Aug 28 06:18 ..
-rw-r--r--  1 mmirkina users       2861 Aug 28 06:34 data_axs.json
-rw-r--r--  1 mmirkina users    3708395 Aug 28 06:34 open_orca_gpt4_tokenized_llama.calibration_1000.pkl
-rw-r--r--  1 mmirkina users  163295727 Aug 28 06:33 open_orca_gpt4_tokenized_llama.cot.pkl
-rw-r--r--  1 mmirkina users 1203462167 Aug 28 06:34 open_orca_gpt4_tokenized_llama.flan.pkl
-rw-r--r--  1 mmirkina users 1996603812 Aug 28 06:33 open_orca_gpt4_tokenized_llama.full.pkl
-rw-r--r--  1 mmirkina users  109943881 Aug 28 06:34 open_orca_gpt4_tokenized_llama.niv.pkl
-rw-r--r--  1 mmirkina users   90970516 Aug 28 06:34 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
-rw-r--r--  1 mmirkina users  519905608 Aug 28 06:34 open_orca_gpt4_tokenized_llama.t0.pkl
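
For reference, a minimal sketch to inspect the sampled pickle (column names assumed from the conversion script above; the 1024 limit comes from --seqlen_limit=1024 in the processorca.py call):

import pickle

path = "/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
with open(path, "rb") as f:
    df = pickle.load(f)

print(len(df), list(df.columns))     # expected: 24576 rows
print(df["tok_input_length"].max())  # should not exceed 1024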
maria-18-git commented 2 months ago

3. Convert llama2 pickle file to llama3 pickle file

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery converted,pickle_file,llama2_to_llama3
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3/convert_pickle_llama2_to_llama3.py --input_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --output_pkl_path /local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3/open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|start_header_id|>system<|end_header_id|>

You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-bystep and justify your answer.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Given the sentence "A woman with a fairy tattoo on her back is carrying a purse with a red floral print." can we conclude that "The woman's purse has red flowers on it."?
Options:
- yes
- it is not possible to tell
- no Now, let's be accurate as possible. Some thinking first:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 430, 8779, 1274, 1505, 2038, 13, 2724, 690, 499, 3041, 499, 264, 3488, 13, 4718, 3465, 374, 311, 4320, 439, 94176, 439, 499, 649, 13, 6104, 36864, 1781, 3094, 1481, 599, 752, 323, 9541, 701, 4320, 13, 128009, 198, 128006, 882, 128007, 271, 22818, 279, 11914, 330, 32, 5333, 449, 264, 45586, 32894, 389, 1077, 1203, 374, 15691, 264, 53101, 449, 264, 2579, 46119, 1194, 1210, 649, 584, 32194, 430, 330, 791, 5333, 596, 53101, 706, 2579, 19837, 389, 433, 1210, 5380, 3883, 512, 12, 10035, 198, 12, 433, 374, 539, 3284, 311, 3371, 198, 12, 912, 4800, 11, 1095, 596, 387, 13687, 439, 3284, 13, 4427, 7422, 1176, 25, 128009, 128006, 78191, 128007]
True
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'converted_pickle_file_llama2_to_llama3']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery converted,pickle_file,llama2_to_llama3 , get_path
/local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3
total 87156
drwxr-xr-x  2 mmirkina users     4096 Aug 28 10:15 .
drwxr-xr-x 74 mmirkina users     4096 Aug 28 10:15 ..
-rw-r--r--  1 mmirkina users      334 Aug 28 10:15 data_axs.json
-rw-r--r--  1 mmirkina users 89234160 Aug 28 10:15 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
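
A minimal sketch to confirm the converted pickle keeps all rows and now carries Llama3 ids (paths as reported above; Llama2 ids stay below 32000, while Llama3 header ids are 128000+):

import pickle

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

df2 = load("/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl")
df3 = load("/local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3/open_orca_gpt4_tokenized_llama.sampled_24576.pkl")

assert len(df2) == len(df3) == 24576
print(df3["tok_input"].iloc[0][0])  # expected: 128000 (<|begin_of_text|>)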
maria-18-git commented 2 months ago

Renamed an entry:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev$ axs byname convert_pickle_file_llama2_to_llama3
['^', 'byname', 'convert_pickle_file_llama2_to_llama3']
maria-18-git commented 2 months ago

Preprocess the converted pickle file from llama2.

1. Download the openorca dataset for llama2 (7b):

['^', 'byname', 'downloaded_openorca_mlperf_dataset_llama2_7b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama2,variant=7b , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b
total 3992132
drwxr-xr-x  2 mmirkina users       4096 Aug 30 09:24 .
drwxr-xr-x 77 mmirkina users       4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users        378 Aug 30 08:20 data_axs.json
-rw-r--r--  1 mmirkina users    3708395 Aug 30 08:35 open_orca_gpt4_tokenized_llama.calibration_1000.pkl
-rw-r--r--  1 mmirkina users  163295727 Aug 30 08:35 open_orca_gpt4_tokenized_llama.cot.pkl
-rw-r--r--  1 mmirkina users 1203462167 Aug 30 08:35 open_orca_gpt4_tokenized_llama.flan.pkl
-rw-r--r--  1 mmirkina users 1996603812 Aug 30 08:35 open_orca_gpt4_tokenized_llama.full.pkl
-rw-r--r--  1 mmirkina users  109943881 Aug 30 08:35 open_orca_gpt4_tokenized_llama.niv.pkl
-rw-r--r--  1 mmirkina users   90970516 Aug 30 08:35 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
-rw-r--r--  1 mmirkina users  519905608 Aug 30 08:35 open_orca_gpt4_tokenized_llama.t0.pkl

maria-18-git commented 2 months ago

2. Convert pickle file from llama2 to llama3:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama3,variant=8b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convert_pickle_file_llama2_to_llama3/convert_pickle_llama2_to_llama3.py --input_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --output_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|start_header_id|>system<|end_header_id|>

You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-bystep and justify your answer.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Given the sentence "A woman with a fairy tattoo on her back is carrying a purse with a red floral print." can we conclude that "The woman's purse has red flowers on it."?
Options:
- yes
- it is not possible to tell
- no Now, let's be accurate as possible. Some thinking first:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 430, 8779, 1274, 1505, 2038, 13, 2724, 690, 499, 3041, 499, 264, 3488, 13, 4718, 3465, 374, 311, 4320, 439, 94176, 439, 499, 649, 13, 6104, 36864, 1781, 3094, 1481, 599, 752, 323, 9541, 701, 4320, 13, 128009, 198, 128006, 882, 128007, 271, 22818, 279, 11914, 330, 32, 5333, 449, 264, 45586, 32894, 389, 1077, 1203, 374, 15691, 264, 53101, 449, 264, 2579, 46119, 1194, 1210, 649, 584, 32194, 430, 330, 791, 5333, 596, 53101, 706, 2579, 19837, 389, 433, 1210, 5380, 3883, 512, 12, 10035, 198, 12, 433, 374, 539, 3284, 311, 3371, 198, 12, 912, 4800, 11, 1095, 596, 387, 13687, 439, 3284, 13, 4427, 7422, 1176, 25, 128009, 128006, 78191, 128007]
True
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.

['^', 'byname', 'downloaded_openorca_dataset_llama3_8b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b
total 87160
drwxr-xr-x  2 mmirkina users     4096 Aug 30 08:58 .
drwxr-xr-x 77 mmirkina users     4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users      402 Aug 30 08:57 data_axs.json
-rw-r--r--  1 mmirkina users 89234160 Aug 30 08:58 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
maria-18-git commented 2 months ago

3. Preprocess llama2

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama2,variant=7b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama2_7b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 09:43 .
drwxr-xr-x 76 mmirkina users      4096 Aug 30 09:43 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 09:43 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 09:43 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 masked_tokens.bin
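
The sizes are consistent with 24576 samples of sequence length 1024 in int32 (24576 × 1024 × 4 = 100663296 bytes; 24576 × 4 = 98304 bytes for the lengths). A minimal sketch to read them back, with dtype and shape inferred from those sizes rather than from main.py:

import numpy as np

d = "/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b"
input_ids = np.fromfile(f"{d}/input_ids_padded.bin", dtype=np.int32).reshape(24576, 1024)
lengths = np.fromfile(f"{d}/input_lengths.bin", dtype=np.int32)

print(input_ids.shape, lengths.shape)          # (24576, 1024) (24576,)
print(int(lengths.min()), int(lengths.max()))  # expected within [1, 1024]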
maria-18-git commented 2 months ago

4. Preprocess llama3

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama3_8b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 09:48 .
drwxr-xr-x 77 mmirkina users      4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 09:48 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 09:48 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 masked_tokens.bin
maria-18-git commented 2 months ago

We should add downloading of the llama3 tokenizer and use it for preprocessing (the preprocessing run above used the llama2 tokenizer directory).

  1. Download tokenizer

    mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/llm_hf_weights_recipe$ axs byquery downloaded,hf_tokeniser,model_family=llama3,variant=8b,hf_token=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
    ...
        "/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/bin/huggingface-cli" download "meta-llama/Meta-Llama-3-8B" --include "tokenizer*" --local-dir "/local/mnt/workspace/mmirkina/work_collection/downloaded_Meta-Llama-3-8B_tokeniser" --local-dir-use-symlinks False --token=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
    ...
    ['^', 'byname', 'downloaded_Meta-Llama-3-8B_tokeniser']

    Then run the preprocessing again:

    mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b
    ...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Meta-Llama-3-8B_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama3_8b']

maria-18-git commented 2 months ago

Results:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 12:41 .
drwxr-xr-x 79 mmirkina users      4096 Aug 30 12:41 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 12:41 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 12:41 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 masked_tokens.bin

md5sum:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b$ md5sum *
ad85eb788057c30b577515c6a0ea9dde  attention_mask.bin
8d75c4e008272ca80b86921c3ce74c13  data_axs.json
ab2342a9d49ab1f262dda8a631c89ed3  input_ids_padded.bin
185870a3dcf544c5e8019b9253799bc5  input_lengths.bin
8f9a38edc2e0b024eb4240f7dd93a0cb  masked_tokens.bin

Gavin's results - md5sum:

gsimpson@aus121-r760-0:~/work_collection/preprocessed_openorca_dataset_full_2024.08.09_07h56m49s$ md5sum *
ad85eb788057c30b577515c6a0ea9dde  attention_mask.bin
bc8a9916b4b544ed2bc4034ca10fede3  input_ids_padded.bin
185870a3dcf544c5e8019b9253799bc5  input_lengths.bin
8f9a38edc2e0b024eb4240f7dd93a0cb  masked_tokens.bin
maria-18-git commented 2 months ago

Only input_ids_padded.bin differs.
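
A minimal sketch to localize the mismatch in input_ids_padded.bin (dtype and shape inferred from the file size, as above; Gavin's "~" is expanded here hypothetically):

import numpy as np

shape = (24576, 1024)
a = np.fromfile("/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b/input_ids_padded.bin", dtype=np.int32).reshape(shape)
b = np.fromfile("/home/gsimpson/work_collection/preprocessed_openorca_dataset_full_2024.08.09_07h56m49s/input_ids_padded.bin", dtype=np.int32).reshape(shape)  # hypothetical expansion of '~'

rows = np.where((a != b).any(axis=1))[0]
print(len(rows), "rows differ; first few:", rows[:5])
if len(rows):
    i = rows[0]
    j = int(np.where(a[i] != b[i])[0][0])
    print(f"row {i}, first differing position {j}: {a[i, j]} vs {b[i, j]}")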

maria-18-git commented 2 months ago

Converted pickle file:

gsimpson@aus121-r760-0:~/datasets/llama3/openorca$ md5sum open_orca_gpt4_tokenized_llama.sampled_24576.pkl
526a7f803d9600d90b766f42b8a4ca75  open_orca_gpt4_tokenized_llama.sampled_24576.pkl
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b$ md5sum open_orca_gpt4_tokenized_llama.sampled_24576.pkl
9abc215c84747ff248d5c3e5cec4442f  open_orca_gpt4_tokenized_llama.sampled_24576.pkl
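
Since the converted pickles differ by md5 as well, a minimal sketch to find which columns disagree (assumes both unpickle to DataFrames with list-valued token columns, as in the conversion script; Gavin's "~" expanded hypothetically):

import pickle

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

df_g = load("/home/gsimpson/datasets/llama3/openorca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl")  # hypothetical expansion of '~'
df_m = load("/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl")

for col in sorted(set(df_g.columns) & set(df_m.columns)):
    same = all(x == y for x, y in zip(df_g[col], df_m[col]))
    print(col, "matches" if same else "DIFFERS")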
maria-18-git commented 2 months ago

Commits:

- axs2mlperf:
  - Added model_family, variant for llama2 in dataset_openorca_mlperf_recipe
  - Added tokeniser rule for llama3
  - Added model_family, variant to openorca_preprocessor
- axs2qaic-dev:
  - Added model_family and variant to convert_pickle_file_llama2_to_llama3

For debugging: run a short accuracy experiment with the converted pickle file, as in the reference code (see the sketch below).
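
A minimal sketch for carving out a small subset for such a short run (the row count and file names here are illustrative only):

import pickle

N = 128  # illustrative short-run size

with open("open_orca_gpt4_tokenized_llama.sampled_24576.pkl", "rb") as f:
    df = pickle.load(f)

with open(f"open_orca_gpt4_tokenized_llama.sampled_{N}.pkl", "wb") as f:
    pickle.dump(df.head(N).reset_index(drop=True), f)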