G4V opened this issue 2 months ago
Downloading the hf_tokeniser:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ time axs byquery downloaded,hf_tokeniser,model_family=llama2,variant=7b,hf_token=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
...
"/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/bin/huggingface-cli" download "meta-llama/Llama-2-7b-chat-hf" --include "tokenizer*" --local-dir "/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser" --local-dir-use-symlinks False --token=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
/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/lib/python3.8/site-packages/huggingface_hub/commands/download.py:132: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
warnings.warn(
Fetching 3 files: 0%| | 0/3 [00:00<?, ?it/s]
Downloading 'tokenizer.json' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer.json.a6e931b92caff4c79c5c56282f1e89569a0ae558.incomplete'
Downloading 'tokenizer.model' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer.model.9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347.incomplete'
Downloading 'tokenizer_config.json' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer_config.json.a0024735c8dd7afe47fe72792b2c4edaff63bd3b.incomplete'
tokenizer_config.json: 100%|██████████| 1.62k/1.62k [00:00<00:00, 363kB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer_config.json
tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 5.36MB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer.json
tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 2.52MB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer.model
Fetching 3 files: 100%|██████████| 3/3 [00:00<00:00, 3.29it/s]
/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.
['^', 'byname', 'downloaded_Llama-2-7b-chat-hf_tokeniser']
real 0m6.397s
user 0m2.993s
sys 0m0.231s
Path:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery downloaded,hf_tokeniser,model_family=llama2 --- , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
total 2312
drwxr-xr-x 3 mmirkina users 4096 Aug 28 06:08 .
drwxr-xr-x 67 mmirkina users 4096 Aug 28 06:08 ..
drwxr-xr-x 3 mmirkina users 4096 Aug 28 06:08 .cache
-rw-r--r-- 1 mmirkina users 969 Aug 28 06:08 data_axs.json
-rw-r--r-- 1 mmirkina users 1618 Aug 28 06:08 tokenizer_config.json
-rw-r--r-- 1 mmirkina users 1842767 Aug 28 06:08 tokenizer.json
-rw-r--r-- 1 mmirkina users 499723 Aug 28 06:08 tokenizer.model
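As a quick sanity check on a downloaded entry (a hypothetical helper, not part of axs), one could verify that the three files matched by the `tokenizer*` include pattern are all present:

```python
from pathlib import Path
import tempfile

# Files the "tokenizer*" include pattern is expected to fetch.
EXPECTED = {"tokenizer.json", "tokenizer.model", "tokenizer_config.json"}

def missing_tokenizer_files(entry_dir: str) -> set:
    """Return the expected tokenizer files that are absent from entry_dir."""
    present = {p.name for p in Path(entry_dir).iterdir() if p.is_file()}
    return EXPECTED - present

# Demo on a throwaway directory mimicking the downloaded entry above.
with tempfile.TemporaryDirectory() as d:
    for name in EXPECTED | {"data_axs.json"}:
        (Path(d) / name).touch()
    print(missing_tokenizer_files(d))  # set()
```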
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ time axs byquery downloaded,dataset_name=openorca
...
/usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/mlperf_inference_git_master/language/llama2-70b/processorca.py --dataset_pq_path=/local/mnt/workspace/mmirkina/work_collection/downloaded_1M-GPT4-Augmented.parquet/1M-GPT4-Augmented.parquet --model_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser --seqlen_limit=1024 --export_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset --num_total_samples=24576
Tokenizing input
Loaded parquet and tokenized in 831.7050881385803 sec.
Unique sample origin datasets: ['flan' 't0' 'cot' 'niv']
Subset 'cot' has 69692 samples
Subset 'flan' has 371689 samples
Subset 'niv' has 25195 samples
Subset 't0' has 109271 samples
Sampling 6144 from cot
Sampling 6144 from flan
Sampling 6144 from niv
...
['^', 'byname', 'downloaded_openorca_mlperf_dataset']
real 15m36.378s
user 14m44.304s
sys 0m38.590s
Path:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery downloaded,dataset_name=openorca , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset
total 3992124
drwxr-xr-x 2 mmirkina users 4096 Aug 28 06:34 .
drwxr-xr-x 70 mmirkina users 4096 Aug 28 06:18 ..
-rw-r--r-- 1 mmirkina users 2861 Aug 28 06:34 data_axs.json
-rw-r--r-- 1 mmirkina users 3708395 Aug 28 06:34 open_orca_gpt4_tokenized_llama.calibration_1000.pkl
-rw-r--r-- 1 mmirkina users 163295727 Aug 28 06:33 open_orca_gpt4_tokenized_llama.cot.pkl
-rw-r--r-- 1 mmirkina users 1203462167 Aug 28 06:34 open_orca_gpt4_tokenized_llama.flan.pkl
-rw-r--r-- 1 mmirkina users 1996603812 Aug 28 06:33 open_orca_gpt4_tokenized_llama.full.pkl
-rw-r--r-- 1 mmirkina users 109943881 Aug 28 06:34 open_orca_gpt4_tokenized_llama.niv.pkl
-rw-r--r-- 1 mmirkina users 90970516 Aug 28 06:34 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
-rw-r--r-- 1 mmirkina users 519905608 Aug 28 06:34 open_orca_gpt4_tokenized_llama.t0.pkl
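processorca.py builds the 24576-sample set by drawing 6144 examples from each of the four origin subsets (4 × 6144 = 24576). A minimal stdlib sketch of that per-subset sampling step (function name and toy data are illustrative, not taken from processorca.py):

```python
import random

def sample_per_subset(subsets, n_per_subset, seed=0):
    """Draw n_per_subset items from each subset; a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    picked = []
    for name in sorted(subsets):            # stable iteration order
        print(f"Sampling {n_per_subset} from {name}")
        picked.extend(rng.sample(subsets[name], n_per_subset))
    return picked

# Toy stand-ins for the 'cot', 'flan', 'niv' and 't0' subsets.
subsets = {name: list(range(1000)) for name in ("cot", "flan", "niv", "t0")}
sampled = sample_per_subset(subsets, 25)
print(len(sampled))  # 100 (25 from each of the 4 subsets)
```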
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery converted,pickle_file,llama2_to_llama3
...
/usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3/convert_pickle_llama2_to_llama3.py --input_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --output_pkl_path /local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3/open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|start_header_id|>system<|end_header_id|>
You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-bystep and justify your answer.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Given the sentence "A woman with a fairy tattoo on her back is carrying a purse with a red floral print." can we conclude that "The woman's purse has red flowers on it."?
Options:
- yes
- it is not possible to tell
- no Now, let's be accurate as possible. Some thinking first:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 430, 8779, 1274, 1505, 2038, 13, 2724, 690, 499, 3041, 499, 264, 3488, 13, 4718, 3465, 374, 311, 4320, 439, 94176, 439, 499, 649, 13, 6104, 36864, 1781, 3094, 1481, 599, 752, 323, 9541, 701, 4320, 13, 128009, 198, 128006, 882, 128007, 271, 22818, 279, 11914, 330, 32, 5333, 449, 264, 45586, 32894, 389, 1077, 1203, 374, 15691, 264, 53101, 449, 264, 2579, 46119, 1194, 1210, 649, 584, 32194, 430, 330, 791, 5333, 596, 53101, 706, 2579, 19837, 389, 433, 1210, 5380, 3883, 512, 12, 10035, 198, 12, 433, 374, 539, 3284, 311, 3371, 198, 12, 912, 4800, 11, 1095, 596, 387, 13687, 439, 3284, 13, 4427, 7422, 1176, 25, 128009, 128006, 78191, 128007]
True
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.
['^', 'byname', 'converted_pickle_file_llama2_to_llama3']
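The integer list printed by the converter matches the rendered Llama-3 chat template: 128000 (`<|begin_of_text|>`), 128006/128007 (`<|start_header_id|>`/`<|end_header_id|>`) and 128009 (`<|eot_id|>`) are the Llama-3 special-token IDs (an assumption read off the converter output above). A small sanity check one could add to the converter:

```python
# Llama-3 special-token IDs (assumed; taken from the converter output above).
BOS, START_HDR, END_HDR, EOT = 128000, 128006, 128007, 128009

def has_llama3_framing(ids):
    """A converted sample should open with
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    and contain at least one <|eot_id|> terminator."""
    return ids[:2] == [BOS, START_HDR] and END_HDR in ids[:8] and EOT in ids

# Truncated prefix of the converted sample above (9125 encodes "system").
prefix = [128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 128009]
print(has_llama3_framing(prefix))  # True
```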
Path:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery converted,pickle_file,llama2_to_llama3 , get_path
/local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3
total 87156
drwxr-xr-x 2 mmirkina users 4096 Aug 28 10:15 .
drwxr-xr-x 74 mmirkina users 4096 Aug 28 10:15 ..
-rw-r--r-- 1 mmirkina users 334 Aug 28 10:15 data_axs.json
-rw-r--r-- 1 mmirkina users 89234160 Aug 28 10:15 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Renamed an entry:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev$ axs byname convert_pickle_file_llama2_to_llama3
['^', 'byname', 'convert_pickle_file_llama2_to_llama3']
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/dataset_openorca_mlperf_recipe$ axs byquery downloaded,dataset_name=openorca,model_family=llama2,variant=7b
...
/usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/mlperf_inference_git_master/language/llama2-70b/processorca.py --dataset_pq_path=/local/mnt/workspace/mmirkina/work_collection/downloaded_1M-GPT4-Augmented.parquet/1M-GPT4-Augmented.parquet --model_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser --seqlen_limit=1024 --export_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b --num_total_samples=24576
Tokenizing input
Loaded parquet and tokenized in 834.4697501659393 sec.
Unique sample origin datasets: ['flan' 't0' 'cot' 'niv']
Subset 'cot' has 69692 samples
Subset 'flan' has 371689 samples
Subset 'niv' has 25195 samples
Subset 't0' has 109271 samples
Sampling 6144 from cot
Sampling 6144 from flan
Sampling 6144 from niv
Sampling 6144 from t0
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.
['^', 'byname', 'downloaded_openorca_mlperf_dataset_llama2_7b']
Path:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama2,variant=7b , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b
total 3992132
drwxr-xr-x  2 mmirkina users       4096 Aug 30 09:24 .
drwxr-xr-x 77 mmirkina users       4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users        378 Aug 30 08:20 data_axs.json
-rw-r--r--  1 mmirkina users    3708395 Aug 30 08:35 open_orca_gpt4_tokenized_llama.calibration_1000.pkl
-rw-r--r--  1 mmirkina users  163295727 Aug 30 08:35 open_orca_gpt4_tokenized_llama.cot.pkl
-rw-r--r--  1 mmirkina users 1203462167 Aug 30 08:35 open_orca_gpt4_tokenized_llama.flan.pkl
-rw-r--r--  1 mmirkina users 1996603812 Aug 30 08:35 open_orca_gpt4_tokenized_llama.full.pkl
-rw-r--r--  1 mmirkina users  109943881 Aug 30 08:35 open_orca_gpt4_tokenized_llama.niv.pkl
-rw-r--r--  1 mmirkina users   90970516 Aug 30 08:35 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
-rw-r--r--  1 mmirkina users  519905608 Aug 30 08:35 open_orca_gpt4_tokenized_llama.t0.pkl
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama3,variant=8b
...
/usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convert_pickle_file_llama2_to_llama3/convert_pickle_llama2_to_llama3.py --input_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --output_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|start_header_id|>system<|end_header_id|>
You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-bystep and justify your answer.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Given the sentence "A woman with a fairy tattoo on her back is carrying a purse with a red floral print." can we conclude that "The woman's purse has red flowers on it."?
Options:
- yes
- it is not possible to tell
- no Now, let's be accurate as possible. Some thinking first:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 430, 8779, 1274, 1505, 2038, 13, 2724, 690, 499, 3041, 499, 264, 3488, 13, 4718, 3465, 374, 311, 4320, 439, 94176, 439, 499, 649, 13, 6104, 36864, 1781, 3094, 1481, 599, 752, 323, 9541, 701, 4320, 13, 128009, 198, 128006, 882, 128007, 271, 22818, 279, 11914, 330, 32, 5333, 449, 264, 45586, 32894, 389, 1077, 1203, 374, 15691, 264, 53101, 449, 264, 2579, 46119, 1194, 1210, 649, 584, 32194, 430, 330, 791, 5333, 596, 53101, 706, 2579, 19837, 389, 433, 1210, 5380, 3883, 512, 12, 10035, 198, 12, 433, 374, 539, 3284, 311, 3371, 198, 12, 912, 4800, 11, 1095, 596, 387, 13687, 439, 3284, 13, 4427, 7422, 1176, 25, 128009, 128006, 78191, 128007]
True
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.
['^', 'byname', 'downloaded_openorca_dataset_llama3_8b']
Path:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b
total 87160
drwxr-xr-x 2 mmirkina users 4096 Aug 30 08:58 .
drwxr-xr-x 77 mmirkina users 4096 Aug 30 09:48 ..
-rw-r--r-- 1 mmirkina users 402 Aug 30 08:57 data_axs.json
-rw-r--r-- 1 mmirkina users 89234160 Aug 30 08:58 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama2,variant=7b
...
/usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.
['^', 'byname', 'preprocessed_openorca_dataset_full_llama2_7b']
Path:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b
total 295020
drwxr-xr-x 2 mmirkina users 4096 Aug 30 09:43 .
drwxr-xr-x 76 mmirkina users 4096 Aug 30 09:43 ..
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 09:43 attention_mask.bin
-rw-r--r-- 1 mmirkina users 269 Aug 30 09:43 data_axs.json
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 09:43 input_ids_padded.bin
-rw-r--r-- 1 mmirkina users 98304 Aug 30 09:43 input_lengths.bin
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 09:43 masked_tokens.bin
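The file sizes are consistent with 24576 samples of seqlen 1024 stored as 4-byte int32 values: 24576 × 1024 × 4 = 100663296 bytes for each padded tensor, and 24576 × 4 = 98304 bytes for the per-sample lengths. A stdlib sketch of producing such a flat padded buffer (the exact layout used by main.py is an assumption inferred from these sizes):

```python
from array import array
import os, tempfile

# Size arithmetic for the real files above (int32 = 4 bytes).
assert 24576 * 1024 * 4 == 100663296   # attention_mask / input_ids_padded / masked_tokens
assert 24576 * 4 == 98304              # input_lengths

SEQLEN, PAD_ID = 8, 2                  # toy scale; the real run uses seqlen 1024

def pad_and_dump(samples, path):
    """Right-pad each token list to SEQLEN and write one flat int32 buffer."""
    buf = array("i")                   # C int, 4 bytes on mainstream platforms
    for ids in samples:
        buf.extend(ids + [PAD_ID] * (SEQLEN - len(ids)))
    with open(path, "wb") as f:
        buf.tofile(f)

samples = [[5, 6, 7], [8, 9], [10], [11, 12, 13, 14]]
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "input_ids_padded.bin")
    pad_and_dump(samples, path)
    print(os.path.getsize(path))       # 4 samples * 8 tokens * 4 bytes = 128
```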
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b
...
/usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.
['^', 'byname', 'preprocessed_openorca_dataset_full_llama3_8b']
Path:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
total 295020
drwxr-xr-x 2 mmirkina users 4096 Aug 30 09:48 .
drwxr-xr-x 77 mmirkina users 4096 Aug 30 09:48 ..
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 09:48 attention_mask.bin
-rw-r--r-- 1 mmirkina users 269 Aug 30 09:48 data_axs.json
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 09:48 input_ids_padded.bin
-rw-r--r-- 1 mmirkina users 98304 Aug 30 09:48 input_lengths.bin
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 09:48 masked_tokens.bin
We should add downloading of the tokeniser for llama3 and use it for preprocessing:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/llm_hf_weights_recipe$ axs byquery downloaded,hf_tokeniser,model_family=llama3,variant=8b,hf_token=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
...
"/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/bin/huggingface-cli" download "meta-llama/Meta-Llama-3-8B" --include "tokenizer*" --local-dir "/local/mnt/work
space/mmirkina/work_collection/downloaded_Meta-Llama-3-8B_tokeniser" --local-dir-use-symlinks False --token=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
...
['^', 'byname', 'downloaded_Meta-Llama-3-8B_tokeniser']
Then run preprocess again:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b
...
/usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Meta-Llama-3-8B_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.
['^', 'byname', 'preprocessed_openorca_dataset_full_llama3_8b']
Results:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
total 295020
drwxr-xr-x 2 mmirkina users 4096 Aug 30 12:41 .
drwxr-xr-x 79 mmirkina users 4096 Aug 30 12:41 ..
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 12:41 attention_mask.bin
-rw-r--r-- 1 mmirkina users 269 Aug 30 12:41 data_axs.json
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 12:41 input_ids_padded.bin
-rw-r--r-- 1 mmirkina users 98304 Aug 30 12:41 input_lengths.bin
-rw-r--r-- 1 mmirkina users 100663296 Aug 30 12:41 masked_tokens.bin
md5sum:
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b$ md5sum *
ad85eb788057c30b577515c6a0ea9dde attention_mask.bin
8d75c4e008272ca80b86921c3ce74c13 data_axs.json
ab2342a9d49ab1f262dda8a631c89ed3 input_ids_padded.bin
185870a3dcf544c5e8019b9253799bc5 input_lengths.bin
8f9a38edc2e0b024eb4240f7dd93a0cb masked_tokens.bin
Gavin's results - md5sum:
gsimpson@aus121-r760-0:~/work_collection/preprocessed_openorca_dataset_full_2024.08.09_07h56m49s$ md5sum *
ad85eb788057c30b577515c6a0ea9dde attention_mask.bin
bc8a9916b4b544ed2bc4034ca10fede3 input_ids_padded.bin
185870a3dcf544c5e8019b9253799bc5 input_lengths.bin
8f9a38edc2e0b024eb4240f7dd93a0cb masked_tokens.bin
Only input_ids_padded.bin is different.
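The comparison above can be scripted; a hashlib sketch (helper names are illustrative) that reports which common files differ between two result directories:

```python
import hashlib, os, tempfile

def md5_of(path, chunk=1 << 20):
    """MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def differing_files(dir_a, dir_b):
    """Names present in both directories whose MD5 digests disagree."""
    common = set(os.listdir(dir_a)) & set(os.listdir(dir_b))
    return sorted(name for name in common
                  if md5_of(os.path.join(dir_a, name)) != md5_of(os.path.join(dir_b, name)))

# Demo with synthetic stand-ins for the two result sets.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    for d, ids in ((a, b"llama2 ids"), (b, b"llama3 ids")):
        with open(os.path.join(d, "attention_mask.bin"), "wb") as f:
            f.write(b"same contents")
        with open(os.path.join(d, "input_ids_padded.bin"), "wb") as f:
            f.write(ids)
    print(differing_files(a, b))  # ['input_ids_padded.bin']
```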
Converted pickle file:
gsimpson@aus121-r760-0:~/datasets/llama3/openorca$ md5sum open_orca_gpt4_tokenized_llama.sampled_24576.pkl
526a7f803d9600d90b766f42b8a4ca75 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b$ md5sum open_orca_gpt4_tokenized_llama.sampled_24576.pkl
9abc215c84747ff248d5c3e5cec4442f open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Commits:
- axs2mlperf:
  - Added model_family, variant for llama2 in dataset_openorca_mlperf_recipe
  - Added tokeniser rule for llama3
  - Added model_family, variant to openorca_preprocessor
- axs2qaic-dev:
  - Added model_family and variant to convert_pickle_file_llama2_to_llama3
For debugging: run a short accuracy experiment, as in the reference code, using the converted pickle file.