lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0

delete_original #179

Open ayttop opened 2 months ago

ayttop commented 2 months ago

Where do I put `delete_original = True`?

ayttop commented 2 months ago

```python
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-13B", delete_original=True)
```

Is this correct?
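For what it's worth, a minimal sketch of where the flag goes, assuming the `from_pretrained` keyword arguments described in the AirLLM README (`delete_original`, and optionally `layer_shards_saving_path`) behave as documented:

```python
from airllm import AutoModel

# delete_original=True is passed directly to from_pretrained; after each
# original .bin shard has been split into per-layer safetensors files,
# that shard is deleted to free disk space (see the log further down).
model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-13B",
    delete_original=True,
    # layer_shards_saving_path="/content/splitted",  # optional, assumed from the README: where to write the per-layer files
)
```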

ayttop commented 2 months ago

```python
from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-13B", delete_original=True)

# or use the model's local path...
# model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'Who is Napoleon Bonaparte?',
    # 'I like',  # a second, shorter prompt can't be batched here because padding=False
]

input_tokens = model.tokenizer(input_text, return_tensors="pt", return_attention_mask=False, truncation=True, max_length=MAX_LENGTH, padding=False)

generation_output = model.generate(input_tokens['input_ids'].cuda(), max_new_tokens=20, use_cache=True, return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)
```
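A side note on the output: with `return_dict_in_generate=True`, `generation_output.sequences` contains the prompt tokens followed by the new tokens, so the printed text repeats the question. A small, hypothetical follow-up using plain transformers slicing (nothing AirLLM-specific) prints only the continuation:

```python
# decode only the tokens generated after the prompt (standard transformers usage)
prompt_len = input_tokens['input_ids'].shape[1]
new_tokens = generation_output.sequences[0][prompt_len:]
print(model.tokenizer.decode(new_tokens, skip_special_tokens=True))
```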

The log from the Colab run:

```
cache_utils installed
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: The secret HF_TOKEN does not exist in your Colab secrets. To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. You will be able to reuse this secret in all of your notebooks. Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Fetching 11 files: 100% 11/11 [00:00<00:00, 670.43it/s]
found_layers: {'model.embed_tokens.': True, 'model.layers.0.': True, 'model.layers.1.': True, 'model.layers.2.': True, 'model.layers.3.': True, 'model.layers.4.': True, 'model.layers.5.': True, 'model.layers.6.': True, 'model.layers.7.': True, 'model.layers.8.': True, 'model.layers.9.': True, 'model.layers.10.': True, 'model.layers.11.': True, 'model.layers.12.': True, 'model.layers.13.': True, 'model.layers.14.': True, 'model.layers.15.': True, 'model.layers.16.': True, 'model.layers.17.': True, 'model.layers.18.': False, 'model.layers.19.': False, 'model.layers.20.': False, 'model.layers.21.': False, 'model.layers.22.': False, 'model.layers.23.': False, 'model.layers.24.': False, 'model.layers.25.': False, 'model.layers.26.': False, 'model.layers.27.': False, 'model.layers.28.': False, 'model.layers.29.': False, 'model.layers.30.': False, 'model.layers.31.': False, 'model.layers.32.': False, 'model.layers.33.': False, 'model.layers.34.': False, 'model.layers.35.': False, 'model.layers.36.': False, 'model.layers.37.': False, 'model.layers.38.': False, 'model.layers.39.': False, 'model.norm.': False, 'lm_head.': False}
some layer splits found, some are not, re-save all layers in case there's some corruptions.
Loading shard 1/3
/usr/local/lib/python3.10/dist-packages/airllm/utils.py:296: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict.update(torch.load(to_load, map_location='cpu'))
deleting original file: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/pytorch_model-00001-of-00003.bin
Loading shard 2/3
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.18.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.19.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.20.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.21.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.22.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.23.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.24.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.25.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.26.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.27.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.28.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.29.safetensors
deleting original file: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/pytorch_model-00002-of-00003.bin
Loading shard 3/3
Fetching 1 files: 100% 1/1 [01:07<00:00, 67.51s/it]
pytorch_model-00003-of-00003.bin: 100% 6.18G/6.18G [01:07<00:00, 151MB/s]
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.30.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.31.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.32.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.33.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.34.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.35.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.36.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.37.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.38.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.layers.39.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/model.norm.safetensors
saved as: /root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01/splitted_model/lm_head.safetensors
100%|██████████| 43/43 [08:24<00:00, 11.73s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(cuda:0): 100%|██████████| 43/43 [02:00<00:00, 2.79s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:56<00:00, 2.71s/it]
running layers(cuda:0): 100%|██████████| 43/43 [02:01<00:00, 2.82s/it]
running layers(cuda:0): 100%|██████████| 43/43 [02:02<00:00, 2.84s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:59<00:00, 2.78s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:59<00:00, 2.77s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:58<00:00, 2.77s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:58<00:00, 2.76s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:57<00:00, 2.74s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:57<00:00, 2.74s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:58<00:00, 2.76s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:58<00:00, 2.75s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:58<00:00, 2.76s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:59<00:00, 2.77s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:58<00:00, 2.75s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:59<00:00, 2.77s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:58<00:00, 2.75s/it]
running layers(cuda:0): 100%|██████████| 43/43 [02:00<00:00, 2.80s/it]
running layers(cuda:0): 100%|██████████| 43/43 [01:58<00:00, 2.76s/it]
running layers(cuda:0): 100%|██████████| 43/43 [02:05<00:00, 2.93s/it]
Who is Napoleon Bonaparte? Napoleon Bonaparte was a French military and political leader who rose to promin
```
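The "deleting original file" lines above are the `delete_original=True` behaviour: once a shard has been re-saved as per-layer safetensors under `splitted_model/`, the original .bin shard is removed. A hypothetical quick check, using the snapshot path printed in the log, to confirm this on disk:

```python
from pathlib import Path

# snapshot directory taken from the log above (hypothetical check, not part of AirLLM)
snap = Path("/root/.cache/huggingface/hub/models--garage-bAInd--Platypus2-13B"
            "/snapshots/dc1024c1b9df38f57f6436a02d31706cb0deaa01")

# with delete_original=True the pytorch_model-*.bin shards should be gone...
print(list(snap.glob("pytorch_model-*.bin")))                       # expected: []
# ...while the 43 per-layer files (40 layers + embed_tokens + norm + lm_head) remain
print(len(list((snap / "splitted_model").glob("*.safetensors"))))   # expected: 43
```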

ayttop commented 2 months ago

It worked in Colab on a T4 with garage-bAInd/Platypus2-13B:

```python
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-13B", delete_original=True)
```

ayttop commented 2 months ago