huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Rope_scaling not implemented. Issue using `deepseek-ai/deepseek-coder-6.7b-instruct` #439

Open michaelfeil opened 7 months ago

michaelfeil commented 7 months ago

I am using the newest AMI image from yesterday, with optimum-neuron 0.0.17 (https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2). I have not tried another image yet.

I am evaluating `AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")`. On CPU (plain torch) I get the output below; with the Neuron equivalent I get `\n` repeated 512 times.

outputs_decoded=Sure, here is a simple implementation of the Quick Sort algorithm in Python:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less_than_pivot = [x for x in arr[1:] if x <= pivot]
        greater_than_pivot = [x for x in arr[1:] if x > pivot]
        return quick_sort(less_than_pivot) + [pivot] + quick_sort(greater
```

Reproduction script:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

def run_generation_with_model(model_fn):
    model: AutoModelForCausalLM = model_fn()

    tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
    messages = [
        {'role': 'user', 'content': "write a quick sort algorithm in python."}
    ]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    # 32021 is the id of the <|EOT|> token
    outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, num_return_sequences=1, eos_token_id=32021)
    outputs_decoded = tokenizer.decode(outputs[0][len(inputs[0]):])
    print(f"model.cls={model.__class__} and inputs {tokenizer.decode(inputs[0])} outputs_decoded={outputs_decoded}")

def model_torch():
    model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
    return model

def model_neuron():
    from optimum.neuron import NeuronModelForCausalLM
    compiler_args = {"num_cores": 2, "auto_cast_type": "f16"}
    input_shapes = {
        "batch_size": 1,
        "sequence_length": 1024,
    }
    model = NeuronModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", export=True, **compiler_args, **input_shapes)
    return model

if __name__ == "__main__":
    run_generation_with_model(model_neuron)
    run_generation_with_model(model_torch)
```

Output:

2024-01-24 23:41:57.000553:  96357  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 23:41:57.000953:  96357  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_c96948172adcf9c8b465+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-24 23:41:58.000084:  96357  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 23:41:58.000228:  96362  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 23:41:58.000529:  96357  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_efe464d909d639de6c33+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-24 23:41:58.000650:  96357  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 23:41:58.000684:  96362  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_db8524dccbb520a8be0e+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-24 23:41:58.000790:  96362  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 23:41:58.000876:  96365  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 23:41:59.000368:  96362  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_dbb9375e305a952606f0+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-24 23:41:59.000368:  96357  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_8fd6c96184408083ace9+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-24 23:41:59.000375:  96365  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_82d24899460a51628ab8+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-24 23:41:59.000467:  96362  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 23:41:59.000506:  96357  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 23:41:59.000812:  96362  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_b028d4f002a8634d6f7c+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-01-24 23:41:59.000940:  96357  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_8a228e3b0a1ed4cce775+2c2d707e/model.neff. Exiting with a successfully compiled graph.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Both `max_new_tokens` (=512) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Setting `pad_token_id` to `eos_token_id`:32021 for open-end generation.
2024-Jan-24 23:42:22.0848 95923:96572 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-Jan-24 23:42:22.0848 95923:96572 [1] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?
model.cls=<class 'optimum.neuron.modeling.NeuronModelForCausalLM'> and inputs <|begin▁of▁sentence|>You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer
### Instruction:
write a quick sort algorithm in python.
### Response:
 outputs_decoded=

(@michaelfeil: many `\n` omitted)

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.46it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32021 for open-end generation.
model.cls=<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'> and inputs <|begin▁of▁sentence|>You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer
### Instruction:
write a quick sort algorithm in python.
### Response:
 outputs_decoded=Sure, here is a simple implementation of the Quick Sort algorithm in Python:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less_than_pivot = [x for x in arr[1:] if x <= pivot]
        greater_than_pivot = [x for x in arr[1:] if x > pivot]
        return quick_sort(less_than_pivot) + [pivot] + quick_sort(greater_than_pivot)

# Test the function
arr = [10, 7, 8, 9, 1, 5]
print("Original array:", arr)
print("Sorted array:", quick_sort(arr))
```

This code works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted.
<|EOT|>
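
For reference, the checkpoint's config can be inspected to confirm that it actually requests rope scaling (a minimal sketch; `rope_scaling` and `rope_theta` are the standard transformers Llama config fields, and the printed values are simply whatever the hub config contains):

```python
# Sketch: check whether the checkpoint requests rope scaling, which the
# Neuron export may not implement (see issue title).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
print(config.rope_scaling)  # e.g. a dict like {"type": "linear", "factor": ...}
print(config.rope_theta)    # RoPE base frequency used by the attention layers
```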
jimburtoft commented 7 months ago

We have seen similar behavior with togethercomputer/LLaMA-2-7B-32K (which, like deepseek-coder, relies on rope scaling for its extended context window).

You can also replicate the example above with this code:

```python
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM, pipeline

# num_cores should be chosen based on the instance: an inf2.24xlarge has 6
# Neuron processors (two cores each), so 12 cores total. Larger models need
# more cores. You can make your model smaller by changing fp16 to f8. Some
# models may require num_cores to be a power of 2.
compiler_args = {"num_cores": 2, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

model_to_test = "deepseek-ai/deepseek-coder-6.7b-instruct"

model = NeuronModelForCausalLM.from_pretrained(model_to_test, export=True, **compiler_args, **input_shapes)

tokenizer = AutoTokenizer.from_pretrained(model_to_test)

p = pipeline('text-generation', model, tokenizer)
print(p("My favorite place on earth is", max_new_tokens=64, do_sample=True, top_k=50))
```

Output:

Setting 'pad_token_id' to 'eos_token_id':32021 for open-end generation.
2024-Jan-30 15:04:44.0384 58491:62420 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-Jan-30 15:04:44.0384 58491:62420 [0] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?
[{'generated_text': 'My favorite place on earth is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is iis is is is is is is is is isis is is is is'}] 
michaelfeil commented 7 months ago

Could this be an effect of rope scaling?
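
If so, one way to test the hypothesis is to re-export the model without the scaling request (a hedged sketch, not a confirmed fix; `local_dir` is a hypothetical path, and removing `rope_scaling` will degrade long-context quality even if it restores short-context output):

```python
# Diagnostic sketch: save a local copy of the checkpoint with the rope_scaling
# request removed, re-export it for Neuron, and re-run the reproduction script.
# If short-context generations become coherent, the Neuron export path is
# likely ignoring or mishandling rope_scaling.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
local_dir = "./deepseek-coder-6.7b-no-rope-scaling"  # hypothetical path

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = None  # drop the scaling request for this experiment

AutoModelForCausalLM.from_pretrained(model_id, config=config).save_pretrained(local_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(local_dir)

# Then export as before, but from the local copy:
# NeuronModelForCausalLM.from_pretrained(local_dir, export=True, ...)
```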

HuggingFaceDocBuilderDev commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

cszhz commented 3 months ago

I'm hitting the same issue. Are there any updates?
