paperswithcode / galai

Model API for GALACTICA
Apache License 2.0

Problem while running inference with the model #83

Open · ra-MANUJ-an opened 10 months ago

ra-MANUJ-an commented 10 months ago

Hi all, I'm trying to run inference with the galactica-6.7B model, but errors keep popping up after a few examples and I'm not sure what to do. Can anyone take a look and tell me what's going wrong?

The following is the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[24], line 13
     10 input_text = prompt
     11 input_ids = transformers_tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
---> 13 outputs = transformers_model.generate(input_ids, max_new_tokens=128)
     14 decoded_output = transformers_tokenizer.decode(outputs[0]).strip()   
     16 alpaca_finetuned_examples.append(decoded_output)

File ~/third/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/third/lib/python3.11/site-packages/transformers/generation/utils.py:1518, in GenerationMixin.generate(self, inputs, max_length, min_length, do_sample, early_stopping, num_beams, temperature, penalty_alpha, top_k, top_p, typical_p, repetition_penalty, bad_words_ids, force_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, encoder_no_repeat_ngram_size, num_return_sequences, max_time, max_new_tokens, decoder_start_token_id, use_cache, num_beam_groups, diversity_penalty, prefix_allowed_tokens_fn, logits_processor, renormalize_logits, stopping_criteria, constraints, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, forced_bos_token_id, forced_eos_token_id, remove_invalid_values, synced_gpus, exponential_decay_length_penalty, suppress_tokens, begin_suppress_tokens, forced_decoder_ids, **model_kwargs)
   1513         raise ValueError(
   1514             f"num_return_sequences has to be 1, but is {num_return_sequences} when doing greedy search."
   1515         )
   1517     # 10. run greedy search
-> 1518     return self.greedy_search(
   1519         input_ids,
   1520         logits_processor=logits_processor,
   1521         stopping_criteria=stopping_criteria,
   1522         pad_token_id=pad_token_id,
   1523         eos_token_id=eos_token_id,
   1524         output_scores=output_scores,
   1525         return_dict_in_generate=return_dict_in_generate,
   1526         synced_gpus=synced_gpus,
   1527         **model_kwargs,
   1528     )
   1530 elif is_contrastive_search_gen_mode:
   1532     if num_return_sequences > 1:

File ~/third/lib/python3.11/site-packages/transformers/generation/utils.py:2285, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
   2282 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2284 # forward pass to get next token
-> 2285 outputs = self(
   2286     **model_inputs,
   2287     return_dict=True,
   2288     output_attentions=output_attentions,
   2289     output_hidden_states=output_hidden_states,
   2290 )
   2292 if synced_gpus and this_peer_finished:
   2293     continue  # don't waste resources running the code we don't need

File ~/third/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/third/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py:934, in OPTForCausalLM.forward(self, input_ids, attention_mask, head_mask, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    931 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    933 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 934 outputs = self.model.decoder(
    935     input_ids=input_ids,
    936     attention_mask=attention_mask,
    937     head_mask=head_mask,
    938     past_key_values=past_key_values,
    939     inputs_embeds=inputs_embeds,
    940     use_cache=use_cache,
    941     output_attentions=output_attentions,
    942     output_hidden_states=output_hidden_states,
    943     return_dict=return_dict,
    944 )
    946 logits = self.lm_head(outputs[0]).contiguous()
    948 loss = None

File ~/third/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/third/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py:640, in OPTDecoder.forward(self, input_ids, attention_mask, head_mask, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    637     attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.bool, device=inputs_embeds.device)
    638 pos_embeds = self.embed_positions(attention_mask, past_key_values_length)
--> 640 attention_mask = self._prepare_decoder_attention_mask(
    641     attention_mask, input_shape, inputs_embeds, past_key_values_length
    642 )
    644 if self.project_in is not None:
    645     inputs_embeds = self.project_in(inputs_embeds)

File ~/third/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py:539, in OPTDecoder._prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length)
    535 combined_attention_mask = None
    536 if input_shape[-1] > 1:
    537     combined_attention_mask = _make_causal_mask(
    538         input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length
--> 539     ).to(inputs_embeds.device)
    541 if attention_mask is not None:
    542     # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
    543     expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
    544         inputs_embeds.device
    545     )

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

and I have been using the following code:

import torch
from transformers import AutoTokenizer, OPTForCausalLM

# load the GALACTICA 6.7B checkpoint in fp16, sharded across the available GPUs
transformers_tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
transformers_model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", torch_dtype=torch.float16, device_map="auto")

# [START_REF] is GALACTICA's reference-start token, so this prompt asks the model to generate a citation
input_text = "The Transformer architecture [START_REF]"
input_ids = transformers_tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = transformers_model.generate(input_ids, max_new_tokens=20)
print(transformers_tokenizer.decode(outputs[0]))
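
As the error message notes, device-side asserts are reported asynchronously, so the stack trace above can point at the wrong call. A minimal sketch of a debugging rerun, assuming the same model and prompt as above, is to force synchronous kernel launches so the assert surfaces at the real failing operation:

import os

# Set before torch creates its CUDA context so kernel launches are synchronous
# and the assert is raised at the actual failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained(
    "facebook/galactica-6.7b", torch_dtype=torch.float16, device_map="auto"
)

input_ids = tokenizer("The Transformer architecture [START_REF]", return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

Running the same failing prompt once on CPU (torch_dtype=torch.float32 and no device_map), if memory allows, is another option: an out-of-range token or position index then shows up as a plain IndexError instead of a device-side assert.
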
mkardas commented 10 months ago

Can you check if your prompt is shorter than 2047 tokens?
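
For example, a minimal check, reusing the transformers_tokenizer and prompt variables from the snippet above and assuming the 2048-position context window of these checkpoints:

# Count the tokens in the prompt; the prompt plus max_new_tokens has to fit in
# the model's context window (assumed to be 2048 positions here).
prompt_len = transformers_tokenizer(prompt, return_tensors="pt").input_ids.shape[-1]
print(prompt_len, prompt_len + 128)  # 128 = max_new_tokens used in the failing cell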

ra-MANUJ-an commented 10 months ago

@mkardas I shortened the prompt this time; it was exceeding the limit before. It ran for a few more iterations and then stopped with the same error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Could there be a problem with the versions of PyTorch, CUDA, or Python? Is it possible to share the versions of the dependencies you used?

mkardas commented 10 months ago

You can run python -m torch.utils.collect_env as well as pip list. What's the prompt's length in tokens now? By "ran for a few more iterations", do you mean with the exact same prompt, or with the generations appended to the prompt?
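
If the loop does append each generation to the next prompt, the input will eventually grow past the context window again. A sketch of one way to guard against that (hypothetical loop body, reusing the variable names from the snippet above and assuming a 2048-token window):

MAX_CONTEXT = 2048        # assumed GALACTICA/OPT context window
MAX_NEW_TOKENS = 128

input_ids = transformers_tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Drop the oldest tokens so the prompt plus the newly generated tokens always fits.
if input_ids.shape[-1] + MAX_NEW_TOKENS > MAX_CONTEXT:
    input_ids = input_ids[:, -(MAX_CONTEXT - MAX_NEW_TOKENS):]

outputs = transformers_model.generate(input_ids, max_new_tokens=MAX_NEW_TOKENS)
decoded_output = transformers_tokenizer.decode(outputs[0]).strip()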