huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

using SFT for finetuning Llama2, TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] #27697

Closed: Sosycs closed this issue 9 months ago

Sosycs commented 10 months ago

Hello, I am experiencing the following error:

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-35-bf084d2d746f>](https://localhost:8080/#) in <cell line: 15>()
     13 example_encoded = tokenizer(example)
     14 
---> 15 collator([example_encoded])

5 frames
[/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py](https://localhost:8080/#) in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    423         )
    424 
--> 425         encodings = self._tokenizer.encode_batch(
    426             batch_text_or_text_pairs,
    427             add_special_tokens=add_special_tokens,

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

My Code is:

response_template = "Answer: [/INST]"

# Work-around for context-sensitive tokenizers
response_template_tokenized = tokenizer.encode(f"\n{response_template}", add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template=response_template_tokenized, tokenizer=tokenizer)

example = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> 
Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. 
Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement 
Answer: [/INST]"""

example_encoded = tokenizer(example)

collator([example_encoded])

I have tried `encode_plus` and splitting on "\n" before tokenizing, but neither solved the error.
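
For reference, the `[2:]` slice above follows the TRL docs' workaround for context-sensitive tokenizers: it drops the ids contributed by the prepended "\n" so that the template ids match how "Answer: [/INST]" is tokenized mid-sequence rather than at the start of a string. A minimal sketch of the difference (exact ids depend on the tokenizer):

# SentencePiece-style tokenizers can encode the same text differently at the
# start of a string than after other text; comparing the two shows why the
# leading-newline trick is needed.
standalone = tokenizer.encode("Answer: [/INST]", add_special_tokens=False)
in_context = tokenizer.encode("\nAnswer: [/INST]", add_special_tokens=False)[2:]
print(standalone)
print(in_context)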

hi-sushanta commented 10 months ago

Hi @Sosycs, your code runs without errors on my local machine, so I suspect a version mismatch. I recommend updating your packages and checking them against the versions below.

Package-Version:

Transformers: 4.36.0.dev0
Trl: 0.7.4
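
A quick way to confirm what is installed (minimal sketch):

import transformers
import trl

# Both libraries expose their version string at the package level.
print(transformers.__version__, trl.__version__)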

My-Code:

from transformers.models.llama import LlamaTokenizerFast
from trl import DataCollatorForCompletionOnlyLM
tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
response_template = "Answer: [/INST]"

# Work-around for context-sensitive tokenizers
response_template_tokenized = tokenizer.encode(f"\n{response_template}", add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template=response_template_tokenized, tokenizer=tokenizer)

example = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> 
Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. 
Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement 
Answer: [/INST]"""
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
example_encoded = tokenizer(example)
print(collator([example_encoded]))

Output:

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'input_ids': tensor([[    1,     1, 29961, 25580, 29962,  3532, 14816, 29903,  6778,  3529,
          1831,   278,  1959,  1234,   515,   278,  2183,  2999, 25186,  2729,
           373,   278,  2183, 15228, 29901,   529,   829, 14816, 29903,  6778,
         29871,    13,  2677, 29901, 27782,  7002,   338,  1790,  1134,   310,
         28310, 14826,   292, 29889,  2973,   633,  3417,   291, 29892,   697,
          7679,   289, 17204,  2750,  1790,  7679, 29889,  4989, 17037,  9946,
           633,  3417,   291,   408,   263,  7679,   260,  3774,   793,  1623,
           263, 24968, 29889, 14104,   292,  4094,  9946,   633,  3417,   291,
           372, 16229, 23150,   577,   393,   896,   289,  3427,  2750,   697,
          1790,   313, 13080,   545, 29871, 29929, 29889, 29941,   467,  3767,
           549,  8805, 29879,  4556,   633,  3417,   291,   491,  1999,   579,
           292, 11982,  2750,  7679, 28001, 29889,  9788, 29892,   278, 14890,
           297, 14751,   455,   414,  4556,   633,  3417,   291, 29889, 26005,
           778,   310,  7679, 15685,   297, 14890,   472,   278,  5970,   310,
           263, 14751, 13241, 24559,   412,  2750,   278,  7679,  2400, 29889,
           960,   366,   505,  3926, 16531, 25695, 12917,   470,   282,   774,
          7586,   515,   263,  4840, 29892,   366,   505, 16277,   287,   278,
           664,   310,   633,  3417,   291, 29889, 29871,    13, 16492, 29901,
          4989, 17037,  9946,   604,   359,   291,   491,   599,   310,   278,
          1494,  5174, 25186,  5919, 29909, 29897, 14751,   455,   414,   313,
         29933, 29897,  8401,  4799,   313, 29907, 29897,  4972,   292,  4094,
           313, 29928, 29897,  4158, 10298, 29871,    13, 22550, 29901,   518,
         29914, 25580, 29962]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100]])}
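
One caveat worth adding: `add_special_tokens` grows the vocabulary, so if this tokenizer is paired with a model for fine-tuning, the embedding matrix should be resized to match. A minimal sketch, assuming a hypothetical `model` variable holding the causal LM (not part of the snippet above):

# 'model' is assumed to be the Llama model being fine-tuned; it is not
# created anywhere in this issue. Resizing keeps the embedding table in
# sync with the enlarged vocabulary, and the pad id should be set explicitly.
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
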
Sosycs commented 10 months ago

Thank you very much @hi-sushanta, updating the libraries worked! But I have another question regarding the labels, using this example:

example = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Oceanography is the study of the oceans. The word oceanology might be more accurate, since ology is the study of. Graph is to write and refers to map making. But mapping the oceans is how oceanography started. More than 70% of Earths surface is covered with water. Almost all of that water is in the oceans. Scientists have visited the deepest parts of the ocean in submarines. Remote vehicles go where humans cant. Yet much of the ocean remains unexplored. Some people call the ocean the last frontier. Humans have had a big impact on the oceans. Populations of fish and other marine species have been overfished. Contaminants are polluting the waters. Global warming is melting the thick ice caps and warming the water. Warmer water expands and, along with water from the melting ice caps, causes sea levels to rise. There are many branches of oceanography. Physical oceanography is the study of water movement, like waves and ocean currents (Figure 1.13). Marine geology looks at rocks and structures in the ocean basins. Chemical oceanography studies the natural elements in ocean water. Marine biology looks at marine life. Question: Chemical oceanography is the study of the Options:(A) human pollution of ocean water (B) naturally occurring elements in ocean water (C) rising levels of ocean water (D) rocks on the ocean floor Answer: [/INST] B </s>"""

the labels are all -100 no matter what comes after the response_template, while with the original example it works if I add the answer after the response_template. Do I need to rewrite the example in a different way?
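
For what it's worth, one way to debug this: DataCollatorForCompletionOnlyLM searches for the tokenized response template inside each example's input_ids, and when it cannot find an exact match it sets every label for that example to -100. Since the single-line example above has no newline before "Answer:", the surrounding tokenization may differ from the template built with the "\n" workaround. A minimal sketch to check whether the template ids actually occur in the example (find_sublist is a hypothetical helper, not a TRL function):

# Check whether the template ids appear verbatim inside the example ids.
# If this prints False, the template tokenizes differently in context,
# and the collator masks the whole example with -100.
def find_sublist(haystack, needle):
    return any(haystack[i:i + len(needle)] == needle
               for i in range(len(haystack) - len(needle) + 1))

ids = tokenizer(example)["input_ids"]
print(find_sublist(ids, response_template_tokenized))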

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.