openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

ValueError: not enough values to unpack (expected 2, got 1)->for token, rank in (line.split() for line in contents.splitlines() if line) #136

Closed pandaGost closed 1 year ago

pandaGost commented 1 year ago

Version: tiktoken==0.4.0

Test code:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    assert enc.decode(enc.encode("hello world")) == "hello world"

    # To get the tokeniser corresponding to a specific model in the OpenAI API:
    enc = tiktoken.encoding_for_model("gpt-4")
    print(enc)

Error log:

Traceback (most recent call last):
  File "xxx/main.py", line 154, in <module>
    enc = tiktoken.get_encoding("cl100k_base")
  File "xxx/venv/lib/python3.9/site-packages/tiktoken/registry.py", line 63, in get_encoding
    enc = Encoding(**constructor())
  File "xxxt/venv/lib/python3.9/site-packages/tiktoken_ext/openai_public.py", line 64, in cl100k_base
    mergeable_ranks = load_tiktoken_bpe(
  File "xxx/venv/lib/python3.9/site-packages/tiktoken/load.py", line 117, in load_tiktoken_bpe
    return {
  File "xxx/venv/lib/python3.9/site-packages/tiktoken/load.py", line 119, in <dictcomp>
    for token, rank in (line.split() for line in contents.splitlines() if line)
ValueError: not enough values to unpack (expected 2, got 1)

The error occurs at line 119 of 'tiktoken/load.py':

def load_tiktoken_bpe(tiktoken_bpe_file: str) -> dict[bytes, int]:
    # NB: do not add caching to this function
    contents = read_file_cached(tiktoken_bpe_file)
    return {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in contents.splitlines() if line)
    }

The 'tiktoken_bpe_file' is "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken". I downloaded this file and found a blank line at the end of it. I don't know whether this blank line is causing the error; I hope someone can confirm.

PS: The same code worked when I tested it yesterday, but it raised this error today.

Thanks!

hauntsaninja commented 1 year ago

The blank line doesn't make a difference. I knew of a way this could happen in tiktoken 0.3.x, but not with tiktoken 0.4. Does it reproduce if you export TIKTOKEN_CACHE_DIR=''?
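
The point about the blank line can be checked without tiktoken or a network connection. The following is a minimal sketch that applies the same comprehension quoted from load.py to in-memory bytes; the helper name parse_bpe and the two-token sample data are made up for illustration and are not part of the library.

```python
import base64


def parse_bpe(contents: bytes) -> dict[bytes, int]:
    # Same comprehension as in the load.py excerpt, applied to in-memory bytes.
    return {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in contents.splitlines() if line)
    }


# A well-formed file that ends with a blank line: the "if line" filter skips it.
good = b"aGVsbG8= 0\nd29ybGQ= 1\n\n"
ranks = parse_bpe(good)
print(ranks)

# A file whose last line lost its rank (for example, a truncated download).
truncated = b"aGVsbG8= 0\nd29ybGQ="
try:
    parse_bpe(truncated)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

The error in the traceback therefore points to a non-empty line with a single field, which is consistent with damaged file contents rather than with a trailing blank line.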

pandaGost commented 1 year ago

> The blank line doesn't make a difference. I knew of a way this could happen in tiktoken 0.3.x, but not with tiktoken 0.4. Does it reproduce if you export TIKTOKEN_CACHE_DIR=''?

My project's environment variables contain neither "TIKTOKEN_CACHE_DIR" nor "DATA_GYM_CACHE_DIR".

Anyway, I solved the problem.

First, I set a breakpoint in debug mode in:

# tiktoken/load.py
def read_file_cached(blobpath: str) -> bytes:
    if "TIKTOKEN_CACHE_DIR" in os.environ:
        cache_dir = os.environ["TIKTOKEN_CACHE_DIR"]
    elif "DATA_GYM_CACHE_DIR" in os.environ:
        cache_dir = os.environ["DATA_GYM_CACHE_DIR"]
    else:
        cache_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")

    if cache_dir == "":
        # disable caching
        return read_file(blobpath)
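
To see which directory applies without attaching a debugger, the branch logic in the excerpt can be mirrored in a few lines. This is only a sketch of that excerpt: resolve_cache_dir is a hypothetical helper, not a tiktoken function, and it may not match other tiktoken versions.

```python
import os
import tempfile
from typing import Mapping, Optional


def resolve_cache_dir(env: Mapping[str, str]) -> Optional[str]:
    """Mirror the excerpt: return the cache directory, or None if caching is disabled."""
    if "TIKTOKEN_CACHE_DIR" in env:
        cache_dir = env["TIKTOKEN_CACHE_DIR"]
    elif "DATA_GYM_CACHE_DIR" in env:
        cache_dir = env["DATA_GYM_CACHE_DIR"]
    else:
        cache_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")
    # An empty string means "disable caching" in the excerpt.
    return cache_dir or None


print(resolve_cache_dir(os.environ))
```

With neither variable set, this prints a "data-gym-cache" directory under the system temporary directory, which matches the kind of path found with the debugger.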

This gave me the value of "cache_dir" ('/var/folders/f2/vnpz2j516rz3wddslckkw_2w0000gn/T/data-gym-cache'). I opened that directory, found a file named with a hash value, and deleted it.

I reran the test code and no error was reported.
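
The manual steps above could be sketched as a small script that lists cache files which do not parse as "base64-token rank" lines. The helper names are hypothetical, and the default path assumes neither cache environment variable is set. The cache may also hold files in other formats, which this check would flag as well, so deletion is left commented out; as the rerun above shows, a deleted cache file is fetched again on the next run.

```python
import base64
import binascii
import os
import tempfile


def is_valid_bpe(contents: bytes) -> bool:
    """Return True if every non-empty line looks like '<base64 token> <integer rank>'."""
    for line in contents.splitlines():
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2:
            return False
        try:
            base64.b64decode(fields[0], validate=True)
            int(fields[1])
        except (binascii.Error, ValueError):
            return False
    return True


def find_suspect_cache_files(cache_dir: str) -> list[str]:
    """List regular files in cache_dir that fail the is_valid_bpe check."""
    suspects: list[str] = []
    if not os.path.isdir(cache_dir):
        return suspects
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            if not is_valid_bpe(f.read()):
                suspects.append(path)
    return suspects


default_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")
for path in find_suspect_cache_files(default_dir):
    print("suspect cache file:", path)
    # os.remove(path)  # uncomment to delete after inspecting the file
```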

Thank you for your answer! @hauntsaninja

DecHzy commented 9 months ago

Deleting the cached file works for me too, but I don't know why.