killian-mahe opened this issue 2 weeks ago
Hey @killian-mahe, instead of using TokenTextSplitter.from_huggingface_tokenizer(), directly create an instance of TokenTextSplitter and pass the necessary parameters, like:
splitter = TokenTextSplitter(
    encoding_name="gpt2",
    chunk_size=chunk_size,
    chunk_overlap=0,
    length_function=len,
)
Hope this solves the problem.
Hi @Swastik-Swarup-Dash,
Thanks for the help, but it doesn't seem to work: the length_function is never used in split_text_on_tokens. Instead, it uses the tokenizer created in the __init__ function from the encoding_name or model_name argument.
def split_text(self, text: str) -> List[str]:
    def _encode(_text: str) -> List[int]:
        # self._tokenizer is the tiktoken encoding built in __init__ from
        # encoding_name / model_name; self._length_function is never consulted.
        return self._tokenizer.encode(
            _text,
            allowed_special=self._allowed_special,
            disallowed_special=self._disallowed_special,
        )

    tokenizer = Tokenizer(
        chunk_overlap=self._chunk_overlap,
        tokens_per_chunk=self._chunk_size,
        decode=self._tokenizer.decode,
        encode=_encode,
    )

    return split_text_on_tokens(text=text, tokenizer=tokenizer)
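In the meantime, one possible workaround is to mirror what split_text does but with the HuggingFace tokenizer's encode/decode in place of the tiktoken encoding. This is only a sketch: Tokenizer and split_text_on_tokens are internal helpers in langchain_text_splitters.base (not a documented public API), and bert-base-uncased plus the chunk sizes are placeholders.

from transformers import AutoTokenizer
from langchain_text_splitters.base import Tokenizer, split_text_on_tokens

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

# Build the same Tokenizer helper that split_text builds, but backed by
# the HuggingFace tokenizer instead of the default gpt2 tiktoken encoding.
hf_wrapper = Tokenizer(
    chunk_overlap=0,
    tokens_per_chunk=100,
    decode=lambda ids: hf_tokenizer.decode(ids, skip_special_tokens=True),
    encode=lambda t: hf_tokenizer.encode(t, add_special_tokens=False),
)

chunks = split_text_on_tokens(text="some long text " * 500, tokenizer=hf_wrapper)

Each chunk then spans tokens_per_chunk HuggingFace tokens (minus the overlap), which is the behavior from_huggingface_tokenizer() suggests but currently doesn't deliver.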
Hey, this is indeed an issue, which I have addressed in the PR. If you want, you can clone my branch and test it out. I tested my PR and it seems to be working; I ran the exact same code as you provided.
Example Code
Error Message and Stack Trace (if applicable)
No response
Description
I'm using the TokenTextSplitter from langchain-text-splitters to split my text on a specific number of tokens. The splitter does not use the tokenizer passed through from_huggingface_tokenizer() and uses the default one (gpt-2) instead.
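The issue's original example code was not preserved above, so the following is only a hypothetical reproduction consistent with the description; bert-base-uncased and the chunk sizes are placeholders:

from transformers import AutoTokenizer
from langchain_text_splitters import TokenTextSplitter

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder

# from_huggingface_tokenizer() only wires the tokenizer into length_function,
# which TokenTextSplitter.split_text never consults (see the snippet above),
# so chunk boundaries still follow the default gpt2 tiktoken encoding.
splitter = TokenTextSplitter.from_huggingface_tokenizer(
    hf_tokenizer,
    chunk_size=100,
    chunk_overlap=0,
)
chunks = splitter.split_text("some long text " * 500)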
System Info
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
async-timeout==4.0.3
attrs==24.2.0
backoff==2.2.1
black==24.1.1
cachetools==5.5.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
chevron==0.14.0
click==8.1.7
cryptography==43.0.0
dataclasses-json==0.6.7
Deprecated==1.2.14
distro==1.9.0
Events==0.5
exceptiongroup==1.2.2
fastapi==0.112.2
filelock==3.15.4
flake8==7.1.1
frozenlist==1.4.1
fsspec==2024.6.1
google-api-core==2.19.2
google-auth==2.34.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.65.0
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httpx==0.27.2
huggingface-hub==0.24.6
idna==3.8
iniconfig==2.0.0
jiter==0.5.0
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.1.8
langchain-community==0.0.38
langchain-core==0.3.6
langchain-openai==0.1.7
langchain-text-splitters==0.3.0
langfuse==2.20.3
langsmith==0.1.128
lxml==5.3.0
marshmallow==3.22.0
mccabe==0.7.0
multidict==6.0.5
mypy-extensions==1.0.0
numpy==1.26.4
openai==1.42.0
openmock==3.0.1
opensearch-py==2.7.1
orjson==3.10.7
packaging==23.2
pandas==2.2.2
parameterized==0.9.0
pathspec==0.12.1
pdfminer.six==20231228
pdfplumber==0.11.4
pikepdf==9.2.0
pillow==10.4.0
platformdirs==4.2.2
pluggy==1.5.0
proto-plus==1.24.0
protobuf==5.27.4
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycodestyle==2.12.1
pycparser==2.22
pydantic==2.8.2
pydantic-settings==2.4.0
pydantic_core==2.20.1
pyflakes==3.2.0
pypdf==4.3.1
pypdfium2==4.30.0
pysqlite3-binary==0.5.3.post1
pytest==8.3.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
python-ranges==0.2.1
pytz==2024.1
PyYAML==6.0.2
regex==2024.7.24
requests==2.32.3
requests-mock==1.11.0
rsa==4.9
safetensors==0.4.4
six==1.16.0
sniffio==1.3.1
SQLAlchemy==2.0.32
starlette==0.38.2
tenacity==8.5.0
text-generation==0.7.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli==2.0.1
tqdm==4.66.5
transformers==4.44.2
typing-inspect==0.9.0
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
uvicorn==0.30.6
wrapt==1.16.0
yarl==1.9.4