langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

parser=LanguageParser(language=Language.C, parser_threshold=800) error in tree_sitter_languages.core.get_language #22192

Closed liangyong928 closed 4 months ago

liangyong928 commented 4 months ago

Example Code

import os
from git import Repo  # pip install gitpython
from langchain.text_splitter import Language
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.language.language_parser import LanguageParser

repo_path = "iperf"
if not os.path.exists(repo_path):
    repo = Repo.clone_from(
        "https://github.com/esnet/iperf", to_path=repo_path
    )
path_print=repo_path + "/src"
print(path_print)
loader = GenericLoader.from_filesystem(
    repo_path + "/src",
    glob="**/*",
    suffixes=[".c"],
    parser=LanguageParser(language=Language.C, parser_threshold=500),
)
documents = loader.load()
print(len(documents))  

Error Message and Stack Trace (if applicable)

& D:/Python312/python.exe f:/code/python_pj/aigc_c.py
iperf/src
Traceback (most recent call last):
  File "f:\code\python_pj\aigc_c.py", line 20, in <module>
    documents = loader.load()
                ^^^^^^^^^^^^^
  File "D:\Python312\Lib\site-packages\langchain_core\document_loaders\base.py", line 29, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Python312\Lib\site-packages\langchain_community\document_loaders\generic.py", line 116, in lazy_load
    yield from self.blob_parser.lazy_parse(blob)
  File "D:\Python312\Lib\site-packages\langchain_community\document_loaders\parsers\language\language_parser.py", line 214, in lazy_parse
    if not segmenter.is_valid():
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Python312\Lib\site-packages\langchain_community\document_loaders\parsers\language\tree_sitter_segmenter.py", line 30, in is_valid
    language = self.get_language()
               ^^^^^^^^^^^^^^^^^^^
  File "D:\Python312\Lib\site-packages\langchain_community\document_loaders\parsers\language\c.py", line 30, in get_language
    return get_language("c")
           ^^^^^^^^^^^^^^^^^
  File "tree_sitter_languages\core.pyx", line 14, in tree_sitter_languages.core.get_language
TypeError: __init__() takes exactly 1 argument (2 given)

Description

I'm trying to parse C source files by passing language=Language.C to LanguageParser:

loader = GenericLoader.from_filesystem(
    repo_path + "/src",
    glob="**/*",
    suffixes=[".c"],
    parser=LanguageParser(language=Language.C, parser_threshold=500),
)

Instead, the following error occurs:

  File "tree_sitter_languages\core.pyx", line 14, in tree_sitter_languages.core.get_language
TypeError: __init__() takes exactly 1 argument (2 given)
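Since the traceback ends inside tree_sitter_languages, it can help to confirm which versions of the packages involved are actually installed before digging further. The following is a minimal sketch using only the standard library; the helper names installed_version and report are hypothetical, and the package list is simply taken from the traceback above.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages that appear in the traceback; adjust as needed.
PACKAGES = ("tree-sitter", "tree-sitter-languages", "langchain-community")


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages=PACKAGES):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, ver in report().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it in the same interpreter that runs the failing script (here D:/Python312/python.exe) avoids reading versions from a different environment.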

System Info

F:>pip freeze absl-py==2.1.0 aiohttp==3.9.5 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.3.0 asgiref==3.8.1 asttokens==2.4.1 astunparse==1.6.3 attrs==23.2.0 backoff==2.2.1 bcrypt==4.1.3 build==1.2.1 cachetools==5.3.3 certifi==2024.2.2 charset-normalizer==3.3.2 chroma-hnswlib==0.7.3 chromadb==0.5.0 click==8.1.7 cloudpickle==3.0.0 colorama==0.4.6 coloredlogs==15.0.1 comm==0.2.1 contourpy==1.2.0 cycler==0.12.1 dataclasses-json==0.6.6 debugpy==1.8.0 decorator==5.1.1 Deprecated==1.2.14 eli5==0.13.0 executing==2.0.1 fastapi==0.110.3 filelock==3.13.3 flatbuffers==24.3.25 fonttools==4.47.0 frozenlist==1.4.1 fsspec==2024.3.1 gast==0.5.4 gitdb==4.0.11 GitPython==3.1.43 google-auth==2.29.0 google-pasta==0.2.0 googleapis-common-protos==1.63.0 graphviz==0.20.3 greenlet==3.0.3 grpcio==1.62.1 h11==0.14.0 h5py==3.11.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.22.2 humanfriendly==10.0 idna==3.7 importlib-metadata==7.0.0 importlib_resources==6.4.0 ipykernel==6.28.0 ipython==8.20.0 jedi==0.19.1 Jinja2==3.1.3 joblib==1.3.2 jsonpatch==1.33 jsonpointer==2.4 jsonschema==4.22.0 jsonschema-specifications==2023.12.1 jupyter_client==8.6.0 jupyter_core==5.7.1 keras==3.2.1 kiwisolver==1.4.5 kubernetes==29.0.0 langchain==0.2.1 langchain-cli==0.0.23 langchain-community==0.2.1 langchain-core==0.2.1 langchain-text-splitters==0.2.0 langserve==0.2.1 langsmith==0.1.63 libclang==18.1.1 libcst==1.4.0 llvmlite==0.42.0 Markdown==3.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 marshmallow==3.21.2 matplotlib==3.8.2 matplotlib-inline==0.1.6 mdurl==0.1.2 ml-dtypes==0.3.2 mmh3==4.1.0 monotonic==1.6 mpmath==1.3.0 multidict==6.0.5 mypy-extensions==1.0.0 namex==0.0.8 nest-asyncio==1.5.8 networkx==3.3 numba==0.59.1 numpy==1.26.4 oauthlib==3.2.2 ollama==0.2.0 onnxruntime==1.18.0 opentelemetry-api==1.24.0 opentelemetry-exporter-otlp-proto-common==1.24.0 opentelemetry-exporter-otlp-proto-grpc==1.24.0 opentelemetry-instrumentation==0.45b0 opentelemetry-instrumentation-asgi==0.45b0 
opentelemetry-instrumentation-fastapi==0.45b0 opentelemetry-proto==1.24.0 opentelemetry-sdk==1.24.0 opentelemetry-semantic-conventions==0.45b0 opentelemetry-util-http==0.45b0 opt-einsum==3.3.0 optree==0.11.0 orjson==3.10.3 overrides==7.7.0 packaging==23.2 pandas==2.2.1 parso==0.8.3 patsy==0.5.6 pgmpy==0.1.25 pillow==10.2.0 platformdirs==4.1.0 posthog==3.5.0 prompt-toolkit==3.0.43 protobuf==4.25.3 psutil==5.9.7 pure-eval==0.2.2 pyasn1==0.6.0 pyasn1_modules==0.4.0 pydantic==2.7.1 pydantic_core==2.18.2 pygame==2.5.2 Pygments==2.17.2 pyparsing==3.1.1 PyPika==0.48.9 pyproject-toml==0.0.10 pyproject_hooks==1.1.0 pyreadline3==3.4.1 python-dateutil==2.8.2 python-dotenv==1.0.1 pytz==2024.1 pywin32==306 PyYAML==6.0.1 pyzmq==25.1.2 referencing==0.35.1 regex==2023.12.25 requests==2.31.0 requests-oauthlib==2.0.0 rich==13.7.1 rpds-py==0.18.1 rsa==4.9 safetensors==0.4.2 scikit-learn==1.4.2 scipy==1.12.0 setuptools==69.2.0 shap==0.45.0 shellingham==1.5.4 six==1.16.0 slicer==0.0.7 smmap==5.0.1 sniffio==1.3.1 SQLAlchemy==2.0.30 sse-starlette==1.8.2 stack-data==0.6.3 starlette==0.37.2 statsmodels==0.14.2 sympy==1.12 tabulate==0.9.0 tenacity==8.3.0 tensorboard==2.16.2 tensorboard-data-server==0.7.2 tensorflow==2.16.1 tensorflow-intel==2.16.1 termcolor==2.4.0 threadpoolctl==3.4.0 tokenizers==0.15.2 toml==0.10.2 tomlkit==0.12.5 torch==2.2.2 torch-tb-profiler==0.4.3 torchaudio==2.2.2 torchvision==0.17.2 tornado==6.4 tqdm==4.66.2 traitlets==5.14.1 transformers==4.39.3 tree-sitter==0.22.3 tree-sitter-languages==1.10.2 typer==0.9.4 typing-inspect==0.9.0 typing_extensions==4.11.0 tzdata==2024.1 urllib3==2.2.1 uvicorn==0.23.2 watchfiles==0.21.0 wcwidth==0.2.13 websocket-client==1.8.0 websockets==12.0 Werkzeug==3.0.2 wheel==0.43.0 wrapt==1.16.0 yarl==1.9.4 zipp==3.18.2 F:>python -m langchain_core.sys_info

System Information

OS: Windows
OS Version: 10.0.22631
Python Version: 3.12.1 (tags/v3.12.1:2305ca5, Dec 7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]

Package Information

langchain_core: 0.2.1
langchain: 0.2.1
langchain_community: 0.2.1
langsmith: 0.1.63
langchain_cli: 0.0.23
langchain_text_splitters: 0.2.0
langserve: 0.2.1

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph

wulifu2hao commented 4 months ago

I think this change in py-tree-sitter is what caused the problem. As a quick fix, you could try installing tree-sitter 0.21.3. As for a proper fix, I'm not sure; maybe lock the version of tree-sitter in the pyproject.toml of the community lib?

wulifu2hao commented 4 months ago

In fact, the pyproject file of the langchain-community lib already restricts the tree-sitter package to versions between 0.20.2 and 0.21.0 (link). It can be installed through "extended_testing".
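To check an installed version against such a range before running the loader, something like the following sketch could be used. The bounds are the ones quoted above; whether the upper bound is inclusive or exclusive is not stated in this thread, so the exclusive comparison here is an assumption. parse_version and in_range are hypothetical helpers, and the parsing is deliberately simplistic (it only looks at the first three numeric components).

```python
import re


def parse_version(text):
    """Turn '0.22.3' into (0, 22, 3); missing components default to 0."""
    numbers = re.findall(r"\d+", text)[:3]
    numbers += ["0"] * (3 - len(numbers))
    return tuple(int(n) for n in numbers)


def in_range(installed, lower, upper):
    """True if lower <= installed < upper (upper bound exclusive: an assumption)."""
    return parse_version(lower) <= parse_version(installed) < parse_version(upper)


# Bounds as quoted in the comment above.
LOWER, UPPER = "0.20.2", "0.21.0"

if __name__ == "__main__":
    for candidate in ("0.20.4", "0.22.3"):
        status = "inside" if in_range(candidate, LOWER, UPPER) else "outside"
        print(f"tree-sitter {candidate} is {status} [{LOWER}, {UPPER})")
```

The 0.22.3 value is the tree-sitter version from the pip freeze output in the original report. For anything beyond a quick check, a real version parser such as the one in the packaging library would be a safer choice than this regex.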

liangyong928 commented 4 months ago
F:\>pip uninstall tree-sitter
F:\>pip install tree-sitter==0.21.3

The error has been resolved. Thanks.

liangyong928 commented 4 months ago

However, when I continue and run:

from langchain.text_splitter import RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.C, chunk_size=500, chunk_overlap=50
)

A new error:

D:\Python312\Lib\site-packages\tree_sitter\__init__.py:36: FutureWarning: Language(path, name) is deprecated. Use Language(ptr, name) instead.
  warn("{} is deprecated. Use {} instead.".format(old, new), FutureWarning)
Traceback (most recent call last):
  File "f:\code\python_pj\aigc_c.py", line 26, in <module>
    python_splitter = RecursiveCharacterTextSplitter.from_language(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Python312\Lib\site-packages\langchain_text_splitters\character.py", line 116, in from_language
    separators = cls.get_separators_for_language(language)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Python312\Lib\site-packages\langchain_text_splitters\character.py", line 631, in get_separators_for_language
    raise ValueError(
ValueError: Language Language.C is not supported! Please choose from [<Language.CPP: 'cpp'>, <Language.GO: 'go'>, <Language.JAVA: 'java'>, <Language.KOTLIN: 'kotlin'>, <Language.JS: 'js'>, <Language.TS: 'ts'>, <Language.PHP: 'php'>, <Language.PROTO: 'proto'>, <Language.PYTHON: 'python'>, <Language.RST: 'rst'>, <Language.RUBY: 'ruby'>, <Language.RUST: 'rust'>, <Language.SCALA: 'scala'>, <Language.SWIFT: 'swift'>, <Language.MARKDOWN: 'markdown'>, <Language.LATEX: 'latex'>, <Language.HTML: 'html'>, <Language.SOL: 'sol'>, <Language.CSHARP: 'csharp'>, <Language.COBOL: 'cobol'>, <Language.C: 'c'>, <Language.LUA: 'lua'>, <Language.PERL: 'perl'>, <Language.HASKELL: 'haskell'>]
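Note that the list in the error message contains Language.C itself, which suggests the enum member exists but this version of get_separators_for_language has no separators registered for it. A possible workaround, sketched below, is to try the requested language first and fall back to Language.CPP. Treating the C++ separators as good enough for C code is an assumption, not something confirmed in this thread, and first_supported and build_c_splitter are hypothetical helper names rather than part of LangChain.

```python
def first_supported(factory, candidates):
    """Return (candidate, factory(candidate)) for the first candidate
    whose factory call does not raise ValueError."""
    last_error = None
    for candidate in candidates:
        try:
            return candidate, factory(candidate)
        except ValueError as exc:
            last_error = exc
    raise ValueError(f"none of {list(candidates)!r} is supported") from last_error


def build_c_splitter(chunk_size=500, chunk_overlap=50):
    """Build a splitter for C code, falling back to the C++ separators."""
    # Imported lazily so first_supported stays usable without LangChain.
    from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

    def factory(language):
        return RecursiveCharacterTextSplitter.from_language(
            language=language,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        )

    return first_supported(factory, [Language.C, Language.CPP])
```

Calling build_c_splitter() returns a pair of the language that was actually used and the splitter, so the caller can log whether the fallback was taken. On a version where Language.C is handled, the first candidate succeeds and the fallback is never used.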

madhavatreplit commented 4 months ago

Can reproduce the new ValueError: Language Language.C is not supported! error with:

RecursiveCharacterTextSplitter.from_language(
    language=Language.C, chunk_size=500, chunk_overlap=50
)

Any fixes?

W-Wuxian commented 4 months ago

Same issue here. I also tried language="c" instead of language=Language.C. My environment:

  1. Linux Mint 21.3
  2. langchain 0.2.1 pypi_0 pypi
  3. langchain-community 0.2.1 pypi_0 pypi
  4. langchain-core 0.2.1 pypi_0 pypi
  5. langchain-experimental 0.0.59 pypi_0 pypi
  6. langchain-text-splitters 0.2.0 pypi_0 pypi