langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
94.57k stars 15.3k forks

Add support for a variety of languages to LanguageParser #11229

Closed ThatsJustCheesy closed 8 months ago

ThatsJustCheesy commented 1 year ago

Feature request

LanguageParser is a parser for Document Loaders that, given source code, splits each top-level function or class into separate documents. As stated in its documentation:

This approach can potentially improve the accuracy of QA models over source code.

Currently, the supported languages for code parsing are Python and JavaScript.

We would like to add support for additional languages, such as C, C++, Rust, Ruby, Perl, and so on.
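For readers unfamiliar with the splitting behavior described above, here is a minimal sketch using only the standard library `ast` module. It is not LangChain's implementation; the function name `split_top_level` and the sample input are made up for illustration.

```python
import ast


def split_top_level(source: str) -> list[str]:
    """Return the source text of each top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:  # tree.body holds top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks


# Hypothetical sample input.
SAMPLE = '''import os

def greet(name):
    return "Hello, " + name

class Counter:
    def __init__(self):
        self.n = 0

print(greet("world"))
'''

chunks = split_top_level(SAMPLE)
for chunk in chunks:
    print(chunk)
    print("---")
```

Each returned string would become its own document. Supporting another language means replacing `ast` with a parser for that language, which is where a generic library could help.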

Motivation

There is an open request for "CPP" (presumably C++) support. By integrating a generic parsing library (such as tree-sitter), we could make LanguageParser work with many more languages, and thus be more generally useful.

Your contribution

We intend to submit a pull request for this issue no later than mid-November, and likely sooner.

dosubot[bot] commented 1 year ago

🤖

That's great to hear that you're interested in contributing to LangChain! Adding support for a variety of languages to LanguageParser would definitely be a valuable addition to the framework. It's wonderful that you have a proposed solution in mind by integrating a generic parsing library like tree-sitter.

I encourage you to go ahead and submit a pull request for this feature request. The maintainers will review your contribution and provide feedback. Thank you for your willingness to contribute to the LangChain project!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

ThatsJustCheesy commented 1 year ago

@dosu-bot Hi again. We're just wondering if you are able to identify any past issues similar to this one? We would like to make sure our work addresses any previous concerns in this area.

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for your interest in contributing to the LangChain project. Your idea of integrating a generic parsing library like tree-sitter to add support for a variety of languages sounds promising.

Based on the information I have, there have been no previous attempts to integrate a generic parsing library like tree-sitter into the LangChain project. However, the current language support in the LangChain project includes a wide range of languages, including C++, Go, Java, Kotlin, JavaScript, TypeScript, PHP, Protocol Buffers, Python, reStructuredText, Ruby, Rust, Scala, Swift, Markdown, LaTeX, HTML, Solidity, and C#.

The current method used for parsing different languages in the LangChain project is through the LanguageParser class. This class uses the respective programming language syntax to parse the code. It loads each top-level function and class in the code into separate documents and generates an extra document containing the remaining top-level code that excludes the already segmented functions and classes.
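The "extra document containing the remaining top-level code" can be pictured with a small standard-library sketch. This is an illustration under assumptions, not the library's code: the function name `simplify_code` mirrors a method name mentioned later in this thread, and the `# Code for:` placeholder format is assumed here.

```python
import ast


def simplify_code(source: str) -> str:
    """Collapse each top-level function or class into a one-line placeholder."""
    tree = ast.parse(source)
    lines = source.splitlines()
    out = list(lines)  # entries set to None are dropped when joining
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            # Keep the header line as a comment so the remainder still reads well.
            out[start] = "# Code for: " + lines[start]
            for i in range(start + 1, node.end_lineno):
                out[i] = None
    return "\n".join(line for line in out if line is not None)


# Hypothetical sample input.
SAMPLE = '''import os

def greet(name):
    return "Hello, " + name

print(greet("world"))
'''

remainder = simplify_code(SAMPLE)
print(remainder)
```

The remainder keeps imports and module-level statements, while each segmented definition is reduced to a single marker line.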

The LanguageParser class uses a dictionary LANGUAGE_SEGMENTERS to map the language to its respective segmenter class. Currently, Python and JavaScript are supported, and their segmenters are PythonSegmenter and JavaScriptSegmenter respectively.

If you plan to integrate a generic parsing library like tree-sitter, you would likely need to modify the LanguageParser class and the LANGUAGE_SEGMENTERS dictionary to accommodate the new parsing method. You would also need to ensure that the new parsing method can handle the segmentation of code into functions and classes, as this is a key feature of the current parsing method.

Unfortunately, I couldn't find any specific method for adding a new language to the LangChain project. However, based on the current structure of the project, it seems that you would need to add the new language to the Language Enum class and provide a corresponding segmenter class.

I hope this information helps you in your contribution. If you have any more questions or need further clarification, feel free to ask.

Best regards.

Megabear137 commented 1 year ago

@dosu-bot Hey there. From the languages that LangChain supports but does not currently have a language parser for, which are the most commonly used?

devstein commented 1 year ago

Hey @Megabear137 as of now Dosu only responds to the author of the thread to reduce noise. We're working on a feature to let users "fork" issues for personal use.

Let me know if you have any other ideas/feedback!

ThatsJustCheesy commented 1 year ago

@dosu-bot Hey there. From the languages that LangChain supports but does not currently have a language parser for, which are the most commonly used?

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for your interest in contributing to the LangChain project. Your idea of expanding the LanguageParser to support more languages is a great initiative.

As per your request, I would like to provide some information on how to add a new language to the LanguageParser. The process involves creating a new segmenter for the language you want to add. This segmenter should be a class that can parse the syntax of the new language and should have methods is_valid(), extract_functions_classes(), and simplify_code(). Once the segmenter is created, you need to add a mapping from the new language to the new segmenter in the LANGUAGE_SEGMENTERS dictionary and a mapping from the file extension of the new language to the new language in the LANGUAGE_EXTENSIONS dictionary. You can find more details about this process in the LangChain repository.
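The steps above might look roughly like the following sketch. Only the three method names and the two dictionary names come from this thread; the base class `CodeSegmenter`, the `ToySegmenter` class, the made-up "toy" language (blocks run from `func NAME` to `end`), and the helper `segmenter_for` are assumptions for illustration, and the real dictionaries in the repository may be keyed differently.

```python
import re
from abc import ABC, abstractmethod
from pathlib import Path


class CodeSegmenter(ABC):
    """Hypothetical base class exposing the three methods named above."""

    def __init__(self, code: str):
        self.code = code

    @abstractmethod
    def is_valid(self) -> bool: ...

    @abstractmethod
    def extract_functions_classes(self) -> list[str]: ...

    @abstractmethod
    def simplify_code(self) -> str: ...


class ToySegmenter(CodeSegmenter):
    """Segmenter for a made-up language: 'func NAME ...' up to a line 'end'."""

    _BLOCK = re.compile(r"^func \w+.*?^end$", re.MULTILINE | re.DOTALL)

    def is_valid(self) -> bool:
        # Every 'func' opener must be closed by its own top-level 'end'.
        openers = re.findall(r"^func \w+", self.code, re.MULTILINE)
        return len(openers) == len(self._BLOCK.findall(self.code))

    def extract_functions_classes(self) -> list[str]:
        return self._BLOCK.findall(self.code)

    def simplify_code(self) -> str:
        # Replace each block with a comment carrying only its header line.
        return self._BLOCK.sub(
            lambda m: "# Code for: " + m.group(0).splitlines()[0], self.code
        )


# Registration step: language -> segmenter, and file extension -> language.
LANGUAGE_SEGMENTERS = {"toy": ToySegmenter}
LANGUAGE_EXTENSIONS = {"toy": "toy"}


def segmenter_for(filename: str, code: str) -> CodeSegmenter:
    """Pick a segmenter from the file extension (hypothetical helper)."""
    ext = Path(filename).suffix.lstrip(".")
    return LANGUAGE_SEGMENTERS[LANGUAGE_EXTENSIONS[ext]](code)


SAMPLE = "let x = 1\nfunc add a b\n  return a + b\nend\nprint x\n"
seg = segmenter_for("math.toy", SAMPLE)
print(seg.is_valid())
print(seg.extract_functions_classes())
print(seg.simplify_code())
```

A segmenter backed by a real parser would implement the same three methods; the regular expression here only stands in for that parser.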

Regarding your question about the most commonly used languages that are supported by LangChain but do not currently have a language parser, I'm afraid I couldn't find specific information about this in the repository. However, you might want to consider adding support for languages like C++, Java, and Ruby, which are widely used in the software development industry.

Please note that while I strive to provide accurate and up-to-date information, there may be a better answer that I wasn't able to find. If you need further assistance or have any other questions, feel free to ask.

Best regards, Dosu

ThatsJustCheesy commented 1 year ago

@Megabear137 TIOBE index or similar might be helpful for us to gauge that

I personally am somewhat skeptical about the usefulness of TIOBE for predicting general trends, but for such low-stakes decisions, doubt it would really hurt

ThatsJustCheesy commented 1 year ago

The supposed current most-used languages, ignoring stuff it doesn't make sense for us to support:

  • Python (done)
  • C
  • C++
  • Java
  • C#
  • JavaScript (done)
  • Visual Basic (??)
  • PHP
  • Go
  • Pascal
  • MATLAB
  • Swift
  • Fortran
  • R
  • Kotlin
  • Ruby
  • Rust
  • Ada
  • COBOL (??)
  • Lua
  • Perl
  • Julia
  • D
  • Dart
  • Haskell
  • Objective-C
  • F#

We obviously might not do all of these, but a good selection would be helpful.

If anyone reading this wants support for a particular language, please comment!

ehartford commented 1 year ago

Hello, I need to add a LanguageParser for COBOL. Can you please advise how I can implement this?

ThatsJustCheesy commented 1 year ago

@ehartford aha, that's one I removed from the list. I thought it would be too arcane 😅

If either of these grammars works as advertised:

then it should be easy enough to create a COBOL LanguageParser using the framework we're creating. If we find time (as we're doing this for a school project), we can add it to our eventual PR as well.

ehartford commented 1 year ago

https://github.com/langchain-ai/langchain/pull/11674

Mario928 commented 1 year ago

@ThatsJustCheesy I'd like to contribute Java support. I have two questions: 1. Is it fine if I use javalang instead of the tree-sitter parser library? 2. Can I open the PR directly here and share the link, or do you prefer some other way?

Also, nice initiative. Hoping to hear from you soon.

ThatsJustCheesy commented 1 year ago

Hi @Mario928, you could definitely use some other parsing library/libraries. We aren't trying to homogenize all LanguageParser segmenter implementations, merely add new ones where gaps existed previously :)

If you want to work on a Java implementation, we'll avoid working on one. (And it would be nice to link your PR here when you open one)

jaevans commented 1 year ago

I was looking to add a golang parser (in go) and wasn't sure how to integrate it as a general solution. I really like the tree-sitter idea to cover a bunch of languages at once. I'd be happy to help with this as it aligns with my interest in langchain.

Mario928 commented 1 year ago

Thanks for confirming, @ThatsJustCheesy. I will try to implement a Java parser using javalang and share the PR link here in a few days.

ThatsJustCheesy commented 1 year ago

C++ parser is done on our side.

jelalalamy commented 1 year ago

Hey! I'd like to contribute for C# if possible.

LeilaChr commented 1 year ago

Ruby parser is done on our side.

Megabear137 commented 1 year ago

Just wanted to mention here that I'll be working on a C Parser.

LeilaChr commented 1 year ago

Scala parser is done.

Harrolee commented 1 year ago

Anyone started on a TypeScript parser?

Harrolee commented 1 year ago

I found your branch. I'll use your template to add a TypeScript parser.

ThatsJustCheesy commented 1 year ago

We intend to submit our PR sometime this week

Mario928 commented 1 year ago

I will submit a PR for the Java parser within 4 days at most, so that we can then submit the final PR. Sorry for the delay; I have already started, I just got caught up with some personal stuff.

Harrolee commented 1 year ago

Submitted a PR for TypeScript support.

Mario928 commented 12 months ago

Submitted a PR for the Java language. @ThatsJustCheesy, please verify and accept this PR.

alexrmacleod commented 10 months ago

@ThatsJustCheesy please do Solidity and unlock the power of blockchain programming for everyone. Or please give me the steps for creating a working Solidity parser and I will try! Thank you.

Microsvuln commented 10 months ago

I wonder if C/C++ support has been merged already? @ThatsJustCheesy

ThatsJustCheesy commented 10 months ago

@Microsvuln Our pull request is still open: #13318

eickeBuecking commented 10 months ago

Is there any update on if and when this will get merged? Would love to see additional languages... Thanks in advance!

ThatsJustCheesy commented 10 months ago

@eickeBuecking We are waiting for a maintainer to review. If you want it prioritized, you could ping one of them in the pull request thread :)

rawandahmad698 commented 4 months ago

We need Swift support.

khushiDesai commented 3 weeks ago

Hi @rawandahmad698, I am Khushi, a 4th year student at UofT CS. I'm working with my teammates @anushak18, @ashvini8, and @ssumaiyaahmed, who are also 4th year students at UofT CS. We would like to take the initiative to work on incorporating Swift support.