Closed ThatsJustCheesy closed 8 months ago
🤖
It's great to hear that you're interested in contributing to LangChain! Adding support for a variety of languages to LanguageParser would definitely be a valuable addition to the framework. It's wonderful that you already have a proposed solution in mind: integrating a generic parsing library like tree-sitter.
I encourage you to go ahead and submit a pull request for this feature request. The maintainers will review your contribution and provide feedback. Thank you for your willingness to contribute to the LangChain project!
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot Hi again. We're just wondering if you are able to identify any past issues similar to this one? We would like to make sure our work addresses any previous concerns in this area.
🤖
Hello,
Thank you for your interest in contributing to the LangChain project. Your idea of integrating a generic parsing library like tree-sitter to add support for a variety of languages sounds promising.
Based on the information I have, there have been no previous attempts to integrate a generic parsing library like tree-sitter into the LangChain project. However, the current language support in the LangChain project includes a wide range of languages, including C++, Go, Java, Kotlin, JavaScript, TypeScript, PHP, Protocol Buffers, Python, reStructuredText, Ruby, Rust, Scala, Swift, Markdown, LaTeX, HTML, Solidity, and C#.
The current method used for parsing different languages in the LangChain project is through the LanguageParser class. This class uses the respective programming language syntax to parse the code. It loads each top-level function and class in the code into separate documents and generates an extra document containing the remaining top-level code that excludes the already segmented functions and classes.
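To make the behaviour described above concrete, here is a minimal sketch for Python source that uses only the standard-library ast module. It is not LangChain's implementation: the function name split_top_level and the "# Code for:" placeholder format are assumptions chosen for illustration.

```python
import ast


def split_top_level(source: str) -> tuple[list[str], str]:
    """Return (segments, remainder) for a piece of Python source.

    segments  -- one string per top-level function or class
    remainder -- the source with each segment replaced by a one-line placeholder
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    kept: list = list(lines)  # entries set to None are dropped at the end
    segments: list[str] = []

    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # Include decorators, which sit above the def/class line.
        first = min([node.lineno] + [d.lineno for d in node.decorator_list]) - 1
        last = node.end_lineno  # exclusive index into `lines`
        segments.append("\n".join(lines[first:last]))
        # Hypothetical placeholder format: keep only the signature line as a comment.
        kept[first] = f"# Code for: {lines[node.lineno - 1]}"
        for i in range(first + 1, last):
            kept[i] = None

    remainder = "\n".join(line for line in kept if line is not None)
    return segments, remainder


SAMPLE = '''import os

def greet(name):
    return "hi " + name

class Box:
    pass

print(greet("x"))
'''

segments, remainder = split_top_level(SAMPLE)
```

Here segments holds the greet function and the Box class as two separate strings, while remainder keeps the import, the final print call, and a placeholder comment for each extracted block.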
The LanguageParser class uses a dictionary LANGUAGE_SEGMENTERS to map the language to its respective segmenter class. Currently, Python and JavaScript are supported, and their segmenters are PythonSegmenter and JavaScriptSegmenter respectively.
If you plan to integrate a generic parsing library like tree-sitter, you would likely need to modify the LanguageParser class and the LANGUAGE_SEGMENTERS dictionary to accommodate the new parsing method. You would also need to ensure that the new parsing method can handle the segmentation of code into functions and classes, as this is a key feature of the current parsing method.
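One way to picture why a generic parser helps: if every language's parse tree exposes top-level nodes with a type name and a source range, a single segmentation routine can serve many languages, with only a small per-language table of node types. The sketch below assumes exactly that. The Node dataclass is a hypothetical stand-in for whatever the parsing library returns (it is not the tree-sitter API), and CHUNK_TYPES and its node-type names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a top-level parse-tree node."""
    type: str   # node kind reported by the parser
    start: int  # character offset where the node begins
    end: int    # character offset just past the node


# Hypothetical per-language configuration: which node kinds become documents.
CHUNK_TYPES = {
    "c": {"function_definition", "struct_specifier"},
}


def segment(source: str, top_level_nodes: list[Node], language: str) -> tuple[list[str], str]:
    """Split source into (chunks, rest) using only node types and ranges."""
    wanted = CHUNK_TYPES[language]
    chunks: list[str] = []
    rest: list[str] = []
    cursor = 0
    for node in top_level_nodes:
        if node.type in wanted:
            chunks.append(source[node.start:node.end])
            rest.append(source[cursor:node.start])  # keep what precedes the chunk
            cursor = node.end
    rest.append(source[cursor:])  # keep what follows the last chunk
    return chunks, "".join(rest)


# Build the node list by hand; a real parser would supply it.
FUNC = "int add(int a, int b) { return a + b; }"
SOURCE = "#include <stdio.h>\n" + FUNC + "\nint counter = 0;\n"
func_start = SOURCE.index(FUNC)
NODES = [
    Node("preproc_include", 0, func_start),
    Node("function_definition", func_start, func_start + len(FUNC)),
    Node("declaration", func_start + len(FUNC) + 1, len(SOURCE)),
]

chunks, rest = segment(SOURCE, NODES, "c")
```

Supporting another language in this design would mean adding one entry to the table rather than writing a new parser, which is the appeal of the generic-library approach.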
Unfortunately, I couldn't find any specific method for adding a new language to the LangChain project. However, based on the current structure of the project, it seems that you would need to add the new language to the Language Enum class and provide a corresponding segmenter class.
I hope this information helps you in your contribution. If you have any more questions or need further clarification, feel free to ask.
Best regards.
@dosu-bot Hey there. From the languages that LangChain supports but does not currently have a language parser for, which are the most commonly used?
Hey @Megabear137 as of now Dosu only responds to the author of the thread to reduce noise. We're working on a feature to let users "fork" issues for personal use.
Let me know if you have any other ideas/feedback!
@dosu-bot Hey there. From the languages that LangChain supports but does not currently have a language parser for, which are the most commonly used?
🤖
Hello,
Thank you for your interest in contributing to the LangChain project. Your idea of expanding the LanguageParser to support more languages is a great initiative.
As per your request, I would like to provide some information on how to add a new language to the LanguageParser. The process involves creating a new segmenter for the language you want to add. This segmenter should be a class that can parse the syntax of the new language and should have methods is_valid(), extract_functions_classes(), and simplify_code(). Once the segmenter is created, you need to add a mapping from the new language to the new segmenter in the LANGUAGE_SEGMENTERS dictionary and a mapping from the file extension of the new language to the new language in the LANGUAGE_EXTENSIONS dictionary. You can find more details about this process in the LangChain repository.
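The steps above can be sketched as follows. This is a self-contained illustration, not code from the LangChain repository: ToySegmenter handles a made-up language in which a block opens with func or class at column 0 and closes with end at column 0, and the two dictionaries here are local stand-ins that only borrow the names LANGUAGE_SEGMENTERS and LANGUAGE_EXTENSIONS. The "ty" extension and the "// Code for:" placeholder are likewise assumptions.

```python
import re
from typing import Iterator


class ToySegmenter:
    """Segmenter for a made-up language (hypothetical, for illustration)."""

    _OPEN = re.compile(r"^(func|class)\s+\w+")

    def __init__(self, code: str):
        self.code = code

    def _blocks(self) -> Iterator[tuple[int, int]]:
        """Yield (start, end) line indices of top-level blocks, end exclusive."""
        start = None
        for i, line in enumerate(self.code.splitlines()):
            if start is None and self._OPEN.match(line):
                start = i
            elif start is not None and line.rstrip() == "end":
                yield start, i + 1
                start = None
        if start is not None:
            raise ValueError("unterminated block")

    def is_valid(self) -> bool:
        """Report whether every opened block is closed."""
        try:
            list(self._blocks())
            return True
        except ValueError:
            return False

    def extract_functions_classes(self) -> list[str]:
        """Return each top-level block as its own string."""
        lines = self.code.splitlines()
        return ["\n".join(lines[s:e]) for s, e in self._blocks()]

    def simplify_code(self) -> str:
        """Return the code with each block replaced by a one-line placeholder."""
        lines = self.code.splitlines()
        out: list[str] = []
        cursor = 0
        for s, e in self._blocks():
            out.extend(lines[cursor:s])
            out.append(f"// Code for: {lines[s]}")
            cursor = e
        out.extend(lines[cursor:])
        return "\n".join(out)


# Local stand-ins for the two registries described above.
LANGUAGE_SEGMENTERS = {"toy": ToySegmenter}  # language -> segmenter class
LANGUAGE_EXTENSIONS = {"ty": "toy"}          # file extension -> language

SAMPLE = """let x = 1
func add a b
  return a + b
end
class Box
end
print x
"""

language = LANGUAGE_EXTENSIONS["ty"]
segmenter = LANGUAGE_SEGMENTERS[language](SAMPLE)
```

Looking the segmenter up by extension and then by language mirrors the two mappings described above, so adding a language touches only the new class and one entry in each dictionary.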
Regarding your question about the most commonly used languages that are supported by LangChain but do not currently have a language parser, I'm afraid I couldn't find specific information about this in the repository. However, you might want to consider adding support for languages like C++, Java, and Ruby, which are widely used in the software development industry.
Please note that while I strive to provide accurate and up-to-date information, there may be a better answer that I wasn't able to find. If you need further assistance or have any other questions, feel free to ask.
Best regards, Dosu
@Megabear137 TIOBE index or similar might be helpful for us to gauge that
I personally am somewhat skeptical about the usefulness of TIOBE for predicting general trends, but for such low-stakes decisions, I doubt it would really hurt.
The supposed current most-used languages, ignoring stuff it doesn't make sense for us to support:
We obviously might not do all of these, but a good selection would be helpful.
If anyone reading this wants support for a particular language, please comment!
Hello, I need to add a LanguageParser for COBOL. Can you please advise how I can implement this?
@ehartford aha, that's one I removed from the list. I thought it would be too arcane
If either of these grammars work as advertised:
then it should be easy enough to create a COBOL LanguageParser using the framework we're creating. If we find time (as we're doing this for a school project), we can add it to our eventual PR as well.
@ThatsJustCheesy I'd like to contribute a parser for the Java language. I have two questions: 1. Is it fine if I use javalang instead of the tree-sitter parser library? 2. Can I raise the PR directly here and share the link, or do you prefer another way?
Also, nice initiative. Hoping to hear from you soon.
Hi @Mario928, you could definitely use some other parsing library/libraries. We aren't trying to homogenize all LanguageParser segmenter implementations, merely add new ones where gaps existed previously :)
If you want to work on a Java implementation, we'll avoid working on one. (And it would be nice to link your PR here when you open one)
I was looking to add a golang parser (in go) and wasn't sure how to integrate it as a general solution. I really like the tree-sitter idea to cover a bunch of languages at once. I'd be happy to help with this as it aligns with my interest in langchain.
Thanks for confirming, @ThatsJustCheesy. I will try to implement a Java parser using javalang and share the PR link here in a few days.
C++ parser is done on our side.
Hey! I'd like to contribute for C# if possible.
Ruby parser is done on our side.
Just wanted to mention here that I'll be working on a C Parser.
Scala parser is done.
Anyone started on a TypeScript parser?
I found your branch. I'll use your template to add a TypeScript parser.
We intend to submit our PR sometime this week
I will submit the PR for the Java parser within 4 days at most, so that we can then submit the final PR. Sorry for the delay; I have already started, but got caught up with some personal stuff.
Submitted a PR for the Java language, @ThatsJustCheesy. Please verify and accept this PR.
@ThatsJustCheesy please do Solidity and unlock the power of blockchain programming for everyone. Or please provide me with the steps to create a working Solidity parser and I will try! Thank you.
I wonder if C/C++ support has been merged already? @ThatsJustCheesy
@Microsvuln Our pull request is still open: #13318
Is there any update on if and when this will get merged? Would love to see additional languages... Thanks in advance!
@eickeBuecking We are waiting for a maintainer to review. If you want it prioritized, you could ping one of them in the pull request thread :)
We need Swift support.
Hi @rawandahmad698, I am Khushi, a 4th year student at UofT CS. I'm working with my teammates @anushak18, @ashvini8, and @ssumaiyaahmed, who are also 4th year students at UofT CS. We would like to take the initiative to work on incorporating Swift support.
Feature request
LanguageParser is a parser for Document Loaders that, given source code, splits each top-level function or class into separate documents. As stated in its documentation:
We would like to add support for additional languages, such as C, C++, Rust, Ruby, Perl, and so on.
Motivation
There is an open request for "CPP" (presumably C++) support. By integrating a generic parsing library (such as tree-sitter), we could make LanguageParser work with many more languages, and thus be more generally useful.
Your contribution
We intend to submit a pull request for this issue no later than mid-November, and likely sooner.