langchain4j / langchain4j

Java version of LangChain
https://docs.langchain4j.dev
Apache License 2.0
4.07k stars 790 forks source link

[FEATURE] [💎 $400 bounty] Integrate SearchApi as a WebSearchEngine and as a Tool #1132

Open mark-searchapi opened 2 months ago

mark-searchapi commented 2 months ago

Feature Overview

Integrate SearchApi as a WebSearchEngine and as a Tool for function calling.

Requirements

Adhere to langchain4j contribution guidelines

Related work

Related issues:

Existing SearchApi integrations

Design considerations

SearchApi support not only Google Search, but 30+ other APIs such as Youtube Search, Transcripts, Bing Search (similar JSON response keys). In a statically typed language, it is probably better to have separate tools for separate APIs, but implementation could be flexible enough to extend SearchApi and easily adapt to other engines. All engines use the same HTTP GET request. The only difference is the parameters they accept. And how you want the response to be parsed.

Other notes

Bounty

There is a $400 bounty on this awarded by SearchApi to the community.

In the event of multiple attempts, the one that is going to be merged by the maintainers of the repository will pick the bounty.

algora-pbc commented 2 months ago

💎 SearchApi is offering a $400 bounty for this issue 👉 Got a pull request resolving this? Claim the bounty by adding @algora-pbc /claim #1132 in the PR body and joining algora.io

abhishek818 commented 2 months ago

@mark-searchapi can i get this assigned?

mark-searchapi commented 2 months ago

Given the speed at which we're moving, we don't assign issues or "give" issues to anyone.

@abhishek818 In the event of multiple attempts, the one that is going to be merged by the maintainers of the repository will pick the bounty.

jemiluv8 commented 2 months ago

Hello @mark-searchapi, I just started work on the SearchApi integration as a search engine and tool. If anyone else ( and @abhishek818 ) has already started work on this please let me know as soon as you can.

langchain4j commented 2 months ago

Please adhere to these two implementations: https://github.com/langchain4j/langchain4j/tree/main/web-search-engines

jemiluv8 commented 2 months ago

Yes of course.

ayewo commented 2 months ago

If anyone else ( and ...) has already started work on this please let me know as soon as you can.

@jemiluv8 I've but I've not publicly added an "attempt", yet. Since you've also started with the issue, what difference does it make if you know or not?

jemiluv8 commented 2 months ago

@ayewo. Just wanted to know so I may halt my attempt since I just started. I'm aborting my attempt now so as to not duplicate your efforts. Happy hunting ...

Ahmadkhan02 commented 2 months ago

Hi, was setting this up. Just confused as to how I am supposed to set up env like OPENAI_API_KEY.

Ahmadkhan02 commented 2 months ago

Hi, was setting this up. Just confused as to how I am supposed to set up env like OPENAI_API_KEY.

I have raised a draft PR with my current work. Need help with this to continue. Also pls assign this to me if satisfactory

ayewo commented 2 months ago

@Ahmadkhan02

The comment by @mark-searchapi above clearly says:

Given the speed at which we're moving, we don't assign issues or "give" issues to anyone.

langchain4j commented 1 month ago

@abhishek818 why do you need open ai key? In any way, you can register with openai and get it there: https://github.com/langchain4j/langchain4j?tab=readme-ov-file#how-to-get-an-api-key

ayewo commented 1 month ago

Hi @mark-searchapi Looking at the docs for the Google Search API, it seems the API will return JSON to match the type of search query. In other words, the endpoint returns differently shaped JSON depending on whether the search is for a well-known fact, a local business, a product search etc.

The docs itself lists about 34 API examples which I pulled out into the list of JSON elements below:

Since the task here is to "Integrate SearchApi as a WebSearchEngine and as a Tool for function calling", the common format between the two classes is the WebSearchResults object.

The WebSearchResults object in turn is hydrated from 3 things from a search:

"list of organic search results, information about the search, and pagination information"

Can you clarify which of the JSON elements are to scope and which ones are to be skipped for this issue?

zambrinf commented 1 month ago

Hey @mark-searchapi, looks like the num parameter for pagination is not working properly, the endpoint always returns 4 results if the parameter 5 is present, could you take a look? This happens when using the query "LangChain4j", which is being tested in WebSearchEngineIT

Ahmadkhan02 commented 1 month ago

how many tests should one write for this? and also is validation of parameters based on selected 'engine' required?

mark-searchapi commented 1 month ago

@ayewo I have reviewed the Langchain4j codebase and can provide the following insights:

WebSearchEngine as ContentRetriever Integration (for RAG use case)

After https://github.com/langchain4j/langchain4j/pull/642 PR, the core implementation seems to provide:

Cons:

I think the most important part of WebSearchResults is WebSearchOrganicResult. To construct it, we iterate through the organic_results and take title, snippet and link. In this particular use case, none of the other fields seem to matter.

SearchApiWebSearchEngine should allow configuring the engine to easily swap between Google, Bing, Google News, Baidu (we will release this week) and later other search engines. The JSON structure won't change here and will always stay the same on SearchApi side as:

{
  "organic_results": [
    {
      "title": "String",
      "link": "String",
      "snippet": "String",
      ...
    },
    ...
  ]  
}

If you need error handling, rely on the error JSON key presence in the response.

After implementing SearchApi as a core WebSearchEngine, the basic tool should work out of the box:

 googleSearchApi = SearchApiWebSearchEngine.builder()
                .apiKey(System.getenv("SEARCHAPI_API_KEY"))
                .engine("google")
  WebSearchTool webSearchTool = WebSearchTool.from(googleSearchEngine);
...

Integration as a Tool (Function calling)

WebSearchTool is built into the langchain4j core and as mentioned should work outside the box.

However, it does not seem to be very useful when used as a function since it constructs the String for LLMs only based on organic_results as seen here.

I think we should create independent tool definitions and start with a couple of engines for this issue:

This way we can control the final string that is being built. For instance, in youtube_transcripts engine, we are interested only in transcript text. You can also check sample Google Search string construction and Google News example in our recent other integration. You can also add extra fields to the tools that you think might be important for LLMs.

Documentation and Examples

Throwing some ideas:

Most of the stuff (apart custom tools) is already implemented in langchain4j Google Custom Search directory. Tests contains hints on how to implement examples for documentation (how to use web search engine, content retriever, tool, etc..).

Hope the above information helps!

zambrinf commented 1 month ago

@mark-searchapi I opened a PR with the idea of adding new search api engines in the future using an interface that handles the requests and responses

mark-searchapi commented 1 month ago

Hey @mark-searchapi, looks like the num parameter for pagination is not working properly, the endpoint always returns 4 results if the parameter 5 is present, could you take a look? This happens when using the query "LangChain4j", which is being tested in WebSearchEngineIT

@zambrinf, you can try using the default num parameter and verify that the total amount of organic results is greater than 0. This issue happens with the "LangChain4j" query because Google counts different elements toward the num parameter. WebSearchEngine results only consider organic results, so elements like inline videos are not included as organic results.

ayewo commented 1 month ago

Hey @mark-searchapi, looks like the num parameter for pagination is not working properly, the endpoint always returns 4 results if the parameter 5 is present, could you take a look? This happens when using the query "LangChain4j", which is being tested in WebSearchEngineIT

@zambrinf, you can try using the default num parameter and verify that the total amount of organic results is greater than 0. This issue happens with the "LangChain4j" query because Google counts different elements toward the num parameter. WebSearchEngine results only consider organic results, so elements like inline videos are not included as organic results.

@mark-searchapi I’ve already implemented it as you described in my PR. Please have a look at #1215.

czelabueno commented 1 month ago

Hi @mark-searchapi and everybody, It looks as a interesting challenge and I think that SearchAPI would be helpful for a lot of users. I was who define the v1 of WebSearchEgine, I would like to clarify some premises to take into account in your PRs and provide additional info to what is already explained here

WebSearchEngine v1 was thought to retrieve organic web content where we can include (searchAPI and imagesAPI) directly from the web search engine (google, bing, yahoo) or using a wrapper solution that searches for the same thing. It does not include news, shopping, videos, maps, books, etc.) since they are another type of object. However it could be relevant to some users and could be included in the v2 that dynamically includes an object structure of other content categories.

Speaking only of the organic contents. If the integration supports URL scrapping and is capable of retrieving the complete content of the website in text, it should populate it in 'WebSearchOrganicResult.content'. If integration can only extract one snippet into its corresponding WebSearchOrganicResult.snippet field.

For now, the Tool and the WebSearchContentRetriever do not include the metadata field of either the WebSearchContentRetriever.metadata or the WebSearchOrganicResult.metadata (we did not see it necessary until now). However, it is planned in v2 to include it and each implementation can set which metadata is fetched.

You can check in this post the flow included in the v1 https://x.com/c_zela/status/1785522559791808650

I hope it gave a little more context and I welcome your ideas to include in v2

I'll put some comments on your PRs :)

Thank you!

langchain4j commented 1 month ago

@czelabueno I will then assign both PRs to you? 🙏

Heezer commented 2 weeks ago

Hello gentlemen, #1216 has been reviewed and receives a green light from my side, #1215 still requires changes. Decision about the winner, who "takes it all" 😄 is yours 😄

langchain4j commented 2 weeks ago

@Heezer thanks a lot for the reviews! 🙏

SebastjanPrachovskij commented 2 weeks ago

Hey everyone,

We (SearchApi) don't know Java well, but we will review PRs from a logical standpoint this week. 😊

ayewo commented 2 weeks ago

@SebastjanPrachovskij aha, finally 🙂. Looking forward to your feedback.

I believe I've addressed all of the issues—most of which were trivial code changes—that @Heezer pointed out during review in #1215. Thanks for your review.

czelabueno commented 1 week ago

Hello everybody! I'll have more time this week, so can I add a second review if I'm still on time @langchain4j ? I understand that there 2 PRs open so far #1215 and #1216 right?

in advance, thank you for the great job you have done in your PRs @ayewo @zambrinf 🙌🏼

ayewo commented 1 week ago

@czelabueno

I understand that there 2 PRs open so far https://github.com/langchain4j/langchain4j/pull/1215 and https://github.com/langchain4j/langchain4j/pull/1216 right?

Correct.

Nice that you are also offering to review the PRs, but no pressure if you can't make it.

Thanks!

langchain4j commented 1 week ago

@czelabueno please do if you'll have time 🙏

ayewo commented 1 week ago

Hello @langchain4j,

TL;DR

We’ve gotten to a point where you need to take a final decision between #1215 and #1216 as repo owner.

Both PRs have been open for 6+ weeks now. Deciding now would relieve stress for both PR authors and more importantly: cut the down the time needed for subsequent reviews in half i.e. @czelabueno and anyone else would only need to make time to review 1 PR instead of 2.

Please, all you have to do is spend some time going over the two PRs with your maintainer hat on. You don't have to leave individual comments.

When done, simply close the PR that will not be merged with the comment:

Closing #this-PR in favor of #that-PR due to <reasons>

Pull Request Assessment

The sponsor explicitly stated that “In the event of multiple attempts, the one that is going to be merged by the maintainers of the repository will pick the bounty”, but 6+ weeks after the first PR was opened there are still 2 competing PRs because the decision on which of the two PRs should be accepted has been deferred for too long. (I totally understand that this is due to the difficulty of getting a maintainer to make such an assessment.)

After looking at the current code in #1216, I have a few concerns about how the assessment itself will be made, as I will explain.

Development Feedback

@Heezer was the first to provide lots of feedback on both PRs (thanks once again for your time!) He dwelt primarily on adherence to coding conventions, java docs, test failures, linting etc.

In other words, his feedback mostly raised development concerns, it didn’t raise any concerns with respect to design i.e. things like how the library will be used by other developers and future maintainability.

Design Feedback

The PR feedback by @mark-searchapi on both PRs focused mainly on design: tight coupling, how the code will be used in practice etc.

The extensive feedback he left in #1216 had the side-effect of completely reshaping the design of that PR. Essentially what I’m saying is that the code at the start of the PR is markedly different from the code there now:

PR Initial PR Current PR (after Design Feedback)
#1215 491a655 latest
#1216 1080630 latest

If you look at commit 1080630 above, you'll see that a core part of the design uses the Factory pattern, with each search engine requiring its own enum and 3 supporting classes. This made the library very tightly coupled to the search engines it supports.

In this design, the addition of a youtube, bing, baidu search engine would require adding enums to the appropriate class, updating the Factory and of course adding 3 supporting classes for each search engine. In other words, a total of 9 new classes will need to be added to support 3 search engines.

In spite of the absence of Java expertise[^1][^2] among folks from the SearchApi team, @mark-searchapi was still able to point out the original design as being too restrictive to support multiple search engines. It also meant they would have to update the Java library each time there is an upcoming search engine.

More Similar than Different

With the expert guidance of @mark-searchapi, his code has eventually converged on the same design I started with when I opened my PR.

Both PRs are now closer in design than when they were first opened. It’s now a really close call. It can go either way.

As things stand, the decision on which PR gets merged into the langchain4j repo has changed from technical merit (objective) to feel (subjective).

By feel I mean it will now come down to which coding style is liked best rather than the original differentiator of technical merit—which PR has the least coupling that can support multiple search engines and yet to be announced search engines.

Of course I would be stating the obvious if I said I’m now at a disadvantage here: I figured out the design for minimal coupling on my own but the other PR author didn’t have to do as much work: his current design was handed down to him during design review.

Please Pick a Winner

Now that it is a close call, there’s no point in kicking the can down the road any further. The maintainers need to pick a winner.

Rather use the latest commit, I think the only equitable thing at this point is to use a point-in-time commit (prior to design review) to assess both PRs.

I think this will save everyone involved any further anguish or sense of false hope …

The sponsors can then further shape the code of the accepted PR prior to it being merged.

(If you got to the end, thanks for attending my TED talk :))

[^2]: The example above is crafted with the help of ChatGPT as I don't know enough Java myself.

langchain4j commented 1 week ago

Hi @ayewo, since I have limited time and other priorities, I will be able to provide more information only next week. Thank you for your understanding.

czelabueno commented 5 days ago

Hi @ayewo and @zambrinf ! I've added some reviews to https://github.com/langchain4j/langchain4j/pull/1216 and https://github.com/langchain4j/langchain4j/pull/1215 and they still require changes.

You've done a good job so far. We're pretty close 👍

ayewo commented 5 days ago

Hey @czelabueno, Thanks for you review🙏!

@langchain4j wont be able to decide on which PR he will be accepting until next week, so I'll wait till then before working on the changes you requested.

langchain4j commented 9 hours ago

Hi @mark-searchapi and @SebastjanPrachovskij, please let us know if you have any preference in terms of completeness/supported functionality

langchain4j commented 6 hours ago

So, after reviewing both PRs, it is indeed a very close call. In my view, both authors did a good job and deserve the bounty. So, I would either split the bounty 50/50 or make a choice using some online randomizer. What do you think?

ayewo commented 33 minutes ago

@langchain4j

In the rules defined by the sponsors, they made it clear that you will be the one to decide the winner if there are multiple PRs:

In the event of multiple attempts, the one that is going to be merged by the maintainers of the repository will pick the bounty.

But you are now deferring to the sponsors 🤷‍♂️?

Anyway, suggesting a 50/50 split wouldn't be fair to me AT ALL because the 2 PRs were not even close when they were opened in June. That they are close now is a much recent event.