patterns-ai-core / langchainrb

Build LLM-powered applications in Ruby
https://rubydoc.info/gems/langchainrb
MIT License

Add integration and/or system tests #144

Open · mattlindsey opened 1 year ago

mattlindsey commented 1 year ago

I think we need the ability to add and run 'integration' tests that exercise interactions between high-level components and use actual APIs and keys. They would be run only on request and could be run before each release.

Start with a simple question to ChainOfThought with OpenAI, like in the README, with the expectation that the result should be similar but not exactly equal to the result given in the README, since I assume the AI can respond slightly differently each time the test is called.

mattlindsey commented 1 year ago

I'm going to try implementing a simple https://cucumber.io/ test. It might work well here, but if it doesn't add value, we don't have to use it:


Feature: Chain Of Thought
  Decompose multi-step problems into intermediate steps

  Scenario: Multistep with distance calculation
    Given I want to know a difficult distance calculation
    When I ask "How many full soccer fields would be needed to cover the distance between NYC and DC in a straight line?"
    Then I should be told something like "Approximately 2,945 soccer fields"
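
Step definitions could look roughly like this. The agent construction is only a sketch based on the README of the time and may need adjusting; the fuzzy Then assertion is the interesting part:

require "langchain"

# features/step_definitions/chain_of_thought_steps.rb
# Sketch only: check the README for the current ChainOfThoughtAgent signature.
Given("I want to know a difficult distance calculation") do
  llm = Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])
  @agent = Langchain::Agent::ChainOfThoughtAgent.new(llm: llm, tools: ["calculator"])
end

When("I ask {string}") do |question|
  @answer = @agent.run(question: question)
end

Then("I should be told something like {string}") do |expected|
  # The model's wording varies between runs, so assert on the stable words
  # from the expected phrase rather than on exact equality.
  expected.scan(/[A-Za-z]{5,}/).each do |word|
    expect(@answer.downcase).to include(word.downcase)
  end
end
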
andreibondarev commented 1 year ago

@mattlindsey Do you envision that this would actually run in CI?

I'm also struggling a bit to figure out what value these feature tests would bring to this library.

mattlindsey commented 1 year ago

If you run them in CI, I think you'd catch errors sooner; for instance, I think there's a gem dependency error right now (might be wrong). Also, the agents are fairly high-level, so testing their interaction with other components through integration tests is certainly necessary somewhere, I think.

andreibondarev commented 1 year ago

I hope @technicalpickles doesn't mind that I pull him in. There was a mention of executing Jupyter notebooks or README code snippets in Discord. Would you happen to have any thoughts here?

mattlindsey commented 1 year ago

Also see the couple of tests I implemented here, to give a better idea: https://github.com/andreibondarev/langchainrb/pull/145

mattlindsey commented 1 year ago

And for a wider range of testing it would be good if someone implemented Langchain::LLM::HuggingFace#complete.

technicalpickles commented 1 year ago

Start with a simple question to ChainOfThought with OpenAI, like in the README, with the expectation that the result should be similar but not exactly equal to the result given in the README, since I assume the AI can respond slightly differently each time the test is called.

I was doing a course on deeplearning.ai that talked about how, if you set temperature=0, you should get the same results. The course was taught using Jupyter notebooks, and the results they got doing the exercises matched what the AI was returning when I ran them in the notebooks. I think it can be considered relatively stable?
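
For example (assuming the OpenAI wrapper forwards extra parameters like temperature through to the API, which I haven't verified):

llm = Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])
# Pin temperature to 0 so repeated runs return (mostly) identical output.
llm.complete(prompt: "How many soccer fields fit between NYC and DC?", temperature: 0)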

There was a mention of executing Jupyter notebooks or README code snippets in Discord. Would you happen to have any thoughts here?

Yep! Here is what I suggested:

I've been thinking about getting the code in the README and in examples to run as part of CI. I did something like that for openfeature-sdk (https://github.com/open-feature/ruby-sdk/pull/40). I think the challenge for the README is making sure each fragment is complete enough to run, as well as having the right environment variables to make the call.
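
The shape of it, as a rough sketch (not the actual PR code):

# spec/readme_spec.rb -- run each ```ruby fence from the README so CI catches
# snippets that no longer execute. Fragment completeness and env vars are the
# hard part, as noted above.
RSpec.describe "README examples" do
  snippets = File.read("README.md").scan(/```ruby\n(.*?)```/m).flatten

  snippets.each_with_index do |code, i|
    it "runs fenced example ##{i + 1}" do
      expect { eval(code) }.not_to raise_error # rubocop:disable Security/Eval
    end
  end
end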


In both cases, I'm starting to think we could get pretty far by stubbing the response from the LLM. That could help cover everything leading up to the request. The most common way I've done this is with VCR and/or webmock. The main downside there is that it doesn't capture changes that happen on the remote end, obviously. If we are using existing libraries to do those interactions, though, it's probably a pretty good tradeoff.
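
For instance, the webmock flavor would look something like this (endpoint and response body shapes abbreviated):

require "webmock/rspec"

# Stub the completions endpoint so specs exercise everything up to the
# HTTP call without hitting the live API.
stub_request(:post, "https://api.openai.com/v1/completions")
  .to_return(
    status: 200,
    body: { choices: [{ text: "Approximately 2,945 soccer fields" }] }.to_json,
    headers: { "Content-Type" => "application/json" }
  )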

mattlindsey commented 1 year ago

Thanks @technicalpickles. I'm going to try the method you used in open-feature to run our README examples with temperature=0. It will still have to be an optional script or spec, since it requires env variables, like you said.

When you say stubbing the response from the LLM, do you mean like below? Or recording responses with VCR for every example? Because the idea was to run everything against live services: https://github.com/andreibondarev/langchainrb/blob/9dd8add0703c8cc9f5d250ee7a3559f45053d7e3/spec/langchain/llm/openai_spec.rb#L68
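
My understanding of the VCR flavor is that it records the live response once and replays it on later runs, roughly (cassette name hypothetical):

# First run hits the live API and records a cassette; subsequent runs
# replay the recording instead of calling out.
VCR.use_cassette("openai_complete") do
  response = llm.complete(prompt: "Hello!")
  expect(response).not_to be_nil
end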

andreibondarev commented 1 year ago

@mattlindsey

I'm going to try implementing a simple https://cucumber.io/ test.

I don't see much value in using Cucumber. In the case of web apps, it brings a lot of value by abstracting the engineer away from "clicking" through the UI. It's also useful when QA engineers are primarily writing these tests, because it provides them a nice DSL.

We need to figure out whether we'd like these tests to run against real (non-mocked) services, with actual API keys/creds.

If yes -- then let's go with Jupyter notebooks. These would need to be run locally by a dev; we can't run them in CI because it costs $$$ to run.

If not -- then these tests/scripts should be in Rspec.

We have a pretty large testing matrix: think "num of vectorsearch DBs x num of LLMs", i.e. we're saying that any LLM in the project (that supports embed()) can generate embeddings for any vectorsearch DB.
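
A sketch of how that matrix could be enumerated (the class lists here are illustrative, not exhaustive):

# Illustrative lists; the real matrix covers every LLM with embed()
# and every vectorsearch DB in the project.
llms = [Langchain::LLM::OpenAI, Langchain::LLM::Cohere]
dbs  = [Langchain::Vectorsearch::Qdrant, Langchain::Vectorsearch::Weaviate]

llms.product(dbs).each do |llm_class, db_class|
  RSpec.describe "#{llm_class} embeddings into #{db_class}", :integration do
    it "stores and retrieves an embedding" do
      # ...
    end
  end
end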

@mattlindsey @technicalpickles Thoughts?

technicalpickles commented 1 year ago

When you say stubbing the response from the LLM, do you mean like below? Or recording responses with VCR for every example? Because the idea was to run everything against live services.

That is what I meant, yeah. I think we can still get some value out of having everything but the LLM response, since there are plenty of other moving parts.

If yes -- then let's go with Jupyter notebooks. These would need to be run locally by a dev; we can't run them in CI because it costs $$$ to run.

Since that is going to require providing an API key anyway, we may as well do it in plain Ruby. We could even have an RSpec tag to indicate that something uses the API, and have it automatically included/excluded depending on whether ENV['OPENAI_API_KEY'] is present.

describe Whatever, openai_integration: true do
  it "works" do
    # ...
  end
end

Then run:

$ rspec --tag openai_integration

To exclude by default, we can add --tag ~openai_integration to the .rspec file, which holds the default arguments.
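
And to make the inclusion automatic, keyed off the env var (a spec_helper.rb sketch):

RSpec.configure do |config|
  # Skip the API-backed specs unless a key is available.
  config.filter_run_excluding openai_integration: true unless ENV["OPENAI_API_KEY"]
end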

These would need to be run locally by a dev; we can't run them in CI because it costs $$$ to run.

That said, it makes me wonder if OpenAI has any policies for open source development?

OpenAI is also available on Azure, and Azure has open source credits we could apply for: https://opensource.microsoft.com/azure-credits/

mattlindsey commented 1 year ago

@andreibondarev Can Jupyter notebooks run Ruby? I'm thinking RSpec in a separate 'integration' directory, with the tags Josh described, sounds good.

Looks like Azure takes 3-4 weeks to reply if you want to request access to Azure OpenAI (https://learn.microsoft.com/en-us/azure/cognitive-services/openai/overview). But would that mean a new LLM class in langchainrb? I don't see any Ruby examples in the documentation, so I'm not sure.

technicalpickles commented 1 year ago

Can Jupyter notebooks run Ruby?

I saw it in the boxcars gem, which is in the same space as this gem: https://github.com/BoxcarsAI/boxcars/blob/main/notebooks/boxcars_examples.ipyn

mattlindsey commented 1 year ago

@technicalpickles I added a similar 'getting started' Jupyter notebook in #185, but it was somewhat difficult to get working and seems to give errors sometimes. Take a look if you want, but I don't want to waste your time!

mattlindsey commented 1 year ago

I did get a notebook working, but it's very picky and may not be worth the effort to maintain. I'll post it here just in case: https://gist.github.com/mattlindsey/5f6388d6ff76c2decdccb723bb4ed4c5#file-getting_started-ipynb