mpaepper / content-chatbot

Build a chatbot or Q&A bot of your website's content
https://www.paepper.com/blog/posts/build-q-and-a-bot-of-your-website-using-langchain/

How to get good results, including sources? #2

Status: Closed (janek closed this 1 year ago)

janek commented 1 year ago

Thanks for this project, it's very helpful and inspiring!

I've tried running it on your website, but the results aren't as good as I was hoping, especially when it comes to finding sources. I'm running the chatbot version and copy-pasted code from the earlier "Q&A" version to get sources from there.

Example 1 - Object detection

Question: How to detect objects in images?

Chatbot result: Object detection is a task that involves finding and labeling objects of interest in an image. A common approach is to use a model such as Faster R-CNN, which uses a selective search algorithm to find regions of interest in a bottom-up manner. Each region is then cropped and fed to a CNN backbone which extracts features. Finally, support vector machines are used to determine if the features are predictive of one of the classes. Non-maximum suppression is used to get rid of duplicate detections. There are also end-to-end approaches which directly output the final object predictions when presented with an input image. Data augmentation is often used to generate more data when there is a low amount of data available. DreamBooth is a tool that is trending in social media which is similar to textual inversion in the sense that you provide images of a concept and then afterwards you can use a special prompt token to generate new images of your concept.

Q&A result: Object detection approaches usually generate many object predictions candidates either by using a neural network or by using an algorithm. Object classification is done for each candidate and regression is used to refine the bounding box coordinates.

Sources: https://www.paepper.com/blog/posts/end-to-end-object-detection-with-transformers/ https://www.paepper.com/blog/posts/end-to-end-object-detection-with-transformers/

Problem:

Instead of getting two relevant links like you did, I'm getting one relevant link, but twice.


Example 2 - Hyperparameter tuning

Question: How to tune hyperparameters?

Chatbot result: The best methods for tuning hyperparameters in machine learning vary depending on the type of model and the data set. Generally, grid search, random search, and Bayesian optimization are popular methods for hyperparameter tuning. Grid search involves systematically exploring a range of hyperparameter values, while random search involves randomly sampling from a range of hyperparameter values. Bayesian optimization is a more sophisticated approach that uses a probabilistic model to identify the best hyperparameter values.

Q&A result: Hyperparameter tuning can be done using PyTorch Lightning and weights & biases.

Sources: https://www.paepper.com/blog/posts/hyperparameter-tuning-on-numerai-data-with-pytorch-lightning-and-wandb/

Followup question: Can I tune hyperparameters with fastai?

Chatbot result: Numerai is a machine learning competition platform that uses data science to predict stock market movements. It provides a data set of anonymized financial features and a tournament where data scientists can submit their models to compete for rewards. The goal is to create a model that can accurately predict the stock market movements. To do this, data scientists must use machine learning algorithms to optimize their models and tune the hyperparameters. Fastai is a deep learning library that can be used to quickly build and train models for the Numerai tournament. It also provides tools for hyperparameter tuning, such as the weights and biases library, which can be used to run sweeps to test different combinations of hyperparameters.

Q&A result: Yes, you can tune hyperparameters with fastai.

Sources: https://www.paepper.com/blog/posts/hyperparameter-tuning-on-numerai-data-with-fastai-and-wandb/

Problem:

The first question is clearly relevant to both posts, but only one of them is found. The follow-up question only surfaced the other post because it included an extra keyword.

mpaepper commented 1 year ago

Hi @janek

Glad you like the repo :)

The results for Q&A with sources depend a lot on which entries get pulled from the vector store.

For your result where you got the same source twice, it probably sampled two parts from the same blog entry and used some content from each part for the answer.
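If the repeated link is the main annoyance, the sources can also be de-duplicated after the chain returns. A minimal sketch, assuming the sources come back as one string of URLs separated by spaces or commas (as in the output above); `unique_sources` is a hypothetical helper, not part of this repo or of LangChain:

```python
import re


def unique_sources(sources: str) -> list[str]:
    """Split a sources string into URLs and drop repeats, keeping first-seen order."""
    seen = set()
    result = []
    # Assumption: URLs are separated by commas and/or whitespace.
    for url in re.split(r"[,\s]+", sources.strip()):
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


raw = (
    "https://www.paepper.com/blog/posts/end-to-end-object-detection-with-transformers/ "
    "https://www.paepper.com/blog/posts/end-to-end-object-detection-with-transformers/"
)
print(unique_sources(raw))
```

This only hides the duplicate. It does not bring in additional relevant posts.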

To get better results, you can increase the number of entries that get pulled from the vector store.

The default is k=4, but if you push that up to k=10, then your results should already be much better.

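from langchain import OpenAI
from langchain.chains import VectorDBQAWithSourcesChain
chain = VectorDBQAWithSourcesChain.from_llm(
    llm=OpenAI(temperature=0, verbose=True), vectorstore=store, k=10)  # k: entries pulled from the vector store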
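To see why a larger k helps, here is a toy illustration of top-k retrieval. It is not the repo's code: the two-dimensional vectors, the post names, and `top_k_sources` are made up for illustration, and a real vector store works on high-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k_sources(query, chunks, k):
    """Return the sources of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query, c["vec"]), reverse=True)
    return [c["source"] for c in ranked[:k]]


# Hypothetical data: "post-a" was split into two chunks that both match well.
chunks = [
    {"source": "post-a", "vec": [1.0, 0.0]},
    {"source": "post-a", "vec": [0.9, 0.1]},
    {"source": "post-b", "vec": [0.7, 0.3]},
    {"source": "post-c", "vec": [0.0, 1.0]},
]
query = [1.0, 0.05]

print(top_k_sources(query, chunks, k=2))  # both hits come from post-a
print(top_k_sources(query, chunks, k=3))  # post-b now makes it in as well
```

With a small k, the two best-matching chunks of a single post can take every slot, which gives the same link twice. A larger k leaves room for chunks from other posts.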

Let me know if that helps.