microsoft / azurechat

🤖 💼 Azure Chat Solution Accelerator powered by Azure Open AI Service
MIT License

How to train and bring my own data? #159

Open randyaldrich opened 1 year ago

randyaldrich commented 1 year ago

I managed to deploy the solution, but how do I train it and bring my own data? Is there a link to documentation?

baptistepa commented 1 year ago

Click the "file" option when creating a new chat and upload a file. I have only tested PDF files so far. After some time, the file is uploaded and indexed, allowing you to ask the model questions about it.

tsellie commented 1 year ago

On the Azure side of things, you would create a data source in Azure Cognitive Search (e.g., Blob Storage). Then you would create an indexer that extracts searchable content from the data source and populates a search index (i.e., azure-chat). However, the "chat over file" functionality seems to limit the context to the currently uploaded file.

@thivy - Is there a way to chat with data that has been indexed outside of the azurechat application as described in the scenario above?
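
For readers who have not set up the Azure side before, the data source and indexer described above can be sketched as two REST request bodies. This is a minimal sketch, not something taken from azurechat: the resource names (`enterprise-docs`, `documents`), the helper functions, and the connection string placeholder are all hypothetical, and the bodies would still need to be POSTed to the search service's `/datasources` and `/indexers` endpoints with an admin key.

```typescript
// Sketch only: builds request bodies for the Azure Cognitive Search REST API.
// All resource names and helpers below are hypothetical placeholders.

interface BlobDataSource {
  name: string;
  type: "azureblob";
  credentials: { connectionString: string };
  container: { name: string };
}

interface Indexer {
  name: string;
  dataSourceName: string;
  targetIndexName: string;
}

// Body for POST {service}/datasources?api-version=...
function blobDataSource(
  name: string,
  connectionString: string,
  container: string
): BlobDataSource {
  return {
    name,
    type: "azureblob",
    credentials: { connectionString },
    container: { name: container },
  };
}

// Body for POST {service}/indexers?api-version=...
// The indexer reads from the data source and writes into the target index.
function indexerFor(dataSource: BlobDataSource, targetIndexName: string): Indexer {
  return {
    name: `${dataSource.name}-indexer`,
    dataSourceName: dataSource.name,
    targetIndexName,
  };
}

const dataSource = blobDataSource(
  "enterprise-docs",
  "<storage-connection-string>",
  "documents"
);
const indexer = indexerFor(dataSource, "azure-chat");
console.log(JSON.stringify({ dataSource, indexer }, null, 2));
```

Note that populating the index this way does not by itself make the documents visible to the "chat over file" feature, for the reason raised in the question above.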

craigofnz commented 1 year ago

AFAIK, pre-provisioned data is currently better supported in azure-search-openai-demo. A clever merge of these features would be interesting, especially with UX elements to make it clear which data sources are in use for the chat session.

riosengineer commented 1 year ago

This matches my experience with the scenario tsellie describes above. Context is limited to the directly uploaded file; from my testing, the app cannot use indexes or content outside of it. For example, I have Blob Storage indexed on the search service, but the bot has no context from it.

Initially the docs made me think that ingesting your own data was plug and play. That was a misunderstanding on my part, although the docs do somewhat hint at it; for those new to this world, that is how it came across.

tsellie commented 11 months ago

Just to keep this conversation going, I realized that in chat-api-data.ts there is a filter that limits relevant documents to the current user and chat thread:

// Restricts vector search results to the current user and the current chat thread.
const findRelevantDocuments = async (query: string, chatThreadId: string) => {
  const relevantDocuments = await similaritySearchVectorWithScore(query, 10, {
    filter: `user eq '${await userHashedId()}' and chatThreadId eq '${chatThreadId}'`,
  });

  return relevantDocuments;
};

If you remove this filter, prompts within the chat-with-file feature will find and reference documents uploaded by the same user in other chat threads, and even documents uploaded by other users. Of course, this is not the intended behavior of the feature, and it is problematic for chatting with enterprise data because documents are deleted when their chat threads are deleted. Perhaps if we index documents outside of the app and customize it by removing the file upload capability and this filter, functionality would be on par with azure-search-openai-demo. Aside from this capability, I find azurechat performs better and, more importantly for my use case, it supports Azure Government endpoints.
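
Rather than deleting the filter outright, one option is to make its scope explicit. The following is a minimal sketch under stated assumptions: `Scope`, `odataString`, and `buildDocumentFilter` are hypothetical names that do not exist in azurechat, and only the `user` and `chatThreadId` field names come from the snippet above. It also doubles single quotes, which is how OData string literals escape them.

```typescript
// Hypothetical helper: builds the OData filter for a chosen scope.
// "thread" reproduces the existing behavior; "all" applies no filter.
type Scope = "thread" | "user" | "all";

// OData string literals escape a single quote by doubling it.
const odataString = (value: string): string =>
  `'${value.replace(/'/g, "''")}'`;

function buildDocumentFilter(
  scope: Scope,
  userId: string,
  chatThreadId: string
): string | undefined {
  switch (scope) {
    case "thread":
      return `user eq ${odataString(userId)} and chatThreadId eq ${odataString(chatThreadId)}`;
    case "user":
      // Every thread of the same user, still isolated from other users.
      return `user eq ${odataString(userId)}`;
    case "all":
      // No filter: every document in the index is a candidate.
      return undefined;
  }
}

console.log(buildDocumentFilter("thread", "abc123", "thread-1"));
console.log(buildDocumentFilter("user", "abc123", "thread-1"));
console.log(buildDocumentFilter("all", "abc123", "thread-1"));
```

Keeping "thread" as the default would preserve the current isolation between users, while a wider scope could be reserved for documents that are meant to be shared.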

tsellie commented 11 months ago

This isn't very elegant, but I pulled together a quick proof of concept that adds a third option, named "Enterprise," to the chat type selector. Relevant documents within this chat type are filtered to those uploaded by dev@localhost.

coding-totoro commented 11 months ago

Quick question, I have an index + blob setup, and the app pointed at the index in app config. Would removing the filter allow the app to use the data in that index?

The idea is basic: I don't really care about users chatting with documents. I'd much rather bring FAQs, enterprise data, etc. into the index via blob and just point our chat app at the index.

tsellie commented 11 months ago

In theory, yes. At least that's what I'm aiming for as well. You will likely need to have documents indexed in the same manner that this application does via its API (same field names, field types, etc.). I haven't been able to attempt this yet.
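
To illustrate what "same field names, field types" could mean in practice, here is a minimal sketch that maps an externally produced record onto the shape the app's index is assumed to use. Only `user` and `chatThreadId` are confirmed by the filter quoted earlier in this thread; `id`, `pageContent`, `metadata`, and `embedding` are assumptions that should be checked against the actual index definition, and `ExternalRecord` and `toChatIndexDocument` are hypothetical names.

```typescript
// Assumed index document shape. Only `user` and `chatThreadId` are
// confirmed by the filter in this thread; verify the rest against the
// real index definition before relying on it.
interface ChatIndexDocument {
  id: string;
  pageContent: string;
  user: string;
  chatThreadId: string;
  metadata: string;
  embedding: number[];
}

// Hypothetical shape of a record produced by an external ingestion job.
interface ExternalRecord {
  id: string;
  text: string;
  source: string;
  vector: number[];
}

function toChatIndexDocument(
  record: ExternalRecord,
  owner: string,
  chatThreadId: string,
  dimensions: number
): ChatIndexDocument {
  // The vector length must match the dimensions of the index's vector field.
  if (record.vector.length !== dimensions) {
    throw new Error(
      `Expected ${dimensions} embedding dimensions, got ${record.vector.length}`
    );
  }
  return {
    id: record.id,
    pageContent: record.text,
    user: owner,
    chatThreadId,
    metadata: record.source,
    embedding: record.vector,
  };
}

const doc = toChatIndexDocument(
  { id: "faq-001", text: "How do I reset my password?", source: "faq.pdf", vector: [0.1, 0.2, 0.3] },
  "shared-owner",
  "enterprise",
  3
);
console.log(doc);
```

The three-element vector is only for illustration; a real embedding has as many dimensions as the embedding model produces.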

bwitzig-zen commented 9 months ago

Anyone have any luck making modifications to this to allow for a good "trained" data search?

jimmylevell commented 9 months ago

I would recommend checking out this RAG example implementation from Microsoft: https://github.com/Azure-Samples/azure-search-openai-demo

It lets you index custom data sources. The index can then be queried through a ChatGPT-powered interface.

marcelo-cloudinha commented 7 months ago

Any updates on how to connect to data sources such as FAQs, enterprise data, etc.?

data-analytics-copilot commented 7 months ago

What is everyone doing for solutions with PDFs that have both text and images? Does Microsoft Azure have a solution for images such as graphs in PDFs?

itmilos commented 7 months ago

Look at https://github.com/Azure-Samples/azure-search-openai-demo; it supports HTML and PDF.

data-analytics-copilot commented 7 months ago

Thank you, I actually do already have this solution implemented with PDFs but it only reads the text in a PDF. I would like to expand this to read text and graphs in PDFs. Any other solutions?

itmilos commented 7 months ago

https://github.com/Azure-Samples/azure-search-openai-demo uses https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence to extract data for embeddings. In AI Document Intelligence, you can use a custom extraction model to get data from graphs.

orngeatom commented 5 months ago

The solution I applied is to use Prompt flow, building on Azure OpenAI Service. Essentially: use Azure AI Search to hold the RAG embeddings, deploy the prompt flow to an ML endpoint, and update this app, which requires a lot of customization. The ability to make API calls to Azure OpenAI and Azure endpoints is a feature that should be included; then you can chat with your own data.
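
As a rough illustration of the "update this app" step, the call from the app to a deployed prompt flow endpoint might be shaped like the sketch below. Everything here is an assumption rather than part of azurechat: the scoring URL and key are placeholders, `buildPromptFlowRequest` is a hypothetical helper, and the `question` / `chat_history` input names depend entirely on how the flow's inputs were defined.

```typescript
// One earlier exchange, in the shape the flow is assumed to accept.
interface ChatTurn {
  inputs: { question: string };
  outputs: { answer: string };
}

interface EndpointRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds (but does not send) the HTTP request.
function buildPromptFlowRequest(
  scoringUrl: string,
  apiKey: string,
  question: string,
  history: ChatTurn[]
): EndpointRequest {
  return {
    url: scoringUrl,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Key-based auth on the endpoint is passed as a bearer token.
      Authorization: `Bearer ${apiKey}`,
    },
    // Input names must match the inputs defined in the deployed flow.
    body: JSON.stringify({ question, chat_history: history }),
  };
}

const request = buildPromptFlowRequest(
  "https://<endpoint-name>.<region>.inference.ml.azure.com/score",
  "<endpoint-key>",
  "What is our travel policy?",
  []
);
console.log(request.headers, request.body);
```

The resulting object can be passed to `fetch(request.url, request)` from server-side code so that the endpoint key never reaches the browser.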