huggingface / chat-ui

Open source codebase powering the HuggingChat app
https://huggingface.co/chat
Apache License 2.0

PDF Support #1505

Open ArthurGoupil opened 1 week ago

ArthurGoupil commented 1 week ago

Hello!

I'm exploring the implementation of PDF upload support. The idea is to set the needed global config and then implement it in Vertex (this would be doable on other endpoints as well).

From my understanding, what I have to do:

I would be happy to take your suggestions/corrections.

nsarrazin commented 6 days ago

This is a great idea! Do you plan to upload PDFs directly to the Vertex API? I think it supports that, but other endpoints don't.

We'll need to handle parsing/chunking/retrieval for other endpoints, but I think that might be out of scope for what you wanted to implement with Vertex. :)

I think the steps you outlined should be good! Maybe `documentParsing` is a better name than `enablePdf` if we can also support CSV, Markdown, HTML and other document formats? Not sure if Vertex supports them.
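
As a rough illustration of the chunking step mentioned above for endpoints that cannot take PDFs directly, here is a minimal sketch that splits already-extracted text into overlapping chunks. The name `chunkText` and the default size and overlap are assumptions made for illustration; nothing here comes from chat-ui itself.

```typescript
// Hypothetical helper: split extracted document text into overlapping chunks.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('chunkSize must be positive and overlap must be in [0, chunkSize)');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

const chunks = chunkText('abcdefghij', 4, 1);
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text.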

bpawnzZ commented 5 days ago

Here is what I'm working on implementing, with the plan all laid out. I pretty much got it right; there are just a couple of UI issues. I too think this project majorly deserves an upgrade.

Implementing KV Token Caching

Step-by-step Instructions

To add KV (Key-Value) token caching to the chat-ui project, follow these steps:

## 1. Create `tokenCache.ts`

a. Create a new file called `tokenCache.ts` in the `src/utils` directory.

b. Implement a simple key-value cache using JavaScript's Map object:

```typescript
export class TokenCache {
  private cache: Map<string, string>;
  private maxSize: number;

  constructor(maxSize: number = 1000) {
    this.cache = new Map();
    this.maxSize = maxSize;
  }

  set(key: string, value: string): void {
    // Evict only when inserting a new key; updating an existing key does not grow the map.
    if (!this.cache.has(key) && this.cache.size >= this.maxSize) {
      // A Map iterates in insertion order, so the first key is the oldest entry.
      const oldestKey = this.cache.keys().next().value;
      if (oldestKey !== undefined) {
        this.cache.delete(oldestKey);
      }
    }
    this.cache.set(key, value);
  }

  get(key: string): string | undefined {
    return this.cache.get(key);
  }

  has(key: string): boolean {
    return this.cache.has(key);
  }
}
```
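
The cache above evicts in insertion order, so an entry that is read constantly can still be dropped once enough new keys arrive. If least-recently-used eviction is preferred, one possible variant re-inserts a key on every read so that it moves to the end of the Map's iteration order. `LruTokenCache` is a hypothetical name for this sketch and is not part of chat-ui.

```typescript
class LruTokenCache {
  private cache: Map<string, string> = new Map();
  private maxSize: number;

  constructor(maxSize: number = 1000) {
    this.maxSize = maxSize;
  }

  get(key: string): string | undefined {
    const value = this.cache.get(key);
    if (value !== undefined) {
      // Re-insert so this key becomes the most recently used.
      this.cache.delete(key);
      this.cache.set(key, value);
    }
    return value;
  }

  set(key: string, value: string): void {
    if (this.cache.has(key)) {
      // Updating also counts as a use: move the key to the end.
      this.cache.delete(key);
    } else if (this.cache.size >= this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldestKey = this.cache.keys().next().value;
      if (oldestKey !== undefined) this.cache.delete(oldestKey);
    }
    this.cache.set(key, value);
  }

  has(key: string): boolean {
    return this.cache.has(key);
  }
}

const lru = new LruTokenCache(2);
lru.set('a', '1');
lru.set('b', '2');
lru.get('a');      // 'a' is now the most recently used
lru.set('c', '3'); // evicts 'b', the least recently used
```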

c. Modify the `src/lib/tokenizer.ts` file to use the `TokenCache`:

```typescript
import { TokenCache } from '../utils/tokenCache';

const tokenCache = new TokenCache();

export async function tokenize(text: string): Promise<string[]> {
  const cached = tokenCache.get(text);
  if (cached !== undefined) {
    // Tokens are cached as JSON so that tokens containing spaces survive the round trip.
    return JSON.parse(cached) as string[];
  }

  // Existing tokenization logic...
  const tokens = /* ... */;

  tokenCache.set(text, JSON.stringify(tokens));
  return tokens;
}
```
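
One gap in the check-then-compute pattern above is that two concurrent calls with the same text can both miss the cache and both run the tokenizer. A possible way around this, sketched here on the assumption that results may be held in memory as promises, is to cache the promise itself. `memoizeAsync` and `fakeTokenize` are hypothetical names, and the whitespace split only stands in for the real tokenizer.

```typescript
function memoizeAsync<T>(fn: (text: string) => Promise<T>): (text: string) => Promise<T> {
  const inFlight = new Map<string, Promise<T>>();
  return (text: string) => {
    let promise = inFlight.get(text);
    if (promise === undefined) {
      promise = fn(text);
      // Forget failed calls so that a later call can retry.
      promise.catch(() => inFlight.delete(text));
      inFlight.set(text, promise);
    }
    return promise;
  };
}

let tokenizeCalls = 0;
async function fakeTokenize(text: string): Promise<string[]> {
  tokenizeCalls += 1;
  return text.split(/\s+/).filter((t) => t.length > 0);
}

const cachedTokenize = memoizeAsync(fakeTokenize);
const first = cachedTokenize('hello world');
const second = cachedTokenize('hello world'); // reuses the first promise
```

Note that this sketch has no size limit; in practice it would need the same kind of eviction as `TokenCache`.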

Implementing Milvus Vector Database Connection and RAG Feature

Step-by-step Instructions

a. Install required dependencies:

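```bash
npm install @zilliz/milvus2-sdk-node @huggingface/transformers
```

b. Create a new file src/lib/milvusClient.ts:

```typescript
import { MilvusClient } from '@zilliz/milvus2-sdk-node';

const milvusClient = new MilvusClient('http://100.71.229.63:19530');

export default milvusClient;
```

c. Create a new file src/lib/embeddingModel.ts:

```typescript
import { pipeline } from '@huggingface/transformers';

const embeddingModel = await pipeline('feature-extraction', 'Xenova/gte-small');

export async function getEmbedding(text: string): Promise<number[]> {
  // Mean-pool the per-token vectors and L2-normalize to get one embedding per text.
  const result = await embeddingModel(text, { pooling: 'mean', normalize: true });
  // `result` is a tensor; copy its flat data into a plain number array.
  return Array.from(result.data as Float32Array);
}
```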
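
A feature-extraction model typically produces one vector per token, whereas `getEmbedding` is meant to return a single vector per text. A common way to bridge the two is to average the token vectors and then scale the result to unit length. The sketch below shows that arithmetic on plain arrays; `meanPool` and `l2Normalize` are hypothetical helper names, and the two-dimensional vectors are toy data.

```typescript
function meanPool(tokenVectors: number[][]): number[] {
  if (tokenVectors.length === 0) throw new Error('need at least one token vector');
  const dim = tokenVectors[0].length;
  const sum = new Array<number>(dim).fill(0);
  for (const vec of tokenVectors) {
    for (let i = 0; i < dim; i++) sum[i] += vec[i];
  }
  // Average each dimension over all tokens.
  return sum.map((x) => x / tokenVectors.length);
}

function l2Normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((acc, x) => acc + x * x, 0));
  // A zero vector cannot be scaled to unit length; return a copy unchanged.
  return norm === 0 ? vec.slice() : vec.map((x) => x / norm);
}

// Two token vectors average to [3, 4], which has length 5.
const sentenceVector = l2Normalize(meanPool([[3, 0], [3, 8]]));
```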

d. Create a new file src/lib/ragSearch.ts:

```typescript
import milvusClient from './milvusClient';
import { getEmbedding } from './embeddingModel';

const COLLECTION_NAME = 'documents';

export async function searchSimilarDocuments(query: string, topK: number = 5): Promise<string[]> {
  const queryEmbedding = await getEmbedding(query);

  const searchResults = await milvusClient.search({
    collection_name: COLLECTION_NAME,
    vector: queryEmbedding,
    limit: topK,
    output_fields: ['content'],
  });

  // The SDK response wraps the hits in a `results` array.
  return searchResults.results.map((result) => String(result.content));
}
```
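
To make explicit what the vector search above is expected to return, here is a brute-force in-memory equivalent: score every stored vector against the query and keep the `topK` best. It assumes a cosine metric, which may differ from how the Milvus collection is actually configured, and `StoredDoc`, `cosineSimilarity` and `topKSimilar` are hypothetical names.

```typescript
interface StoredDoc {
  content: string;
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as dissimilar to everything.
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function topKSimilar(query: number[], docs: StoredDoc[], topK = 5): string[] {
  return docs
    .map((doc) => ({ content: doc.content, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, topK)
    .map((hit) => hit.content);
}

const hits = topKSimilar(
  [1, 0],
  [
    { content: 'orthogonal', vector: [0, 1] },
    { content: 'same direction', vector: [2, 0] },
    { content: 'close', vector: [1, 1] },
  ],
  2,
);
```

A vector database replaces the full scan with an index, which is what makes the same query practical over large collections.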

e. Modify src/components/Chat.tsx to add the RAG toggle and functionality:

```tsx
import { useState, type FormEvent } from 'react';
import { searchSimilarDocuments } from '../lib/ragSearch';

export function Chat() {
  const [isRagEnabled, setIsRagEnabled] = useState(false);

  //... other existing state variables

  const handleSubmit = async (e: FormEvent) => {
    e.preventDefault();
    //... existing code

    // Build the prompt in a local variable; `userInput` (existing state) must not be reassigned.
    let prompt = userInput;
    if (isRagEnabled) {
      const relevantDocs = await searchSimilarDocuments(userInput);
      const context = relevantDocs.join('\n\n');
      prompt = `Context:${context}\n\nUser question:${userInput}`;
    }

    //... rest of the existing handleSubmit logic, sending `prompt`
  };

  return (
    <div>
      {/* ... existing chat UI components */}
      <div>
        <label>
          <input type="checkbox" checked={isRagEnabled} onChange={(e) => setIsRagEnabled(e.target.checked)} />
          RAG on/off
        </label>
      </div>
      {/* ... rest of the existing chat UI components */}
    </div>
  );
}
```
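
Joining every retrieved document into the prompt can exceed what the model accepts. One way to guard against that, sketched here with a simple character budget, is to add documents in ranked order until the budget is used up. `buildRagPrompt` and `maxContextChars` are hypothetical names, and a real limit would more likely be measured in tokens than in characters.

```typescript
function buildRagPrompt(docs: string[], question: string, maxContextChars = 2000): string {
  const kept: string[] = [];
  let used = 0; // counts document characters only, not the separators
  for (const doc of docs) {
    // Stop before the context would exceed the budget.
    if (used + doc.length > maxContextChars) break;
    kept.push(doc);
    used += doc.length;
  }
  if (kept.length === 0) return question; // nothing fits: fall back to the plain question
  return `Context:${kept.join('\n\n')}\n\nUser question:${question}`;
}

// With a budget of 8 characters only the first two documents fit.
const ragPrompt = buildRagPrompt(['aaaa', 'bbbb', 'cccc'], 'What is a?', 8);
```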

f. Update src/pages/api/chat.js to handle RAG-enhanced input:

```javascript
import { searchSimilarDocuments } from '../../lib/ragSearch';

export default async function handler(req, res) {
  let userMessage = req.body.messages[req.body.messages.length - 1].content;

  if (userMessage.startsWith('Context:')) {
    // RAG is enabled, process input accordingly
    const [context, question] = userMessage.split('User question:');

    // Use context and question in your chat logic
    //...
  } else {
    // Regular chat processing
    //...
  }

  //...rest of existing handler logic
}
```
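
Detecting RAG input by the `Context:` prefix means an ordinary message that happens to start with that word is treated as RAG input, and a retrieved document containing `User question:` breaks the split. One alternative, sketched here as an assumption rather than as how chat-ui handles messages, is to send the retrieved documents in a separate field and assemble the prompt on the server. `ChatRequestMessage`, `ragContext` and `toModelPrompt` are hypothetical names.

```typescript
interface ChatRequestMessage {
  content: string; // the user's question, unchanged
  ragContext?: string[]; // hypothetical optional field carrying retrieved documents
}

function toModelPrompt(message: ChatRequestMessage): string {
  if (message.ragContext === undefined || message.ragContext.length === 0) {
    return message.content; // regular chat processing
  }
  return `Context:${message.ragContext.join('\n\n')}\n\nUser question:${message.content}`;
}

// A plain message that starts with "Context:" is no longer mistaken for RAG input.
const plainPrompt = toModelPrompt({ content: 'Context: what does that word mean?' });
const enrichedPrompt = toModelPrompt({ content: 'What is a?', ragContext: ['doc one', 'doc two'] });
```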

Adding A Paperclip Icon for Database Uploads and Status Notifications

Steps To Follow

a. Install required icon library:

```bash
npm install react-icons
```

b. Modify the src/components/Chat.js component to include a paperclip icon and upload functionality:

```jsx
import { useState } from 'react';
import { searchSimilarDocuments } from '../lib/ragSearch';
import { FaPaperclip } from 'react-icons/fa';
import { addDocumentToDatabase } from '../lib/databaseUpload';

export function Chat() {
  // ...

  return (
    <div>
      {/* existing chat UI components */}

      <div className="flex items-center space-x-2">
        <label className="cursor-pointer">
          <input type="checkbox" checked={isRagEnabled} onChange={(e) => setIsRagEnabled(e.target.checked)} />
          RAG on/off
        </label>
        <label className="cursor-pointer">
          <FaPaperclip className="text-gray-500 hover:text-gray-700" />
          <input type="file" className="hidden" onChange={handleFileUpload} accept=".txt,.pdf,.doc,.docx" />
        </label>
      </div>

      {/* Upload status notifications */}
      {uploadStatus && <div className="mt-2 text-sm text-gray-600">{uploadStatus}</div>}

      {/* rest of existing chat UI components */}
    </div>
  );
}

// Additional Code Here ...
```
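
The `accept` attribute on the file input only hints at which files the picker should offer, so the upload handler may still receive other file types. A small check like the following could run before a file is sent to the database. `isAcceptedFile` and `ACCEPTED_EXTENSIONS` are hypothetical names, and checking the extension says nothing about the actual file contents.

```typescript
const ACCEPTED_EXTENSIONS = ['.txt', '.pdf', '.doc', '.docx'];

function isAcceptedFile(fileName: string, accepted: string[] = ACCEPTED_EXTENSIONS): boolean {
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) return false; // no extension, or a dotfile such as ".pdf"
  // Compare the final extension case-insensitively.
  return accepted.includes(fileName.slice(dot).toLowerCase());
}
```

For example, `isAcceptedFile('report.PDF')` passes, while `isAcceptedFile('notes.md')` and `isAcceptedFile('README')` are rejected.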

c. Create a new file `databaseUpload` at src/lib/databaseUpload:

```typescript
import milvusClient from './milvusClient';
import {
  getEmbedding

...
```

Ensure Collection Exists Functionality...