mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4
14.73k stars, 3k forks

Is it possible to manually pick or exclude certain library books? #415

Closed YIN-Renlong closed 6 months ago

YIN-Renlong commented 9 months ago

First of all, thank you so much for sharing this groundbreaking project! I have a library of between 50 and 100 books, but occasionally I need to search only a few of them, so I have to omit some volumes. I could remove the unneeded books from the /docs folder, but re-ingesting would mean vectorizing the whole library again, which takes a long time. Is it feasible to manually exclude some books from a large collection while keeping the others? Once more, many thanks!

dosubot[bot] commented 9 months ago

🤖

Hello,

I'm glad to hear that you find the gpt4-pdf-chatbot-langchain project useful!

Yes, it is indeed possible to manually select or exclude certain books from the repository without having to re-ingest the entire library. You can achieve this by modifying the DirectoryLoader instance to include a filter function that checks the file name or path before loading it.

Here's an example of how you can modify the DirectoryLoader to exclude certain files:

const directoryLoader = new DirectoryLoader(filePath, {
  '.pdf': (path) => {
    // Exclude certain files
    if (path.includes('exclude_this_file.pdf')) {
      return null;
    }
    return new PDFLoader(path);
  },
});

In this example, the DirectoryLoader will skip any files that include 'exclude_this_file.pdf' in their path. You can modify this condition to suit your needs. For example, you could check if the path matches a list of files to exclude, or if it doesn't match a list of files to include.

You can find this in the scripts/ingest-data.ts file in the repository.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Best regards, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

YIN-Renlong commented 9 months ago

Thank you, bot. I want to choose or exclude books manually through the browser interface (http://localhost:3000/), so your technique is not practical for my situation. In other words, all books would be ingested and stored in the vector database, and only some of them would be withheld from the search during interactive queries in the browser. Are there any other solutions?
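The request above amounts to a query-time filter: everything stays in the vector database, and only the chunks from the selected books reach the model. The thread does not show how this repository stores or retrieves chunks, so the following is only a minimal sketch under stated assumptions. The `RetrievedChunk` shape, the `source` metadata field holding the originating PDF path, and the `filterBySelectedBooks` helper are all hypothetical names used for illustration.

```typescript
// Hypothetical shape: the thread does not show how chunks are stored.
interface RetrievedChunk {
  pageContent: string;
  metadata: { source: string }; // assumed to hold the originating PDF path
}

// Keep only chunks whose source file is among the books selected in the UI.
function filterBySelectedBooks(
  chunks: RetrievedChunk[],
  selectedSources: Set<string>
): RetrievedChunk[] {
  return chunks.filter((chunk) => selectedSources.has(chunk.metadata.source));
}

// Example data standing in for the results of a similarity search.
const retrievedChunks: RetrievedChunk[] = [
  { pageContent: 'Text from book A', metadata: { source: 'docs/book_a.pdf' } },
  { pageContent: 'Text from book B', metadata: { source: 'docs/book_b.pdf' } },
  { pageContent: 'Text from book C', metadata: { source: 'docs/book_c.pdf' } },
];

// The set of books the user ticked in the browser (hypothetical).
const selectedBooks = new Set(['docs/book_a.pdf', 'docs/book_c.pdf']);

// Keeps the chunks from book_a.pdf and book_c.pdf, drops book_b.pdf.
const visibleChunks = filterBySelectedBooks(retrievedChunks, selectedBooks);
console.log(visibleChunks.map((chunk) => chunk.metadata.source));
```

Filtering after retrieval can leave fewer results than were requested. If the vector store in use supports metadata filters in the query itself, pushing the same condition into the query would likely be the better fit.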

dosubot[bot] commented 9 months ago

🤖

Hello,

Based on your requirements, you can modify the DirectoryLoader instance to include a filter function that checks the file name or path against a list of files to exclude or include. Here's how you can do it:

To exclude a list of files, you can create an array of file names to exclude and then check if the current file path includes any of these names:

const excludeFiles = ['exclude_this_file.pdf', 'another_file_to_exclude.pdf', 'yet_another_file.pdf'];

const directoryLoader = new DirectoryLoader(filePath, {
  '.pdf': (path) => {
    // Exclude certain files
    for (let i = 0; i < excludeFiles.length; i++) {
      if (path.includes(excludeFiles[i])) {
        return null;
      }
    }
    return new PDFLoader(path);
  },
});

To include only a list of files, you can create an array of file names to include and then check if the current file path includes any of these names:

const includeFiles = ['include_this_file.pdf', 'another_file_to_include.pdf', 'yet_another_file.pdf'];

const directoryLoader = new DirectoryLoader(filePath, {
  '.pdf': (path) => {
    // Include only certain files
    let includeFile = false;
    for (let i = 0; i < includeFiles.length; i++) {
      if (path.includes(includeFiles[i])) {
        includeFile = true;
        break;
      }
    }
    if (!includeFile) {
      return null;
    }
    return new PDFLoader(path);
  },
});

In both examples, the DirectoryLoader will skip any files that do not meet the condition set in the filter function.

Please note that the path.includes() function checks if the file path contains the specified string anywhere in the path, not just the file name. If you have files with similar names in different directories, you may want to use a more specific method to match the file names.
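The "more specific method" mentioned above could be an exact comparison on the file name rather than a substring test on the whole path. The sketch below is one way to do that, not the repository's code: `baseName`, `SelectionRules`, and `shouldIngest` are hypothetical helpers. The thread also does not confirm how DirectoryLoader treats a loader function that returns null, so the decision is kept in a plain function that can be called wherever the list of files is built.

```typescript
// Hypothetical helpers, not part of the repository.

// Return the last path segment, handling both "/" and "\" separators.
function baseName(filePath: string): string {
  const parts = filePath.split(/[\\/]/);
  return parts[parts.length - 1];
}

interface SelectionRules {
  include?: string[]; // if non-empty, only these exact file names pass
  exclude?: string[]; // these exact file names never pass
}

// Decide whether a file should be ingested, matching on the exact file name.
function shouldIngest(filePath: string, rules: SelectionRules): boolean {
  const name = baseName(filePath);
  const exclude = new Set(rules.exclude ?? []);
  const include = new Set(rules.include ?? []);
  if (exclude.has(name)) return false; // exclusion wins over inclusion
  if (include.size > 0) return include.has(name);
  return true;
}

// A substring test for "notes.pdf" would also drop "my_notes.pdf";
// exact matching on the file name keeps it.
const exampleRules: SelectionRules = { exclude: ['notes.pdf'] };
const candidatePaths = [
  'docs/notes.pdf',
  'docs/my_notes.pdf',
  'docs/archive/book.pdf',
];
const pathsToIngest = candidatePaths.filter((p) => shouldIngest(p, exampleRules));
console.log(pathsToIngest);
```

Matching here is case-sensitive; lowercasing both sides before the comparison would be a reasonable change on case-insensitive file systems.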

I hope this helps! Let me know if you have any other questions.

dosubot[bot] commented 6 months ago

Hi, @YIN-Renlong,

I'm helping the gpt4-pdf-chatbot-langchain team manage their backlog and am marking this issue as stale. From what I understand, you were seeking a way to manually exclude specific books from a large library collection when using the project's tools. Dosu-bot provided a solution involving modifying the DirectoryLoader instance to include a filter function that checks the file name or path before loading it. However, you clarified that you want to perform the exclusion through the browser interface, making the suggested technique impractical for your situation. Dosu-bot then provided further guidance on how to exclude or include specific files by modifying the DirectoryLoader instance.

Could you please confirm if this issue is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository? If it is, please let the gpt4-pdf-chatbot-langchain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you!