mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4
14.84k stars, 3.01k forks

Does it work in other languages than english? #123

Closed luisrock closed 11 months ago

luisrock commented 1 year ago

I am trying it with a PDF in Portuguese and the results are awful. I've updated the prompt, but still...

yinshipeng commented 1 year ago

I ran into the same problem. I am trying to use a Chinese document, but the results are not very good.

stefansms commented 1 year ago

Can you describe its behavior?

I translated QA_PROMPT and CONDENSE_PROMPT to Portuguese, ingested some Portuguese PDFs, and it's working as expected, answering in Portuguese.

However, it fails to connect one question to the next, as if it stays stuck on the initial results and doesn't "refresh" them. I don't think this is related to the translation.

luisrock commented 1 year ago

Well, my PDF has 200 pages. That must be a problem, right? The context (sources) is not at all related to the question. So I guess the problem is in the search for the right context to inject into the prompt.

lucastzuka commented 1 year ago

I tried hard to make it work in Portuguese, but it didn't work :/ If you manage to make it work, please let me know

stefansms commented 1 year ago

@luisrock This may be true, but this limitation should not force the template to respond in English. In my case, I am using gpt3.5-turbo as the model and have provided more than 2000 pages of PDFs in Portuguese.

@lucastzuka I've just translated the prompt, as mentioned earlier. I also provided a lot of text in Portuguese.

luisrock commented 1 year ago

Again, the context injected is totally wrong. That is the main reason, I guess.

lucastzuka commented 1 year ago

@luisrock Maybe check whether the PDFs you are using are protected. I had some context problems in the chat, and by asking questions specific to each document I noticed that one of the 3 PDFs I had loaded was protected. Another possible cause I've seen is when the text pages of the PDF have been converted into images.
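One rough way to spot the two cases mentioned above (a protected PDF, or pages stored as images) is to look at how much text was actually extracted from each page before ingesting. The sketch below is only an illustration: `findSuspectPages`, its `pageTexts` input, and the `minChars` threshold are hypothetical names and values, not part of the repository, and it assumes you already have the extracted text of each page as a string.

```typescript
// Hypothetical helper, not part of the repository.
// `pageTexts` is assumed to hold the text extracted from each PDF page, in order.
function findSuspectPages(pageTexts: string[], minChars: number = 20): number[] {
  const suspects: number[] = [];
  pageTexts.forEach((text, index) => {
    // Count only non-whitespace characters so blank padding is ignored.
    const visibleChars = text.replace(/\s+/g, "").length;
    if (visibleChars < minChars) {
      suspects.push(index + 1); // report 1-based page numbers
    }
  });
  return suspects;
}
```

Pages flagged this way are likely to contribute little or no usable context, so they are worth opening by hand (or running through OCR) before blaming the language of the document.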

@stefansms I translated the QA_PROMPT to Portuguese and added a line saying to give the answers in Portuguese. So it worked :) even when loading only documents in English. thanks s2
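As a minimal sketch of the approach described above (adding an explicit line about the answer language to the QA prompt), something like the following could be used. The wording is illustrative and is not the repository's actual QA_PROMPT; `buildQaPrompt` and `QA_PROMPT_PT` are hypothetical names, and the `{context}` and `{question}` placeholders are assumed to match whatever the prompt template in your setup expects.

```typescript
// Illustrative only: this is not the repository's actual QA_PROMPT text.
// {context} and {question} are assumed placeholder names, filled in later
// by whatever prompt-templating code the application uses.
function buildQaPrompt(answerLanguage: string): string {
  return [
    "You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.",
    "If you don't know the answer, say that you don't know. Do not make up an answer.",
    // The extra line that pins the answer language:
    `Always answer in ${answerLanguage}, even if the context is written in another language.`,
    "",
    "{context}",
    "",
    "Question: {question}",
    "Helpful answer:",
  ].join("\n");
}

// Hypothetical Portuguese variant of the prompt.
const QA_PROMPT_PT: string = buildQaPrompt("Portuguese");
```

Keeping the language as a parameter makes it easy to try the same prompt for Chinese or Turkish documents without maintaining separate translated copies.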

ahgsql commented 1 year ago

Same problem for Turkish: the embedded context is not related to the question most of the time.

thiagopachecoit commented 1 year ago

I think this pull request may help: https://github.com/mayooear/gpt4-pdf-chatbot-langchain/pull/77

ahgsql commented 1 year ago

It's not about the answers from GPT; the problem is finding the related chunks.
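To check whether retrieval (rather than the model's answer) is the weak point, one option is to score the chunks against a question by hand and read the top results. The sketch below ranks chunks by cosine similarity; `cosineSimilarity`, `rankChunks`, and the data shapes are hypothetical and independent of any vector store, and the embedding vectors are assumed to come from whatever embedding model the app already uses.

```typescript
// Hypothetical diagnostic, independent of any vector store or library.
interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

interface ScoredChunk {
  text: string;
  score: number;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated.
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the topK chunks most similar to the query, best first.
function rankChunks(
  queryEmbedding: number[],
  chunks: EmbeddedChunk[],
  topK: number = 4
): ScoredChunk[] {
  return chunks
    .map((chunk) => ({
      text: chunk.text,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

If the top-ranked chunks for a Portuguese, Chinese, or Turkish question are clearly off topic, the issue lies in embedding or chunking rather than in the prompt.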

dosubot[bot] commented 11 months ago

Hi, @luisrock! I'm Dosu, and I'm here to help the gpt4-pdf-chatbot-langchain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you are experiencing poor results when using the tool with a PDF in Portuguese and you're questioning if the tool works with languages other than English. Other users, such as @yinshipeng and @lucastzuka, have also encountered similar issues with Chinese and Portuguese documents. It seems that @stefansms suggests that the problem may be related to the search for the right context to inject in the prompt. Additionally, @lucastzuka suggests checking if the PDFs being used are protected or if the text pages are converted into images. @ahgsql mentions a similar problem with Turkish, and @thiagopachecoit suggests a pull request that may help.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the project!