harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/
MIT License

Multilinguality in WarcGPT #89

Closed psmukhopadhyay closed 1 month ago

psmukhopadhyay commented 1 month ago

I'm not reporting a problem here, but rather trying to understand the behavior of WARC-GPT (version 0.1.6, which I'm using) with a query in Bengali [the fifth most spoken language in the world, with most speakers in India (West Bengal) and Bangladesh]. I'm using the Groq API with the llama3-70b model (listed as multilingual on its model card). I collected and indexed a set of 300 English-only documents from 250+ sources (around 20 WARC files generated after ingesting) on aspects of Chandrayaan-3. This morning I casually issued the Bengali query "চন্দ্রযান-৩ সম্পর্কে কিছু বলুন", meaning "Tell us about Chandrayaan-3." To my great surprise, I observed that it retrieved relevant text chunks, and although the answer came back in English, its content was quite relevant.

[screenshot]

The very next thing I did was issue the same query in English, to compare the retrieved vectors in the two cases. The retrieved chunks differ: embeddings 8, 13, and 4 for the Bengali query versus 1, 0, and 4 for the English query.

[screenshot]

My question is: how is the LLM able to understand Bengali, and given the mismatch in document vector retrieval, why are the answers to the Bengali and English queries so similar?

Another piece of the puzzle for me (as always) is interpreting the vector visualization. Here I issued this command to visualize the query vector: `poetry run flask visualize --questions="চন্দ্রযান-৩ সম্পর্কে কিছু বলুন"` and the result is:

[screenshot]

Do you have any suggestions for improving multilinguality in WARC-GPT, such as using a different embedding model?

Regards

matteocargnelutti commented 1 month ago

Hi @psmukhopadhyay

Great questions!

Based on the information you provided, my guess would be that although e5-large-v2 is a text-similarity model optimized for English content, it may have some (minimal?) multilingual capability. This could explain why it was able to associate your Bengali question with English content, and why the results differ slightly from when you ask the question in English.
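Mechanically, retrieval here is just nearest-neighbor search in a single shared embedding space: the query vector is compared against every chunk vector, so if the embedding model maps a Bengali query anywhere near the English chunks on the same topic, those chunks come back. A toy sketch of that ranking step (the vectors below are made up for illustration, not real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical chunk embeddings from an indexed collection.
chunks = {
    "chunk_8": [0.9, 0.1, 0.2],
    "chunk_1": [0.8, 0.3, 0.1],
    "chunk_5": [0.1, 0.9, 0.4],
}

# A Bengali query that lands close to, but not exactly on,
# the English phrasing of the same question.
query_bn = [0.85, 0.15, 0.25]
query_en = [0.82, 0.28, 0.12]

def top_chunks(query, k=2):
    """Rank chunks by cosine similarity to the query, best first."""
    return sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)[:k]
```

With these toy numbers, `top_chunks(query_bn)` and `top_chunks(query_en)` return overlapping but differently ordered sets, which mirrors what you observed: nearby query vectors pull in mostly the same sources, in a slightly different order.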

I would be curious to see how intfloat/multilingual-e5-large, which is optimized for multilingual tasks, would perform in that context.
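If I understand the configuration correctly, the embedding model is selected via an environment variable in `.env`; a sketch of the swap (note that chunks embedded with e5-large-v2 live in a different vector space, so the collection would need to be re-ingested after the change):

```shell
# .env — swap the default English-optimized embedding model for the
# multilingual variant. Requires re-ingesting the WARCs, since vectors
# produced by different models are not comparable.
VECTOR_SEARCH_SENTENCE_TRANSFORMER_MODEL=intfloat/multilingual-e5-large
```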

It could also be interesting to prompt the model to respond in Bengali when the question is asked in that language. You could do so either directly in your question, or by modifying WARC-GPT's base prompt (for example, via `TEXT_COMPLETION_BASE_PROMPT` in `.env`).

My guess as to why the responses look so similar in the examples you shared is that there is significant overlap in the sources provided to the text-generation model, and also that the two questions are "the same" from the model's perspective.

Finally, regarding the vector store visualization: the main issue here is that the script picked the same (?) color for one of the sources and for "QUESTIONS", which makes the plot hard to read. I believe the visualization script can be tweaked to help with that.
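I haven't checked what the visualization script does internally, so this is only a generic sketch: one way to guarantee that every label (each source plus "QUESTIONS") gets its own color is to space hues evenly around the color wheel, rather than drawing from a fixed palette that can repeat once there are many sources:

```python
import colorsys

def distinct_colors(labels):
    """Assign each label an evenly spaced hue so no two colors collide."""
    n = len(labels)
    colors = {}
    for i, label in enumerate(labels):
        # Hue varies by position; saturation/value are fixed for legibility.
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.75, 0.95)
        colors[label] = "#%02x%02x%02x" % (int(r * 255), int(g * 255), int(b * 255))
    return colors

# Example: three sources plus the questions series, each a unique hex color.
palette = distinct_colors(["source-a", "source-b", "source-c", "QUESTIONS"])
```

Any plotting library that accepts hex color strings per trace could then use `palette[label]` when drawing each series.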

Have a great day,

psmukhopadhyay commented 1 month ago

Hello @matteocargnelutti

Thanks for your excellent suggestions. They worked like a charm.

[screenshot]

Query: "চন্দ্রযান-৩ প্রকল্প ভারতের জন্য অন্য প্রকল্পগুলির তুলনায় কেন অনন্য?" (in English: "Why is the Chandrayaan-3 project unique for India compared to other projects?")

Answer: "চন্দ্রযান-৩ প্রকল্প ভারতের জন্য অন্য প্রকল্পগুলির তুলনায় অনন্য কারণ এই প্রকল্পের উদ্দেশ্য ছিল চাঁদে মহাকাশযান অবতরণ করা, যা ভারতের জন্য একটি ঐতিহাসিক ঘটনা। চন্দ্রযান-৩ প্রকল্পের সাফল্যের মাধ্যমে, ভারত চাঁদে মহাকাশযান অবতরণ করা দেশের তালিকায় চতুর্থ স্থানে নাম লেখায়।

এছাড়াও, চন্দ্রযান-৩ প্রকল্পের উদ্দেশ্য ছিল চাঁদের ভাসমান অবস্থা, জীবনধারা এবং তাপমাত্রা সহ বিভিন্ন ক্ষেত্রে গবেষণা করা। এই গবেষণার ফলে, চন্দ্রযান-৩ প্রকল্পের সাফল্যের মাধ্যমে, ভারত চাঁদের গবেষণা এবং অন্বেষণের ক্ষেত্রে একটি নতুন মাত্রা পরিচালন করে।

সুতরাং, চন্দ্রযান-৩ প্রকল্প ভারতের জন্য অন্য প্রকল্পগুলির তুলনায় অনন্য কারণ এই প্রকল্পের উদ্দেশ্য ছিল চাঁদে মহাকাশযান অবতরণ করা, যা ভারতের জন্য একটি ঐতিহাসিক ঘটনা।"

In English: "The Chandrayaan-3 project is unique for India compared to other projects, as the objective of the project was to land a spacecraft on the Moon, a historic event for India. With the success of the Chandrayaan-3 project, India became the fourth country to land a spacecraft on the Moon.

Also, the Chandrayaan-3 project was intended to study various aspects of the Moon, including its buoyancy, habitability and temperature. As a result of this research, with the success of the Chandrayaan-3 project, India embarked on a new dimension in lunar research and exploration.

Thus, the Chandrayaan-3 project is unique for India as compared to other projects as the objective of this project was to land a spacecraft on the Moon, which is a historic event for India."

As advised, I've installed version 0.1.7 (as a separate instance of WARC-GPT) and configured the `.env` file as below:

  1. Embedding model:

```
VECTOR_SEARCH_SENTENCE_TRANSFORMER_MODEL=intfloat/multilingual-e5-large
```

  2. Text completion base prompt:

```
TEXT_COMPLETION_BASE_PROMPT = "{history}

You are a helpful assistant.

{rag}

Request: {request}

Query language can be Bengali or English or Hindi. Always provide answer in the language of the query and try to provide helpful response (plain text, no markdown): "
```

I tried qwen2 and llama3.1 (70B and 8B), both locally with Ollama and via Groq cloud, and found that the llama3.1 models (both 70B and 8B) work more effectively for Bengali and Hindi.
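For what it's worth, the placeholder names in the base prompt suggest Python-style string formatting; here is a small sketch of how such a template could be interpolated. The values below are made up for illustration, and I'm assuming `str.format`-style substitution rather than WARC-GPT's actual internals:

```python
# Hypothetical illustration of filling a {placeholder}-style base prompt.
base_prompt = (
    "{history}\n"
    "You are a helpful assistant.\n"
    "{rag}\n"
    "Request: {request}\n"
    "Query language can be Bengali or English or Hindi. "
    "Always provide answer in the language of the query."
)

# Made-up values standing in for conversation history, retrieved
# context chunks, and the user's query.
prompt = base_prompt.format(
    history="(no previous exchanges)",
    rag="Context chunk: Chandrayaan-3 landed near the lunar south pole in 2023.",
    request="চন্দ্রযান-৩ সম্পর্কে কিছু বলুন",
)
```

The filled-in `prompt` string is what the text-generation model would ultimately see, with the language instruction sitting after the retrieved context and the raw (here, Bengali) request.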

Next, I'm planning a single instance supporting all three languages (EN, BN, and HI).

Thanks a lot, and please let me know if you have any further suggestions, or if you'd like me to test other embedding models and LLMs.

Regards