langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

turkish embedding #13276

Closed 4entertainment closed 7 months ago

4entertainment commented 12 months ago

Issue you'd like to raise.

I have the following code for a Q&A system with a retrieval mechanism:

# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from silly import no_ssl_verification
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

with no_ssl_verification():
    # load the document
    loader = TextLoader("paul_graham/paul_graham_essay.txt")
    documents = loader.load()

    # split it into chunks
    text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    # create the open-source embedding function
    embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    # hfemb = HuggingFaceEmbeddings()
    # load it into Chroma
    db = Chroma.from_documents(docs, embedding_function)

    # query it
    query = "What were the two main things the author worked on before college?"
    docs = db.similarity_search(query)

    # print results
    print(docs[0].page_content)

I need to run this retrieval over a Turkish dataset, so I need Turkish embeddings. How can I do that in my code?


dosubot[bot] commented 12 months ago

🤖

Hello @4entertainment, great to see you again! I'm diving into your issue right now. I appreciate your patience and will get back to you with a comprehensive solution shortly.

dosubot[bot] commented 12 months ago

🤖

To use Turkish embeddings in your code, you need to replace the model passed to the SentenceTransformerEmbeddings function with one that supports the Turkish language. However, I couldn't find a specific list of Turkish-capable models in the LangChain repository.

Here's how you can modify your code:

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="your-turkish-model")

Replace "your-turkish-model" with the name of the model that supports Turkish.

Please note that the model you choose should be compatible with the SentenceTransformerEmbeddings function in LangChain. If you're unsure about which model to use, you might want to check the Sentence Transformers library's model list for a model that supports Turkish.

If you're still having trouble finding a suitable model or implementing this, could you please provide more information about the specific issues you're encountering? This will help me provide a more accurate solution.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

4entertainment commented 12 months ago

@dosu-bot

I fixed the code as follows:

# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from silly import no_ssl_verification
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

with no_ssl_verification():
    # load the document
    loader = TextLoader("paul_graham/paul_graham_essay.txt")
    documents = loader.load()

    # split it into chunks
    text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    # create the open-source embedding function
    embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    # hfemb = HuggingFaceEmbeddings()
    # load it into Chroma
    db = Chroma.from_documents(docs, embedding_function)

    # query it
    query = "What were the two main things the author worked on before college?"
    docs = db.similarity_search(query)

    # print results
    print(docs[0].page_content)

I get the following output:

"I was nervous about money, because I could sense that Interleaf was on the way down. Freelance Lisp hacking work was very rare, and I didn't want to have to program in another language, which in those days would have meant C++ if I was lucky. So with my unerring nose for financial opportunity, I decided to write another book on Lisp. This would be a popular book, the sort of book that could be used as a textbook. I imagined myself living frugally off the royalties and spending all my time painting. (The painting on the cover of this book, ANSI Common Lisp, is one that I painted around this time.)

The best thing about New York for me was the presence of Idelle and Julian Weber. Idelle Weber was a painter, one of the early photorealists, and I'd taken her painting class at Harvard. I've never known a teacher more beloved by her students. Large numbers of former students kept in touch with her, including me. After I moved to New York I became her de facto studio assistant.

She liked to paint on big, square canvases, 4 to 5 feet on a side. One day in late 1994 as I was stretching one of these monsters there was something on the radio about a famous fund manager. He wasn't that much older than me, and was super rich. The thought suddenly occurred to me: why don't I become rich? Then I'll be able to work on whatever I want.

Meanwhile I'd been hearing more and more about this new thing called the World Wide Web. Robert Morris showed it to me when I visited him in Cambridge, where he was now in grad school at Harvard. It seemed to me that the web would be a big deal. I'd seen what graphical user interfaces had done for the popularity of microcomputers. It seemed like the web would do the same for the internet."

but I should get "Before college the two main things I worked on, outside of school, were writing and programming."
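For context on why the wrong chunk can come back first: the retrieval step is nearest-neighbour search over embedding vectors, so similarity_search returns whichever chunk's embedding has the highest cosine similarity to the query embedding. If the model embeds an off-topic chunk closer to the query than the correct one, the off-topic chunk wins. A stdlib-only sketch of that ranking, with made-up 3-dimensional vectors standing in for real sentence embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings of query and chunks.
query_vec = [1.0, 0.0, 0.5]
chunks = {
    "chunk about writing and programming": [0.9, 0.1, 0.6],
    "chunk about Interleaf and Lisp":      [0.2, 0.9, 0.1],
}

# similarity_search returns chunks sorted by this score, best first.
best = max(chunks, key=lambda name: cosine(query_vec, chunks[name]))
print(best)  # -> chunk about writing and programming
```

With a model trained mostly on English (or an unsuitable chunking), the geometry can easily favour the wrong chunk, which is consistent with the output above.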

dosubot[bot] commented 12 months ago

🤖

Thank you for providing the code snippet. It's helpful in understanding the issue you're facing. However, to further diagnose the problem, I need a bit more information.

  1. Could you please provide the version of LangChain you are currently using?
  2. Are there any error messages or warnings that appear when you run the code?
  3. Could you also share the content of "paul_graham_essay.txt"? It would be helpful to know the context in which the model is working.
  4. Have you tried using a different model or adjusting the chunk size and overlap parameters in the CharacterTextSplitter?

These details will help me understand the situation better and provide a more accurate solution.



kapusuzoglu commented 11 months ago

Did you find a solution for this?

dosubot[bot] commented 8 months ago

Hi, @4entertainment,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue was opened by you, and it seems like you were looking to incorporate Turkish embeddings into your existing code for a Q&A system with a retrieval mechanism. I see that there was some guidance provided by dosubot and an inquiry about the status of the issue from kapusuzoglu.

Could you please let me know if this issue is still relevant to the latest version of the LangChain repository? If it is, please comment on the issue to let the LangChain team know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!