stefan-it / turkish-bert

Turkish BERT/DistilBERT, ELECTRA and ConvBERT models

retrieval langchain with turkish dataset #36

Open 4entertainment opened 8 months ago

4entertainment commented 8 months ago

I have the following code:

# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from transformers import AutoTokenizer, AutoModel

from silly import no_ssl_verification
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

with no_ssl_verification():
    # load the document
    loader = TextLoader("paul_graham/paul_graham_essay_tr.txt")
    documents = loader.load()

    # split it into chunks
    text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    # create the Turkish embedding function
    # tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
    # model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
    embedding_function = SentenceTransformerEmbeddings(model_name="dbmdz/bert-base-turkish-cased")

    # load it into Chroma
    db = Chroma.from_documents(docs, embedding_function)

    # query it
    query = "Yazarın üniversiteden önce üzerinde çalıştığı iki ana şey neydi?"
    docs = db.similarity_search(query)

    # print results
    print(docs[0].page_content)

How can I fix my code to do QA retrieval with LangChain using turkish-bert embeddings? Please help me.
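One likely issue (a sketch, not a verified fix): dbmdz/bert-base-turkish-cased is a plain masked-language-model checkpoint, so sentence-transformers has to attach a default mean-pooling layer to it, and the resulting vectors are often weak for retrieval. A Turkish model trained for sentence similarity, e.g. emrecan/bert-base-turkish-cased-mean-nli-stsb-tr (an assumed choice here; any Turkish sentence-transformers checkpoint can be swapped in), usually retrieves better:

from langchain.embeddings import HuggingFaceEmbeddings

# assumed Turkish sentence-similarity checkpoint; swap in any sentence-transformers model
embedding_function = HuggingFaceEmbeddings(
    model_name="emrecan/bert-base-turkish-cased-mean-nli-stsb-tr",
    encode_kwargs={"normalize_embeddings": True},  # unit-length vectors, cosine-friendly
)

# rebuild the store with the new embeddings, reusing `docs` from the snippet above
db = Chroma.from_documents(docs, embedding_function)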

kapusuzoglu commented 7 months ago

What is the issue? Have you tried another model?

4entertainment commented 7 months ago

Thank you for your reply, @kapusuzoglu.

My main task is to build a RAG pipeline in my own language. I have an LLM and embeddings. Do you have any code or documentation for this task?
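A minimal end-to-end sketch of that, assuming `db` is the Chroma store built above and that the LLM is wrapped behind LangChain's HuggingFacePipeline (the model id below is only a placeholder for whatever generator you already have):

from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

# placeholder model id; point this at the Turkish LLM you already use
llm = HuggingFacePipeline.from_model_id(
    model_id="your-turkish-llm",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)

# turn the vector store into a retriever and stuff the top-k chunks into the prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa_chain({"query": "Yazarın üniversiteden önce üzerinde çalıştığı iki ana şey neydi?"})
print(result["result"])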

Also, how can I compare vector stores like Chroma, FAISS, etc.?
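One practical way to compare them (a sketch, assuming the `docs` and `embedding_function` from above and that faiss-cpu is installed) is to build each candidate store from the same chunks and compare the rankings they return for the same queries, along with indexing time and memory:

from langchain.vectorstores import Chroma, FAISS

chroma_db = Chroma.from_documents(docs, embedding_function)
faiss_db = FAISS.from_documents(docs, embedding_function)

query = "Yazarın üniversiteden önce üzerinde çalıştığı iki ana şey neydi?"
for name, store in [("chroma", chroma_db), ("faiss", faiss_db)]:
    # raw scores use different distance conventions, so compare rankings, not values
    hits = store.similarity_search_with_score(query, k=3)
    print(name)
    for doc, score in hits:
        print(f"  score={score:.4f}  {doc.page_content[:80]!r}")

Beyond ranking quality, operational differences matter too: Chroma can persist to disk via a persist directory, while FAISS is an in-memory index you save and load yourself.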