endomorphosis opened this issue 3 months ago
I prefer the entire text. By the way, OPEA is microservice-based, so please think about how to contribute to OPEA.
Okay, then the entire text :)
@kevinintel what do you think of using Late Chunking for generating our full-text wiki embeddings?
https://colab.research.google.com/drive/1IIAHEomlhUAisIz1NJTdVdtq-2A0SSCS?usp=sharing
It's specifically designed to leverage the larger context windows made available by recent embedding models, and it makes it easy to capture the semantic relationships between sentences that fall in different chunks.
I was thinking we could use this method to generate our full-text embeddings (it hasn't been used for building large datasets yet). A rough sketch of the idea is below.
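To make the idea concrete, here is a minimal late-chunking sketch, not the notebook's exact code: encode the whole article once with a long-context model, then mean-pool the token embeddings inside each chunk's character span. The BGE-M3 model, the character-span chunk format, and the pooling choice are assumptions on my part.

```python
# Minimal late-chunking sketch (assumptions: BAAI/bge-m3 as the long-context
# model, chunks given as character spans, mean pooling per chunk).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "BAAI/bge-m3"  # 8K-context multilingual embedding model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def late_chunk_embeddings(text: str, chunk_char_spans):
    """Encode the whole document once, then pool token embeddings per chunk."""
    enc = tokenizer(
        text,
        return_tensors="pt",
        return_offsets_mapping=True,
        truncation=True,
        max_length=8192,
    )
    offsets = enc.pop("offset_mapping")[0]  # (num_tokens, 2) character spans
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # (num_tokens, hidden)

    chunk_vectors = []
    for start, end in chunk_char_spans:
        # Select tokens whose character span overlaps this chunk; the last
        # condition drops special tokens, which have a (0, 0) offset.
        mask = (
            (offsets[:, 0] < end)
            & (offsets[:, 1] > start)
            & (offsets[:, 1] > offsets[:, 0])
        )
        if mask.any():
            chunk_vectors.append(token_embs[mask].mean(dim=0))
    return torch.stack(chunk_vectors)
```

The chunk spans could come from any splitter (sentences, paragraphs, fixed windows); the point is that every chunk vector is pooled from token embeddings that were computed with the full-article context.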
https://github.com/HabanaAI/vllm-fork/pull/144 This week I am going to continue trying to get Llama 405B working with speculative decoding via Llama 8B, and to process some Wikipedia datasets and embeddings. I am going to use recursive summarization and sliding-window embeddings on article text that exceeds the embedding window (a sketch of the sliding-window part is below). Ammar is producing the abstract embeddings right now.
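For the sliding-window part, something along these lines is what I have in mind; the window and stride sizes, the word-based splitting, and the use of BGE-M3 through sentence-transformers are assumptions, not the final scripts.

```python
# Sliding-window embedding sketch for articles longer than the embedding
# window (assumed: BAAI/bge-m3, word-based windows of 512 with stride 256).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

def sliding_window_chunks(text: str, window: int = 512, stride: int = 256):
    """Split text into overlapping word windows so no span is dropped."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks

def embed_article(text: str):
    """Embed every overlapping window of an article separately."""
    return model.encode(sliding_window_chunks(text), normalize_embeddings=True)
```

Each article then contributes one row per window, which keeps every passage within the model's context while the overlap preserves continuity across window boundaries.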
https://huggingface.co/datasets/laion/Wikipedia-X https://huggingface.co/datasets/laion/Wikipedia-X-Full
The datasets for both the abstracts and the full text of Wikipedia in 17 different languages have been created. Embeddings are being run on a 3090 server. (My repo has the updated code to compile the dataset.)
Please try to create a PR first. Late Chunking may not be better than the current embedding approach, but you are welcome to expand the functionality.
The abstract embeddings are still running. https://huggingface.co/datasets/laion/Wikipedia-M3
I made a repository for searching through the embeddings, but I am still working on the embedding-generation scripts that recursively summarize and chunk the article text.
https://github.com/endomorphosis/laion-embeddings/tree/master
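For the recursive-summarization piece mentioned above, the shape of it is roughly the following; the BART summarizer, chunk sizes, and word-count threshold are stand-ins, not what the scripts in the repo actually use.

```python
# Recursive summarization sketch: summarize fixed-size chunks, then recurse on
# the concatenated summaries until the text fits the embedding window.
# Assumptions: facebook/bart-large-cnn as the summarizer, 512-word target.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def recursive_summarize(text: str, max_words: int = 512) -> str:
    """Shrink text to at most max_words by repeated chunk-level summarization."""
    words = text.split()
    if len(words) <= max_words:
        return text
    # Summarize each chunk independently, then recurse on the joined summaries.
    chunks = [" ".join(words[i:i + 600]) for i in range(0, len(words), 600)]
    summaries = [
        summarizer(chunk, max_length=150, min_length=40, truncation=True)[0]["summary_text"]
        for chunk in chunks
    ]
    return recursive_summarize(" ".join(summaries), max_words)
```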
@endomorphosis @kevinintel
https://huggingface.co/datasets/laion/Wikipedia-M3
Wikipedia-M3 is done. In this dataset, I generated abstract embeddings for the 10 most widely spoken languages with active research communities.
These languages are:
The initial focus for embeddings was North American, South American, and European languages, the exception being Chinese. We plan to expand to Japanese and Korean in the next iteration with a different, more advanced embedding model, such as the Jina AI 8K-context model from Germany or Jina AI's ColBERT embedding models.
I am mentoring some college students with LAION. One of the students is working on embeddings for Wikipedia. It is not yet ready to be pushed to OPEA, but I want to collect feedback about an issue we discussed.
Do you prefer to have the entire article text in the vector DB, or only the article abstract? Also, I had asked him to follow the example in the Hugging Face datasets documentation with regard to using the Hugging Face FAISS index and Elasticsearch index (a sketch of what I mean is below), but I want to confirm that this is the method that works best for you.
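Concretely, this is the datasets FAISS workflow I had in mind; the dataset split and the "embeddings" / "title" column names are assumptions about the final layout, not the published schema.

```python
# Sketch of the Hugging Face datasets FAISS workflow (assumed: "train" split,
# an "embeddings" column of BGE-M3 vectors, and a "title" column).
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
ds = load_dataset("laion/Wikipedia-M3", split="train")

# Build an in-memory FAISS index over the precomputed embedding column.
ds.add_faiss_index(column="embeddings")

# Embed a query with the same model and retrieve the nearest abstracts.
query = model.encode("history of the printing press").astype("float32")
scores, examples = ds.get_nearest_examples("embeddings", query, k=5)
print(examples["title"])
```

The same dataset object can alternatively be wired to an Elasticsearch index with `add_elasticsearch_index` for keyword search, which is why I asked him to follow the datasets example for both.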
@sleepingcat4 is the college student. His WIP repository is located here: https://github.com/sleepingcat4/wikidataset and his WIP dataset is here: https://huggingface.co/datasets/laion/Wikipedia_11_23_BGE-M3_Embeddings (but I have told him he needs to rework both of these, so be aware that this information is going to change).