joaodsmarques / LumberChunker

This repository presents the original implementation of LumberChunker: Long-Form Narrative Document Segmentation by André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li and Arlindo L. Oliveira (accepted at EMNLP 2024 Findings).

Embeddings file not found #2

Open panzhifeng opened 2 months ago

panzhifeng commented 2 months ago

Hello, and thank you very much for providing the framework! I have a question: how can these embeddings files be obtained or generated? (See attached image.)

joaodsmarques commented 2 months ago

We generate them using OpenAI embeddings:

https://platform.openai.com/docs/guides/embeddings

Next week we will add some examples and a Python file that shows how we generate them.
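Until that file is available, here is a minimal sketch of how such an embeddings file could be produced. It is not the authors' script: the output layout (a CSV with `chunk_id`, `text` and `embedding` columns), the helper names `openai_embed` and `write_embeddings_csv`, and the model name `text-embedding-3-small` are all assumptions made for illustration. The embedding function is passed in as a parameter, so an open-source model can be swapped in for the OpenAI call.

```python
import csv
import json
from typing import Callable, List, Sequence

# Any function mapping a list of texts to a list of float vectors.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def openai_embed(texts: Sequence[str],
                 model: str = "text-embedding-3-small") -> List[List[float]]:
    """Embed texts with the OpenAI embeddings endpoint.

    Assumptions: the `openai` package is installed, OPENAI_API_KEY is set,
    and the model name is only an example (the thread does not say which
    model was used).
    """
    from openai import OpenAI  # imported lazily so the rest runs without it

    client = OpenAI()
    response = client.embeddings.create(model=model, input=list(texts))
    return [item.embedding for item in response.data]


def write_embeddings_csv(chunks: Sequence[str],
                         out_path: str,
                         embed_fn: Embedder = openai_embed) -> int:
    """Embed each chunk and write one CSV row per chunk.

    The column layout is hypothetical; adapt it to whatever the
    downstream code expects. Returns the number of rows written.
    """
    vectors = embed_fn(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("embed_fn must return one vector per chunk")

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["chunk_id", "text", "embedding"])
        for chunk_id, (text, vector) in enumerate(zip(chunks, vectors)):
            # Store the vector as a JSON string so it fits in a single cell.
            writer.writerow([chunk_id, text, json.dumps(vector)])
    return len(vectors)
```

Calling `write_embeddings_csv(chunks, "embeddings.csv")` would use the OpenAI endpoint by default, while passing `embed_fn=my_local_model` (any callable with the same signature) keeps everything offline.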

joaodsmarques commented 2 months ago

You can check our Hugging Face dataset page to see how the embeddings are built. In that example we used open-source embeddings, whereas in the paper we used OpenAI embeddings:

https://huggingface.co/datasets/LumberChunker/GutenQA

zhangyizhuoduanweiwei commented 2 months ago

Thank you very much for your answer! I found the Hugging Face embedding examples you mentioned in the source code. However, there are no output files for those embeddings in formats such as xlsx or csv, so I am still a bit confused about this part and would appreciate your help. I also have another question: is the txt file referenced here, path = "Project-Gutenberg-Embeddings" and txtIn = f'{path}/gutenberg_list.txt', just a list of book titles?