run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Feature Request]: Add more modes to `SentenceEmbeddingOptimizer` #6842

Closed: jon-chuang closed this issue 9 months ago

jon-chuang commented 12 months ago

Feature Description

PruningMode:

RelevanceMode:

JainVidit12 commented 12 months ago

I would like to work on this. To clarify, TopAndContext should iterate through all the nodes and return three consecutive sentences where the middle one is the most relevant (for the before=1, after=1 case)?

jon-chuang commented 12 months ago

Yes, that is correct.

The more general form is TopKAndContext(before,after,top_k).
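For concreteness, here is a minimal sketch of that TopKAndContext selection logic as plain Python (the function name and signature are illustrative, not an existing llama_index API):

```python
# Hypothetical sketch of TopKAndContext-style selection: keep the top_k most
# query-relevant sentences plus `before`/`after` neighboring sentences.
from typing import List
import numpy as np


def top_k_and_context(
    sentences: List[str],
    sentence_embeddings: np.ndarray,  # shape (num_sentences, dim)
    query_embedding: np.ndarray,      # shape (dim,)
    top_k: int = 1,
    before: int = 1,
    after: int = 1,
) -> List[str]:
    # Cosine similarity between the query and each sentence.
    sims = sentence_embeddings @ query_embedding
    sims = sims / (
        np.linalg.norm(sentence_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )

    # Indices of the top_k most relevant sentences.
    top_indices = np.argsort(sims)[::-1][:top_k]

    # Expand each hit with its surrounding context, deduplicating overlaps.
    keep = set()
    for idx in top_indices:
        start = max(0, idx - before)
        end = min(len(sentences), idx + after + 1)
        keep.update(range(start, end))

    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(keep)]
```

With top_k=1, before=1, after=1 this yields three consecutive sentences with the most relevant one in the middle (unless the hit falls at a node boundary), matching the clarification above.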

jon-chuang commented 12 months ago

Btw @logan-markewich, don't you think the name is a little strange? Wouldn't something like NodeLengthOptimizer or NodeContextOptimizer fit better?

logan-markewich commented 12 months ago

The name is definitely a little weird haha. We could change it (assuming we keep some reference to the old name so we don't break people's code).

jon-chuang commented 12 months ago

Yeah, we can keep backwards compatibility but use the new name, similar to GPTVectorIndex -> VectorIndex.

jon-chuang commented 12 months ago

My preference is to use cheaper (and hopefully faster) reranking methods, and also to create a notebook showing how much faster downstream tasks become with the reduced context window size, without sacrificing much accuracy (we need a RAG benchmark for this, not just BEIR).

(Quite excited to do this. Can also help with factoring out the ranking part of ColBERT from the compression/indexing)
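For context, a lightweight cross-encoder reranker along these lines could look like the sketch below (the model choice, example data, and cutoff are illustrative, not anything proposed in this issue):

```python
# Sketch of a cheap reranking pass with a small cross-encoder from
# sentence-transformers, instead of a heavier embedding- or LLM-based ranker.
from sentence_transformers import CrossEncoder

query = "What did the author do growing up?"
candidate_sentences = [
    "The author grew up writing short stories.",
    "The weather that year was unusually warm.",
    "They also programmed on an IBM 1401 in high school.",
]

# A small cross-encoder is far cheaper than re-embedding or calling an LLM.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, s) for s in candidate_sentences])

# Keep only the top-scoring sentences to shrink the context window.
ranked = sorted(zip(scores, candidate_sentences), reverse=True)
top_sentences = [s for _, s in ranked[:2]]
print(top_sentences)
```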

JainVidit12 commented 9 months ago

Added pull request #7730 to add before/after context. I was not able to obtain the separator (' ' / '.') that was used to split the nodes, so the returned text is by default joined with a period (". ").
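A usage sketch, assuming the merged PR exposes the before/after context as `context_before`/`context_after` parameters (the import path and exact parameter names may differ by llama_index version):

```python
# Hedged usage sketch for the before/after context from #7730.
from llama_index.core.postprocessor import SentenceEmbeddingOptimizer

optimizer = SentenceEmbeddingOptimizer(
    percentile_cutoff=0.5,   # keep the top 50% of sentences by relevance
    context_before=1,        # also keep the sentence before each kept sentence
    context_after=1,         # and the sentence after it
)

# Applied as a node postprocessor at query time, e.g.:
# query_engine = index.as_query_engine(node_postprocessors=[optimizer])
```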