rahmansahinler1 / ragchat_local

ragchat is a chatbot that gives you the most up-to-date information with resources in your documents
MIT License

Search By Section #31

Closed ozgurnsahin closed 1 month ago

ozgurnsahin commented 2 months ago

Description: As observed in task-1, it seems that ragchat hallucinates sentences. To improve its extraction capability, we will implement search by document section.

Workflow:
1-Implement logic to classify a sentence as a section header or not.
2-Implement logic to select the sentences that belong to a section.
3-Boost the section sentences according to the outcome of the semantic search over the sections.
4-Update the search output vector.
5-Test the outcome under the same conditions.

Acceptance Criteria: At least a 10% improvement (F1) in the scores of the questions that are affected by this feature.
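The issue does not say how F1 is computed or how the improvement is measured, so the following is only a sketch under assumptions: a token-overlap F1 between an answer and a reference answer, and a relative improvement check against the 10% threshold. The names `token_f1` and `meets_criterion` are hypothetical and are not taken from the ragchat code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a perfect match, one empty as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def meets_criterion(before: float, after: float, threshold: float = 0.10) -> bool:
    """Assumed reading of the criterion: relative F1 gain of at least `threshold`."""
    return (after - before) / before >= threshold
```

For example, `meets_criterion(0.50, 0.56)` is true (a 12% relative gain), while `meets_criterion(0.50, 0.52)` is not.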

ozgurnsahin commented 2 months ago

Workflow:

1-Section Detection: Create logic to find and locate section headers among the sentences using regex, the spaCy matcher, font size, etc.
2-Labeling sentences: Label all sentences according to the section findings.
3-Find most relevant section: Create a Faiss search to find the most semantically similar section.
4-Boost sentences: Boost the most relevant sentences according to the section search.
5-Integrate boosting within search: Create logic to integrate the boosted sentences into the search logic.
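Steps 1 and 2 above could be sketched roughly as follows. This covers only the regex option, not the spaCy matcher or font size. The pattern (numbered headings or short all-caps lines without closing punctuation) and the names `HEADER_PATTERN`, `is_header` and `label_sentences` are assumptions made for illustration, not the logic used in ragchat.

```python
import re

# Assumed heuristic: "2.1 Methods"-style numbered headings, or short ALL-CAPS lines.
HEADER_PATTERN = re.compile(
    r"^(\d+(\.\d+)*\.?\s+[A-Z].{0,80}|[A-Z][A-Z \-]{2,60})$"
)


def is_header(sentence: str) -> bool:
    """Step 1: decide whether a sentence looks like a section header."""
    text = sentence.strip()
    # Headers rarely end with sentence punctuation.
    if not text or text[-1] in ".!?":
        return False
    return bool(HEADER_PATTERN.match(text))


def label_sentences(sentences):
    """Step 2: label every sentence with the most recent header seen (None before the first)."""
    labels = []
    current = None
    for sentence in sentences:
        if is_header(sentence):
            current = sentence.strip()
        labels.append(current)
    return labels


doc = [
    "Preface text here.",
    "1. Introduction",
    "This paper studies X.",
    "METHODS",
    "We use Y.",
]
labels = label_sentences(doc)
```

Here a header sentence is labeled with itself, so the header text and its body sentences share one label that can later be matched against the section search result.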

ozgurnsahin commented 1 month ago

Implementations that will be added with database deployment: 1-Extracting resources from section search contexts.

ozgurnsahin commented 1 month ago

Integrated boosting within search:
1-Create a boosting np.array as long as the sentence list to store the boost coefficient of every sentence
2-Manipulate the boost coefficients of the sentences under the most similar headers
3-Multiply the boost coefficients with all sentence distances
4-Sort the sentences according to their final boosted distances
5-Create the context from the top 10 sentences
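The five steps above can be sketched with NumPy as below. Two things are assumptions here, since the comment does not state them: that a smaller distance means a closer match (so a coefficient below 1 acts as a boost), and the fixed coefficient value of 0.8. The function name `boost_and_rank` and the sample data are hypothetical.

```python
import numpy as np


def boost_and_rank(distances, header_sentence_ids, boost=0.8, top_k=10):
    """Return the indices of the top_k sentences after boosting.

    Assumes smaller distance = more similar, so boost < 1 pulls
    the boosted sentences closer to the query.
    """
    distances = np.asarray(distances, dtype=float)
    # 1. One coefficient per sentence, neutral (1.0) by default.
    coefficients = np.ones(len(distances))
    # 2. Change the coefficients of the sentences under the most similar headers.
    ids = np.asarray(list(header_sentence_ids), dtype=int)
    coefficients[ids] = boost
    # 3. Multiply the coefficients with all sentence distances.
    boosted = distances * coefficients
    # 4. Sort by boosted distance, ascending; stable so ties keep document order.
    order = np.argsort(boosted, kind="stable")
    # 5. Keep the top_k sentences for the context.
    return order[:top_k].tolist()


sentences = [
    "Intro text.",
    "Refund policy details.",
    "Shipping times.",
    "Refund exceptions.",
]
distances = [0.50, 0.60, 0.55, 0.62]
top = boost_and_rank(distances, header_sentence_ids=[1, 3], boost=0.8, top_k=3)
context = " ".join(sentences[i] for i in top)
```

In this toy input, sentences 1 and 3 start behind sentence 0 but move ahead of it once their distances are scaled by 0.8. If the search returns similarity scores instead of distances, the coefficient would need to be above 1 and the sort descending.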


Obgoktas commented 1 month ago

Section search is implemented with dynamic boosting logic based on the distances of the headers. Widened sentences are improved using the most semantically similar sentence. A confidence level is added at the end of the contexts according to their context tier list. The reading function now uses Fitz with header extraction logic.

Improvements: 35 of 70 tests. Total points went from 191 to 228, a 19% accuracy improvement.