Ana - Githubissues

This is the original code I created to connect the backend to the frontend. First, I split the demo code into three files: one that retrieves the dataset from Hugging Face, one called keywords.py that matches the query to the data, and app.py, which holds the Flask endpoints for the whole application.

I worked on this code until the day of the presentation. The code retrieves the documents by name, then splits each document into a dictionary called paragraph_dict, which stores three things for each paragraph: its index, the chunk (the paragraph text itself), and that paragraph's keywords. The dictionary is then saved to a binary file for later retrieval, so the documents do not have to be processed again every time we run the program.
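The caching step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: it assumes the binary file is written with `pickle`, and the helper names (`split_into_paragraphs`, `extract_keywords`, `build_paragraph_dict`, `load_or_build`), the blank-line splitting rule, and the toy keyword extractor are all hypothetical.

```python
import pickle
from pathlib import Path


def split_into_paragraphs(text):
    """Split a document on blank lines and drop empty chunks (assumed rule)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def extract_keywords(chunk):
    """Toy keyword extractor (assumption): lowercase words longer than 3 letters."""
    words = (w.strip(".,;:!?()\"'").lower() for w in chunk.split())
    return sorted({w for w in words if len(w) > 3})


def build_paragraph_dict(text):
    """Map each paragraph index to its chunk and that chunk's keywords."""
    return {
        i: {"chunk": chunk, "keywords": extract_keywords(chunk)}
        for i, chunk in enumerate(split_into_paragraphs(text))
    }


def load_or_build(text, cache_path):
    """Return the cached paragraph_dict if the file exists, else build and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    paragraph_dict = build_paragraph_dict(text)
    with cache_path.open("wb") as f:
        pickle.dump(paragraph_dict, f)
    return paragraph_dict
```

With this shape, the first run pays the processing cost and later runs only read the file. One caveat of this sketch is that the cache is never invalidated, so the file has to be deleted by hand when the documents change.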

The keywords.py file contains all the functions that match the query to the specific content of the documents. The two most important are populate_keywords_to_chunks_index, which iterates over the keywords and stores them in a defaultdict, and search_query, which matches the query's keywords against the paragraph indexes and the paragraphs' keywords (query keywords == paragraph keywords). It then chooses the best_match by seeing which keyword appears the most in the paragraphs, and that paragraph is taken as the correct one to answer the query.
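As a rough sketch of the matching idea, the two functions could look something like this. The function names come from the description above, but the signatures, the shape of paragraph_dict, and the scoring rule are assumptions: here the best match is simply the paragraph that shares the most keywords with the query, with ties going to the lowest index.

```python
from collections import Counter, defaultdict


def populate_keywords_to_chunks_index(paragraph_dict):
    """Build an inverted index: keyword -> set of paragraph indexes containing it."""
    index = defaultdict(set)
    for idx, entry in paragraph_dict.items():
        for keyword in entry["keywords"]:
            index[keyword].add(idx)
    return index


def search_query(query_keywords, index, paragraph_dict):
    """Return the chunk sharing the most keywords with the query, or None."""
    hits = Counter()
    for keyword in query_keywords:
        # .get() avoids creating empty entries in the defaultdict on a miss.
        for idx in index.get(keyword, ()):
            hits[idx] += 1
    if not hits:
        return None
    # Highest match count wins; ties are broken by the lowest paragraph index.
    best_idx, _ = max(hits.items(), key=lambda item: (item[1], -item[0]))
    return paragraph_dict[best_idx]["chunk"]
```

A small usage example with made-up data:

```python
paragraph_dict = {
    0: {"chunk": "Taxes fund public schools.", "keywords": ["taxes", "schools"]},
    1: {"chunk": "Carbon taxes cut emissions.", "keywords": ["carbon", "taxes", "emissions"]},
}
index = populate_keywords_to_chunks_index(paragraph_dict)
best_match = search_query(["carbon", "taxes"], index, paragraph_dict)
```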

Finally, we have app.py, which has all the endpoints the application needs to run locally and connect to the backend. The /ask endpoint is the one that uses the gpt-4 model to power the chatbot. It is the core endpoint of the chatbot, as it uses all of the main functions.
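To give an idea of the shape of such an endpoint, here is a minimal Flask sketch. It is not the actual app.py: the JSON field names (`question`, `answer`) are assumptions, and `retrieve_context` and `generate_answer` are hypothetical stand-ins for the keyword search and the gpt-4 call, so the example runs without an API key.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def retrieve_context(question):
    """Hypothetical stand-in for the keyword search in keywords.py."""
    return "No matching paragraph found."


def generate_answer(question, context):
    """Hypothetical stand-in for the gpt-4 call; returns a canned string."""
    return f"(model reply to {question!r} using context: {context})"


@app.route("/ask", methods=["POST"])
def ask():
    # silent=True returns None instead of raising on a missing or invalid JSON body.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    question = str(payload.get("question", "")).strip()
    if not question:
        return jsonify({"error": "Missing 'question' field."}), 400
    context = retrieve_context(question)
    return jsonify({"answer": generate_answer(question, context)})
```

Saved as app.py, this can be started locally with `flask run`, and the frontend would then POST a JSON body such as `{"question": "..."}` to /ask.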

To help Cristina Lorenzo create the diagram for the frontend, I also implemented a graph.py file that tries to build the diagram from the input and the output of the conversation. This still has to be completed.

Here I describe my work. All of the code is mine, and I've put a lot of effort into making the chatbot run. However, there are still some things that have to be improved, such as optimizing how the data is retrieved from the documents, which are heavy because of how many lines of content they contain. Another improvement would be, instead of building everything from scratch (which was very hard and took a long time), to use the weights of the fine-tuned model made by Rama and Cristina Requena. They were working on it, but we ran out of time to try to implement it.

solsylph / Debate-Chatbot

Ana #26