What will be the UI that will use the NLQ component?

robknapen commented 5 months ago

Some coordination between building the UI and the NLQ service might be handy.

BerkvensNick commented 4 months ago

I think for the first iteration we can integrate a UI-component in the site Paul has set up: https://soilwise-he.containers.wur.nl

I think a lot of the nicely listed issues in this repository can only really be dealt with if we have a clear use case that can convince JRC to look into using this technology. Currently you can search the SWR catalogus (pyscw) by entering search terms into the search box or ticking certain filters (see prototype by Paul), what can a LLM improve for a user when searching in this catalogus?

Go a step further and also analyse the content of the identified documents, making it available for the user
Provide search functionality in other datafields like author, date, …
Provide search functionality not only purely related to the search term, but also documents containing terms closely related to the search term, … a wider contextual search
Multilingual search functionality
… others?

maybe we can identify one extra LLM-functionality to the current catalogus and further elaborate this as a use case … and then also tackle the other issues with this?

robknapen commented 4 months ago

Ok. Although I consider the NLQ component as being a backend service that would only provide an API, for demo/prototype purposes we can add a simple web frontend using e.g. chainlit. For further integration and total customisation there is a React JS package (https://www.npmjs.com/package/@chainlit/react-client) that can talk to a chainlit based backend.

BerkvensNick commented 4 months ago

@robknapen , I see you have added some more information in the repository (thanks!!). I get the impression the functionality you currently think most feasible in the SWR is having a LLM generate a SPARQL query that is run on the Virtuso database? To identify relevant documents or to extract more specific knowledge? I think this could be a valid new functionality next to the other suggested ideas:

in summary: Currently you can search the SWR catalogus (pyscw) by entering search terms into the search box or ticking certain filters (see prototype by Paul), what can a LLM improve for a user when searching in this catalogus?

Go a step further and also analyze the content of the identified documents, making knowledge in the documents available for the user
Provide search functionality in other datafields like author, date, … (I played around a little and think PYSCW currently searches in all data fields like author and data, so this extra functionality is not relevant)
Provide search functionality not only purely related to the search term, but also documents containing terms closely related to the search term, … a wider contextual search
Multilingual search functionality (knowledge graph could then be in English and LLM provides multilingual interaction with user)
Have a LLM generate a specific SPARQL query for a user question comprising more than a keyword(s) based search (e.g. a combination of multiple datafields each with certain restrictions, e.g. a dataset about erosion and grasslands published between 2022 and 2024 and originating from Western Europe), this query is then run on the Virtuoso database providing the requested documents/knowledge

I think we have to chose one and then go further from here. I have the feeling a lot of your questions/issues can only be clarified in function of the chosen functionality. Maybe also ask opinions of Paul, Rob and some other project partners?

robknapen commented 4 months ago

@BerkvensNick It basically follows from the current architecture design. We have the LLM component using the SPARQL interface to access a triple store. This limits what can be implemented. Some of the other things can either be solved with classical NLP, or we have to think more serious about what type of embeddings we will calculate and where we will store them, to facilitate more advanced types of semantic searching (and maybe exploration of the knowledge that we harvest - but we don't have to go there).

robknapen commented 4 months ago

Please add your feedback @pvgenuchten @roblokers .

roblokers commented 4 months ago

A few thoughts from my side

translation is probably required and very useful. Should probalby be a priority functionality, but is it the most logical functionality to exploit LLM for? Should e.g. we consider and check ? And if the starting point is an EN knowledge graph, we would also need it at harvesting time, as metadata and other content will come in in different languages.
A LLM generated query on multiple fields as such sounds not very exiting (seems like that could be codable in convential ways) unless we include semantic search. From the previous conversation (noting the remark on following the current archtecture and the limitations) it seems that this might not be feasible for the short (Sep 2024) term. Nevertheless, it seems logical for me that we try in the direction of the wider contextual search, exploiting/experimenting with a simple ontology to see how and/or prove that it works

robknapen commented 4 months ago

The wider contextual search (expand search terms with broader terms, synonyms, semantically similar words, etc.) does not require use of an LLM, I think it can be solved with classical NLP and e.g. word embeddings. Also it does not require the text generation part of an LLM to create output.

So if we focus on enhancing the search functionality in the metadata catalogue, does that mean we pause the development of this NLQ component and will not put a chatbot in the UI? (instead introduce a new component for it or add the functionality to an already existing component in the architecture?)

BerkvensNick commented 4 months ago

I had a few questions @robknapen and @roblokers :

Regarding the use case "a LLM generated query on multiple fields" and that this would be codable in conventional ways. I guess currently people will be searching in the catalogue based on "keywords", however, when the knowledge graphs is enriched (e.g. with the ontology Beichen and Luis are developing) and I assume also then more complex, do you think "a LLM generated query" would be valuable to find documents for more complex searches or will this still be codable in conventional ways?

Secondly, I guess the use case "Go a step further and analyze the content of the identified documents, making knowledge in the documents directly available for the user" or in other words the RAG approach you mentioned, seems to be the only approach that could add value to the current catalogue? Or am I wrong in this? If so, I assume this would imply setting up a vector database containing. JRC has frequently mentioned they do not have financial resources for a lot of storage. Do you see the vector database as a large storage cost? Or is there maybe a way to populate the vector database on the fly with only the "chunks" for the identified documents related to a search of a user and then perform RAG? (don't think this is feasible, just checking)

roblokers commented 4 months ago

With regard to the second point: can we imagine that similar to HV datasets we would have HV documents, e.g. a limited set of "leading" knowledge on soil health? In that case we might have RAG based on a manageable subset that will be stored in some form(s), while others would rely on DOI linkages,

robknapen commented 4 months ago

Sure, any kind of hybrid approach is possible and will have trade-offs between (embedding) storage and compute costs, latency, processing time, and relevance (and other metrics) of the generated response.

LLMs are interesting tools for (multi-lingual) information retrieval. RAG and combining them with knowledge graphs help with grounding answers, handling more complex searches, work with domain specific and up-to-date information, etc. Usually this is in a Q&A or conversational context. It is a bit overkill to consider them to improve on simple keyword based search in a catalogue or few-term queries over structured data.

BerkvensNick commented 4 months ago

I think we might have some HV documents (e.g. the document Luis and Beichen are working on and other in e.g. [https://github.com/soilwise-he/natural-language-querying/issues/12]), and I assume we can also get hold of relevant documents via DOI linkages (?), but is there a way to not have to permanently store these documents (or chunks) in a vector database to keep storage to a minimum, this also populate on the fly? or can we assume storage will be minimal for RAG?

ok, just saw your answer higher, thanks!

robknapen commented 4 months ago

Keep in mind that 'populating on the fly' sounds easy, but it can take long (depending on the number and size of the documents, from minutes to many hours) while incurring costs (every time), which is kind of wasteful.

soilwise-he / natural-language-querying

What will be the UI that will use the NLQ component? #5