soilwise-he / natural-language-querying

Application component that provides Natural Language Querying (NLQ) services, making knowledge stored in a graph database accessible for e.g. a ChatBot UI.
MIT License

Provide a filled Graph Database and other input resources to be used by the NLQ component #12

Open robknapen opened 2 months ago

robknapen commented 2 months ago

Some relevant content in a Graph Database (initially the Virtuoso triple store?) and perhaps a set of documents describing core soil concepts is needed to get started with development.

The core documents will help provide basic soil knowledge to the LLM, which might only have been trained on general soil information and documents, if any.

BerkvensNick commented 1 month ago

@robknapen I assume the Virtuoso instance set up by Hugo and Paul can be used? I will also contact soil science specialists at ILVO to ask whether they know of documents describing core soil concepts.

robknapen commented 1 month ago

I think so, but still have to check.

robknapen commented 1 month ago

Currently it seems very limited. There are two named graphs, basically lists without much further structure. The subject is a CORDIS URI (an extraction from CORDIS based on search criteria), and per subject there is a dcat#title and a datacite/doi available (in one of the graphs).

I think we either need to harvest more information, have the interlinker add more metadata (tags, keywords, description, etc.), or have the NLQ component use the DOI to retrieve documents and fill a vector database from them.

@pvgenuchten @hugodegrootwurnl @roblokers how can this be further developed? Or am I not seeing something that is in Virtuoso already?
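
As a rough illustration of the third option (using the DOI to retrieve material for a vector database), here is a minimal sketch that asks doi.org for CSL JSON metadata through DOI content negotiation and pulls out the text worth embedding. The function names are hypothetical, and it assumes the registration agency returns a `title` and possibly an `abstract` field, which is not guaranteed for every DOI. Fetching the full document and the embedding step itself are left out.

```python
import json
import urllib.parse
import urllib.request

# Media type for CSL JSON, served via DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def doi_metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a metadata request for a DOI."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_doi_metadata(doi: str) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(doi_metadata_request(doi), timeout=30) as response:
        return json.load(response)


def text_for_embedding(metadata: dict) -> str:
    """Join the human-readable fields that are present (assumed: title, abstract)."""
    parts = [metadata.get("title", ""), metadata.get("abstract", "")]
    return "\n".join(p.strip() for p in parts if isinstance(p, str) and p.strip())
```

Calling `text_for_embedding(fetch_doi_metadata("10.1002/2017JG004269"))` would give a string that could be passed to whatever embedding model is chosen later.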

robknapen commented 1 month ago

I found some information about what we intend to harvest here: ingestion. I would also suggest collecting as many topics, themes, subjects, tags, descriptions, summaries, and abstracts as possible. And make sure those are recognisable (i.e. marked in some way) as ‘human’ provided text, and not generated by some SoilWise AI algorithm (which we might apply later to enrich the graph).
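
One possible way to keep that distinction visible is to store an explicit origin with every piece of text. The sketch below is only an illustration: the `Origin` and `Annotation` names and the example values are hypothetical, not part of any SoilWise schema.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    HUMAN = "human"          # text provided by the original source or its authors
    GENERATED = "generated"  # text added later by an enrichment algorithm


@dataclass(frozen=True)
class Annotation:
    subject: str   # URI of the resource the text describes
    kind: str      # e.g. "keyword", "abstract", "summary"
    text: str
    origin: Origin


def human_provided(annotations: list[Annotation]) -> list[Annotation]:
    """Keep only the annotations that were not machine generated."""
    return [a for a in annotations if a.origin is Origin.HUMAN]
```

With a flag like this, the NLQ component could prefer or filter on human-provided text without having to guess where a description came from.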

pvgenuchten commented 1 month ago

RobK, you are correct, the current set is very limited. We need to put some effort asap into ingesting other sources, such as INSPIRE and OpenAIRE, before sensible queries can be made…

BerkvensNick commented 1 month ago

I asked some colleagues here at ILVO and the following sources could be interesting documents describing core soil concepts:

robknapen commented 1 month ago

Thanks for the references @BerkvensNick, those could indeed be of interest. They might need some processing effort though, since AGROVOC is a very large thesaurus with many terms, so perhaps some filtering and extraction of the relevant parts is needed. For the book, I don't seem to have access to a digital version. Our WUR library has a printed edition, but it is copyrighted material anyway, so I guess we are not free to use it.

BerkvensNick commented 1 month ago

Fenny's mail about the Australian National Soil Information System (ANSIS) points to standards/vocabularies and documents that could also be valuable/interesting: Australian National Soil Information System (ANSIS) - standards

pvgenuchten commented 1 month ago

a starting point could be these documents (collected in sharepoint/wp2/background):

BerkvensNick commented 1 month ago

I also got these 2 references from one of our soil specialists:

hugodegrootwurnl commented 1 month ago

The Virtuoso instance has since been filled with a lot more. For example, to query all available attributes for a certain DOI:

SPARQL endpoint: https://sparql.soilwise-he.containers.wurnet.nl/sparql/

Named Graph: https://cordis.europa.eu/datalab/sparql-endpoint/CordisSoil2594Publications

Query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX eurio: <http://data.europa.eu/s66#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX datacite: <http://purl.org/spar/datacite/>

select ?sub ?pred ?obj
where {
  ?sub ?pred ?obj
  {
    select ?sub
    WHERE {
      ?sub ?pred ?obj
      FILTER (?obj = "10.1002/2017JG004269"^^<http://www.w3.org/2001/XMLSchema#anyURI>)
      FILTER (?pred = <http://purl.org/spar/datacite/doi>)
    }
  }
}
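
For anyone who wants to run this from the NLQ component, here is a minimal Python sketch using only the standard library. It rests on a few assumptions: the endpoint follows the standard SPARQL protocol (`query` and `default-graph-uri` parameters, JSON results), the function names are hypothetical, and the query is a simplified form of the one above that matches the DOI literal directly instead of using a subquery with filters, which should select the same subjects.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://sparql.soilwise-he.containers.wurnet.nl/sparql/"
GRAPH = "https://cordis.europa.eu/datalab/sparql-endpoint/CordisSoil2594Publications"


def build_doi_query(doi: str) -> str:
    """Return a SPARQL query listing every triple of the subjects with this DOI."""
    # Escape backslashes and quotes so the DOI stays a valid string literal.
    safe = doi.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?sub ?pred ?obj WHERE {{
  ?sub datacite:doi "{safe}"^^xsd:anyURI .
  ?sub ?pred ?obj .
}}
"""


def run_query(query: str, endpoint: str = ENDPOINT, graph: str = GRAPH) -> dict:
    """Send the query over HTTP GET and decode the JSON results (needs network access)."""
    params = urllib.parse.urlencode({"query": query, "default-graph-uri": graph})
    request = urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def rows_from_results(results: dict) -> list[tuple[str, str, str]]:
    """Flatten SPARQL JSON results into (subject, predicate, object) tuples."""
    return [
        (b["sub"]["value"], b["pred"]["value"], b["obj"]["value"])
        for b in results["results"]["bindings"]
    ]
```

Usage would be along the lines of `rows_from_results(run_query(build_doi_query("10.1002/2017JG004269")))`, assuming the endpoint is reachable from where the code runs.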