soilwise-he / soilwise-ontology

The SoilWise ontology is a model for a knowledge graph including datasets, organisations and repositories.
MIT License

How to mark knowledge items (in the KG) that are generated or enhanced by AI? #2

Open robknapen opened 6 months ago

robknapen commented 6 months ago

Wherever AI models are used to generate or enhance e.g. knowledge items or metadata, we want to mark these as 'non-human' contributions. Sometimes a contribution might also be mixed (e.g. a human-written abstract, extended with AI-generated text). We need a good way to store this characteristic, e.g. as a part or attribute of the knowledge graph. It might be a specific extension or ontology?
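As a minimal sketch of what the mixed case could look like, assuming a hypothetical sw: namespace and sw:contributionType property (neither exists in the ontology yet):

    @prefix sw:  <http://example.com/soilwise#> .   # hypothetical namespace, for illustration only
    @prefix dct: <http://purl.org/dc/terms/> .

    <http://example.com/knowledge-item/42>
        dct:abstract "Human-written summary, extended with AI-generated text." ;
        sw:contributionType sw:Mixed .   # alternatives: sw:Human, sw:AI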

robknapen commented 6 months ago

Some approaches that can be considered:

The last two might be the best solutions performance- and size-wise.

robknapen commented 5 months ago

Asking Mistral LLM:

what is a good way to indicate in an ontology the difference between content provided by humans and content generated or augmented by artificial intelligence?

In an ontology, it's essential to differentiate between data that is generated or provided by humans (also known as ground truth or expert knowledge) and data that is generated or augmented by artificial intelligence. One common approach for distinguishing these two types of data in an ontology involves using metadata properties or labels.

Here are a few suggestions on how you might indicate the difference between human-provided and AI-generated data in your ontology:

  1. Add a new property or metadata tag to represent the source of the data: Create a new property in a dedicated namespace, e.g. a dataSource property in <http://example.com/ontology/dataSource#>, with values like HumanDataSource and AIDataSource. Assign this property to each data instance in your ontology, indicating whether it is derived from human sources or AI systems. For example:
@prefix pp: <http://example.com/ontology/physicalProperties#> .
@prefix ds: <http://example.com/ontology/dataSource#> .

<http://example.com/soil1> a pp:PhysicalProperties ;
    pp:texture ( <http://example.com/sand> <http://example.com/silt> ) ;  # RDF collection, ordered
    pp:structure ( <http://example.com/wellDrained> ) ;
    ds:dataSource ds:HumanDataSource .  # marks this instance as human-provided
  2. Use a distinct namespace or prefix for AI-generated data: Assign a unique namespace or prefix to the ontology elements representing AI-generated data, distinguishing them from human-provided data. This can make it easier to filter and query data based on its source in your ontology (a combined sketch of this and the next suggestion follows the list).

  3. Add versioning or provenance metadata: Include additional metadata properties such as version or provenance, which capture information about the origin of the data, when it was generated, and by whom. These properties can help provide context for understanding the data's source and trustworthiness in your ontology.

  4. Incorporate a trust model: Develop a trust model to assess the reliability or accuracy of AI-generated versus human-provided data, taking into account factors like data sources, algorithms, and context. This can help ensure that your ontology maintains high-quality data while making use of both human expertise and AI capabilities.

  5. Apply access control: Apply access control measures to restrict or grant access to different types of data based on their source, ensuring that sensitive or critical information remains protected in your ontology. This can be especially important when dealing with large amounts of data generated by AI systems, which may require additional security considerations.

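A combined Turtle sketch of suggestions 2 and 3; the ai: and hu: prefixes, the model URI and the date are hypothetical, purely to illustrate the idea:

    @prefix ai:  <http://example.com/ai#> .      # hypothetical prefix for AI-generated terms
    @prefix hu:  <http://example.com/human#> .   # hypothetical prefix for human-provided terms
    @prefix dct: <http://purl.org/dc/terms/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    <http://example.com/dataset/1>
        dct:subject hu:soilErosion ,   # keyword assigned by a human curator
                    ai:soilSealing .   # keyword suggested by an AI model

    # Provenance metadata (suggestion 3) attached to the AI-generated term.
    ai:soilSealing
        dct:creator <http://example.com/models/some-llm> ;
        dct:created "2024-05-01"^^xsd:date .

The namespace alone already makes AI content filterable; the provenance properties additionally record which model produced it and when.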
Ultimately, the best approach for distinguishing between human-provided and AI-generated data in an ontology depends on the specific requirements of your use case, the level of trust you have in your AI systems, and the desired outcomes from your ontology. By choosing a method that suits your needs, you can effectively manage and leverage both types of data to build accurate, comprehensive, and valuable knowledge models.

robknapen commented 5 months ago

I have moved this issue to this repository, since the NLQ component will not be updating or enhancing knowledge items, only using them as context for generating responses to questions (which is AI output in any case). Marking items as human/AI content needs to be done where they are manipulated, i.e. when harvesting writes them to the knowledge base or when metadata enhancement takes place using machine learning.

When items are properly marked, the NLQ component can include this kind of information in its responses to the user asking questions.

wbcbugfree commented 2 months ago

https://github.com/soilwise-he/soil-health-knowledge-graph provides an implementation of a quad store with named graphs to solve this issue. In this implementation, all AI-generated content (here mainly keywords) is stored in a named graph called "generated", isolated from the raw metadata records. The RDF quads are serialized using TriG notation (Turtle plus named graphs).
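A minimal TriG sketch of this layout; only the "generated" graph name comes from the implementation, while the graph URIs, record URI and keyword values are made up for illustration:

    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix dcat: <http://www.w3.org/ns/dcat#> .

    # Raw harvested metadata stays in its own graph, untouched.
    <urn:graph:harvested> {
        <http://example.com/record/123> a dcat:Dataset ;
            dct:title "Soil organic carbon measurements" .
    }

    # AI-generated keywords are isolated in the "generated" named graph.
    <urn:graph:generated> {
        <http://example.com/record/123> dcat:keyword "soil organic carbon" , "carbon stock" .
    }

Queries can then include or exclude the "generated" graph as needed, which also lets the NLQ component flag AI-derived content in its answers.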