regen-network / regen-registry-standards

:seedling: RDF and SHACL schemas for Regen Registry

Define schema for project update posts #82

Open blushi opened 9 months ago

blushi commented 9 months ago

We need to define the schema for project posts content which should include (TBC):

Privacy settings: The entire post content can be private. The files can be private. The files locations can be private.

blushi commented 9 months ago

Hey team! Please add your planning poker estimate with Zenhub @aaronc @blushi

aaronc commented 8 months ago

I discussed this briefly with @blushi today.

Here are my thoughts:

  1. let's use either Dublin Core or schema.org as much as possible for properties. I was suggesting in #79 that we prefer Dublin Core because of more usage in the scientific domain but would be happy to hear others' opinions. Title, description, and probably locations and files should be covered somewhat by both schemas already
  2. let's use WKT #78 for locations
  3. let's not overuse lists #81 for ease of queryability - also open to discussion
blushi commented 8 months ago

cc/ @paul121

paul121 commented 8 months ago

+1 to WKT and fewer linked-lists :+1:

The choice for the standard/schema is interesting. Generally I've been thinking it would make sense to use schema.org for these project updates but I really haven't given it much thought up until now. I'm not very familiar with Dublin Core and just doing some research now, but realizing I have seen the DC prefix used in various places (@prefix dc: <http://purl.org/dc/elements/1.1/>.), so perhaps I am more familiar than I thought. Most of the DC vocabulary seems to be included here: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

Some of my thoughts:

I'm starting to wonder... are project updates meant to be "web content" in their native form? Or are they really meant to be (semi) scientific observations, claims, datasets, etc? I may need a refresher on the scope/requirements for the Registry Web App. But in a general sense I think I'm leaning towards structuring or conceptualizing these as more "scientific" in their native form, and thus DC and DWC are interesting, but I would like to learn more/see more examples. I also may be associating schema.org too closely with only "web content" use-cases.

paul121 commented 8 months ago

Chatting today:

paul121 commented 8 months ago

Examples in JSON-LD Playground:

Both examples should be roughly equivalent. In general I tried to model as follows:

Some initial thoughts:

aaronc commented 8 months ago

I think the post would generally be the top level element, and then the file would be some collection that is associated with it.

The access rights I believe would be stored outside of the post in the database so we probably don't want to include that here. Likely ditto for the author.

I think it would be helpful to narrow this down to the existing JSON elements that we already have. @blushi do you have a sample JSON blob of what a post would look like (without any special RDF schema) given what we have already defined?

paul121 commented 7 months ago

Re: post as top level, yes I agree. I think I was getting a little hung up on how to use collections. The collection could be a simple sub-element on the post that then references files. But unless we have additional properties to assign to the collection (like a location or access rights), it might just be easier to reference files directly from the post.

Re: author, I see why this wouldn't need to be included, especially if only used for access control. I'm just holding some thought to how this same post schema could be used elsewhere (we would like to reuse for SeaTrees) where the author could be a more useful property. But easy enough for others to add an author as needed.

More generally re: access rights, I agree this should be stored outside the post. Although this makes me wonder how parts of the access logic will be implemented and how it impacts the schema design. Specifically how we ensure private data is not returned via API. Has this been decided?:

It seems there could be some elegance in creating separate documents and maintaining a single, relatively simple implementation for access logic where each IRI has its access logic/owner/etc. stored in the database. This could be reused for future use-cases of anchored data too and seems to be in line with the larger vision of a use-case for data resolvers to implement access control. But it could also make the schema a little more complex, e.g. requiring two documents for a public file with a private location.

A simplified structure could be:

- Post
  - Type - dcmitype:Text / schema:CreativeWork
  - Title
  - Description
  - Date
  - Author
  - Location (perhaps a separate document)
  - Collection (single reference) OR Files (multiple reference), both using dcterms:references / schema:hasPart

- Collection
  - Type - dcmitype:Collection / schema:Collection
  - Files (multiple reference via dcterms:references / schema:hasPart)

- File
  - Type - dcmitype:Image / schema:ImageObject
  - Title
  - Description
  - Credit
  - Location (perhaps a separate document)
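
As a minimal sketch of the structure above, assuming the Dublin Core side of each pair and the "Files (multiple reference)" option: the post is the top-level node and points at each file IRI through dcterms:references, with no intermediate collection. The `build_post` helper, the post IRI, and the date value are hypothetical, and the author is left out because it may end up stored outside the post.

```python
import json

# Namespace IRIs for Dublin Core terms and the DCMI type vocabulary.
CONTEXT = {
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
}


def build_post(post_iri, title, description, date, file_iris):
    """Return a JSON-LD-shaped dict for a post that references files directly."""
    return {
        "@context": CONTEXT,
        "@id": post_iri,
        "@type": "dcmitype:Text",
        "dcterms:title": title,
        "dcterms:description": description,
        "dcterms:date": date,
        # One reference per file; each file is its own document with its own IRI.
        "dcterms:references": [{"@id": iri} for iri in file_iris],
    }


post = build_post(
    "regen:0000.rdf",  # hypothetical post IRI
    "Post Title",
    "Short description of the post",
    "2024-01-01",  # hypothetical date
    ["regen:1111.png", "regen:2222.mp4"],
)
print(json.dumps(post, indent=2))
```

Switching to the collection option would mean replacing the list under dcterms:references with a single reference to a dcmitype:Collection document that carries the file references itself.
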
blushi commented 7 months ago

Will each post be a single JSON-LD document that is parsed to potentially redact private information (the entire post, files, or locations) when requested via API? This would be easiest with general permissions eg: allow all/no files, not allow only some files.

Yes see current implementation of that: https://github.com/regen-network/regen-server/blob/4f12a5b25b1593ffb5dadd36b2005ad76428d0eb/server/routes/posts.ts#L315

Author and privacy settings are indeed currently stored as separate database columns, see https://github.com/regen-network/regen-server/blob/4f12a5b25b1593ffb5dadd36b2005ad76428d0eb/migrations/committed/000047.sql
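
For readers who do not want to open the linked route, here is a rough sketch of the redaction idea only. It is not the regen-server code: the privacy values, the `redact_post` name, and the `viewer_can_see_private` flag are all hypothetical, and the real column values may differ.

```python
import copy

# Hypothetical privacy values; the real column values in regen-server may differ.
PUBLIC = "public"
PRIVATE = "private"
PRIVATE_FILES = "private_files"
PRIVATE_LOCATIONS = "private_locations"


def redact_post(post, privacy, viewer_can_see_private):
    """Return a copy of the post with private parts removed for this viewer."""
    if viewer_can_see_private or privacy == PUBLIC:
        return copy.deepcopy(post)
    if privacy == PRIVATE:
        # The entire post content is private: return nothing.
        return None
    redacted = copy.deepcopy(post)
    if privacy == PRIVATE_FILES:
        # Hide the files, and with them their locations.
        redacted["files"] = []
    elif privacy == PRIVATE_LOCATIONS:
        # Keep the files but drop each file's location.
        for f in redacted["files"]:
            f.pop("location", None)
    else:
        raise ValueError(f"unknown privacy setting: {privacy!r}")
    return redacted


post = {
    "title": "Post Title",
    "files": [
        {"iri": "regen:1111.png", "location": "POINT(1 2)"},
        {"iri": "regen:3333.txt"},
    ],
}
print(redact_post(post, PRIVATE_LOCATIONS, viewer_can_see_private=False))
```

Because the setting applies to the whole post, this only supports the general cases (all files or no files, all locations or none), not hiding a single file.
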

Also, how will the individual files be referenced/stored? Will each file get an IRI with content hash anchored on-chain, separate from the post?

Yes this is what I was thinking about.

We don't need to store a location for a post itself, only for the individual files.

I think it would be helpful to narrow this down to the existing JSON elements that we already have. @blushi do you have a sample JSON blob of what a post would look like (without any special RDF schema) given what we have already defined?

I had something like this in mind for the post json contents:

paul121 commented 7 months ago

Here is a simple JSON. Includes a file for each type that is listed in the figma design: "Supported file types include text, spreadsheets, images and video files"

{
    "title": "Post Title",
    "comment": "Short comment about the post",
    "files": [
        {
            "iri": "regen:1111.png",
            "name": "herding.png",
            "description": "Image description",
            "location": "POINT(1 2)",
            "credit": "Photographer name"
        },
        {
            "iri": "regen:2222.mp4",
            "name": "herding.mp4",
            "description": "Video description",
            "location": "POINT(1 2)",
            "credit": "Photographer name"
        },
        {
            "iri": "regen:3333.txt",
            "name": "textfile.txt",
            "description": "Text description"
        },
        {
            "iri": "regen:4444.csv",
            "name": "spreadsheet.csv",
            "description": "Spreadsheet description"
        }
    ]
}
paul121 commented 7 months ago

location type: file geolocation, no specific location (i.e. file associated with the project location), or specific location (ref: https://www.figma.com/file/Bksz1JeDYxQVIXdI46EgPT/Project-Posts?type=design&node-id=1410-76798&mode=design&t=CijEQEobERpuxGR4-0). This could also be retrieved programmatically, so I'm not sure if it should be stored in the post contents; this will be useful when we support editing posts.

Yeah this is interesting. It could be retrieved programmatically, but storing it on the post would make future indexing with the location much easier. And only require the location to be extracted from the image once when creating the post/file.

Seeing the above json, a couple ideas:

These things might not be as necessary for this initial implementation of project updates backed by regen-server, but considering this could be a standard for project updates more generally, these are small things that would go a long way towards making project updates more standardized.

aaronc commented 7 months ago

So if we used Dublin Core, we could do the following mappings:

Seems like schema.org also has a pretty similar set of items. I still feel like I'm lacking a good understanding of what either of these frameworks would really get us, to the point where I'm almost inclined to just define our own properties in the regen schema namespace.

paul121 commented 7 months ago

location -> maybe coverage or spatial?

It looks like spatial is recommended. Although I'm curious to see if there is a common convention for how to include WKT within geospatial/geosparql contexts.

files -> not finding a mapping

Above I used dcterms:references and schema:hasPart for this.
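
To make the mapping concrete, here is a sketch of a JSON-LD context that covers only the terms discussed so far (title, description, spatial, references), plus a check for keys that still have no mapping. The `CONTEXT` contents and the `unmapped_keys` helper are assumptions for illustration, and the check is a plain dictionary walk, not a JSON-LD processor.

```python
# Assumed term mappings, limited to the ones discussed in this thread.
CONTEXT = {
    "dcterms": "http://purl.org/dc/terms/",
    "iri": "@id",  # keyword alias so the plain "iri" key becomes the node identifier
    "title": "dcterms:title",
    "description": "dcterms:description",
    "location": "dcterms:spatial",
    "files": "dcterms:references",
}


def unmapped_keys(node, context=CONTEXT):
    """Collect keys, at any depth, that the context does not define."""
    missing = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key not in context:
                missing.add(key)
            missing |= unmapped_keys(value, context)
    elif isinstance(node, list):
        for item in node:
            missing |= unmapped_keys(item, context)
    return missing


post = {
    "title": "Post Title",
    "comment": "Short comment about the post",
    "files": [
        {
            "iri": "regen:1111.png",
            "name": "herding.png",
            "description": "Image description",
            "location": "POINT(1 2)",
            "credit": "Photographer name",
        }
    ],
}
print(sorted(unmapped_keys(post)))
```

Keys reported here would be dropped when a JSON-LD processor expands the document with this context, so each one needs either a Dublin Core term or a term in the regen namespace.
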

paul121 commented 7 months ago

location -> maybe coverage or spatial?

It looks like spatial is recommended. Although I'm curious to see if there is a common convention for how to include WKT within geospatial/geosparql contexts.

So GeoSPARQL suggests that ontologies specifically import the geo:Geometry class to describe geometries rather than use other simple encoding schemes. This is described with various examples in the rationale for the Geometry extension.

Interestingly, they also include an annex providing alignments of GeoSPARQL to other ontologies. This includes an alignment to schema.org and Dublin Core.

I think the TLDR is that wherever we want to include a "location" we should use a geo:hasGeometry property to reference a geo:Geometry class with a geo:asWKT property asserting the WKT serialization of a given geometry. This is the equivalent of dcterms:spatial. They provide a nice demo dataset that actually uses other Dublin Core properties, too: https://github.com/opengeospatial/ogc-geosparql/blob/f98b6e4b3bd9de62afe5c2a2ffd81639917d79ac/examples/demo-dataset.ttl#L256-L278
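
A sketch of what that could look like for a single file, written as a JSON-LD-shaped Python dict. The `file_with_geometry` helper is hypothetical, and modeling the geometry as a nested blank node rather than a node with its own IRI is an assumption.

```python
import json

# GeoSPARQL ontology namespace.
GEO = "http://www.opengis.net/ont/geosparql#"


def file_with_geometry(file_iri, wkt):
    """Return a JSON-LD-shaped dict linking a file to a WKT geometry."""
    return {
        "@context": {"geo": GEO},
        "@id": file_iri,
        "geo:hasGeometry": {
            "@type": "geo:Geometry",
            # Typed literal: the WKT string carries the geo:wktLiteral datatype.
            "geo:asWKT": {"@type": "geo:wktLiteral", "@value": wkt},
        },
    }


doc = file_with_geometry("regen:1111.png", "POINT(1 2)")
print(json.dumps(doc, indent=2))
```
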

They also provide an example query to find features with a geo:asWKT serialization within a bounding box. This would map quite well to files under the project post, just consider that feature == file. https://opengeospatial.github.io/ogc-geosparql/geosparql11/spec.html#C.2.2.2
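
To illustrate the "feature == file" idea without a triple store, here is a plain-Python stand-in for a bounding-box lookup over the simple post JSON. It is not a GeoSPARQL query: `parse_point` and `files_in_bbox` are hypothetical helpers, and the parser only accepts 2D WKT points written with plain decimal numbers.

```python
import re

# Matches only 2D WKT points such as "POINT(1 2)" or "POINT (1.5 -2)".
_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_point(wkt):
    """Parse a 2D WKT point into an (x, y) tuple of floats."""
    match = _POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a 2D WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def files_in_bbox(files, min_x, min_y, max_x, max_y):
    """Return the IRIs of files whose point location lies inside the box."""
    hits = []
    for f in files:
        wkt = f.get("location")
        if wkt is None:
            continue  # files without a location never match
        x, y = parse_point(wkt)
        if min_x <= x <= max_x and min_y <= y <= max_y:
            hits.append(f["iri"])
    return hits


files = [
    {"iri": "regen:1111.png", "location": "POINT(1 2)"},
    {"iri": "regen:2222.mp4", "location": "POINT(10 20)"},
    {"iri": "regen:3333.txt"},
]
print(files_in_bbox(files, 0, 0, 5, 5))
```
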

aaronc commented 7 months ago

Should we do a vote on Dublin core vs schema.org vs neither?

aaronc commented 7 months ago

Also what will our strategy be for ordered lists? An order property or an actual RDF list?

paul121 commented 6 months ago

Here is a pass at using LinkML to model the schema for project posts w/ some explanation of the approach I took: https://gist.github.com/paul121/1d83c0d4dcdf06c3bcff44a4c42cffd7

Should we do a vote on Dublin core vs schema.org vs neither?

I would vote for DC, primarily because I continue seeing it used in various places (semantic OGC standards, FAIR data), and it allows us to leverage a standard without the scope-creep and additional meaning that may come with schema.org. This project post use-case is so simple it's hard to argue that any vocabulary will "give us much" right now. But eventually when we do have Regen/ecological domain specific concepts it will likely be better to create our own terms for those specific things rather than try to make schema.org fit. Ideally DC can be a framework to help build out these domain specific concepts.

Also what will our strategy be for ordered lists? An order property or an actual RDF list?

I'm curious how important the order is for semantics. Can we depend on the data resolver to return the JSON-LD document the same as was anchored or is that too fragile? I describe in the gist, it's quite elegant just referencing Regen IRIs as subjects + objects without the need for additional blank/list nodes. But we could add a simple order property as well.

blushi commented 6 months ago

Here is a pass at using LinkML to model the schema for project posts w/ some explanation of the approach I took: https://gist.github.com/paul121/1d83c0d4dcdf06c3bcff44a4c42cffd7

Thanks @paul121 looks great!

Should we do a vote on Dublin core vs schema.org vs neither?

I would vote for DC, primarily because I continue seeing it used in various places (semantic OGC standards, FAIR data), and it allows us to leverage a standard without the scope-creep and additional meaning that may come with schema.org. This project post use-case is so simple it's hard to argue that any vocabulary will "give us much" right now. But eventually when we do have Regen/ecological domain specific concepts it will likely be better to create our own terms for those specific things rather than try to make schema.org fit. Ideally DC can be a framework to help build out these domain specific concepts.

Agreed

Also what will our strategy be for ordered lists? An order property or an actual RDF list?

I'm curious how important the order is for semantics. Can we depend on the data resolver to return the JSON-LD document the same as was anchored or is that too fragile? I describe in the gist, it's quite elegant just referencing Regen IRIs as subjects + objects without the need for additional blank/list nodes. But we could add a simple order property as well.

I believe having some order property would be safer.
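
A small sketch of the order-property option, assuming each file reference carries an explicit integer. The `regen:order` property name and the `ordered_file_iris` helper are hypothetical; nothing in this thread fixes the name.

```python
def ordered_file_iris(file_refs):
    """Return file IRIs sorted by their explicit order value."""
    return [ref["@id"] for ref in sorted(file_refs, key=lambda ref: ref["regen:order"])]


# The references arrive in arbitrary order, as they might after a round trip
# through a store that does not preserve array order.
file_refs = [
    {"@id": "regen:3333.txt", "regen:order": 3},
    {"@id": "regen:1111.png", "regen:order": 1},
    {"@id": "regen:2222.mp4", "regen:order": 2},
]
print(ordered_file_iris(file_refs))
```

With this approach the display order does not depend on how the document is serialized or returned. The alternative, a JSON-LD `@list`, keeps the order in the graph itself but expands to an RDF list built from rdf:first/rdf:rest blank nodes, which is the extra structure discussed above.
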