Define schema for project update posts

blushi commented 9 months ago

We need to define the schema for project posts content which should include (TBC):

- title (110 char max)
- comment
- optional list of files: file name, optional description, optional photo credit, location

Privacy settings: The entire post content can be private. The files can be private. The file locations can be private.
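A minimal sketch of the fields listed above, written as Python dataclasses. Only the 110-character title limit and the field list come from the issue; the class names, the `PrivacySettings` flag names, the choice to treat `title` as mandatory, and the string-typed optional `location` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

TITLE_MAX_CHARS = 110  # limit stated in the issue description


@dataclass
class PostFile:
    name: str
    description: Optional[str] = None  # optional per the issue
    credit: Optional[str] = None       # optional photo credit
    location: Optional[str] = None     # encoding not decided here (assumed a string)


@dataclass
class PrivacySettings:
    # Hypothetical flag names for the three privacy levels in the issue.
    post_private: bool = False
    files_private: bool = False
    locations_private: bool = False


@dataclass
class ProjectPost:
    title: str
    comment: str
    files: List[PostFile] = field(default_factory=list)
    privacy: PrivacySettings = field(default_factory=PrivacySettings)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty when valid)."""
        errors = []
        if not self.title:
            errors.append("title is required")  # assumption: title is mandatory
        elif len(self.title) > TITLE_MAX_CHARS:
            errors.append(f"title exceeds {TITLE_MAX_CHARS} characters")
        for i, f in enumerate(self.files):
            if not f.name:
                errors.append(f"files[{i}].name is required")
        return errors
```

Calling `ProjectPost(title="Site visit", comment="...").validate()` returns an empty list, while an over-long title yields a single error string.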

blushi commented 9 months ago

Hey team! Please add your planning poker estimate with Zenhub @aaronc @blushi

aaronc commented 8 months ago

I discussed this briefly with @blushi today.

Here are my thoughts:

- let's use either Dublin Core or schema.org as much as possible for properties. I was suggesting in #79 that we prefer Dublin Core because of more usage in the scientific domain but would be happy to hear others' opinions. Title, description, and probably locations and files should be covered somewhat by both schemas already
- let's use WKT #78 for locations
- let's not overuse lists #81 for ease of queryability - also open to discussion

blushi commented 8 months ago

cc/ @paul121

paul121 commented 8 months ago

+1 to WKT and fewer linked-lists :+1:

The choice for the standard/schema is interesting. Generally I've been thinking it would make sense to use schema.org for these project updates but I really haven't given it much thought up until now. I'm not very familiar with Dublin Core and just doing some research now, but realizing I have seen the DC prefix used in various places (@prefix dc: <http://purl.org/dc/elements/1.1/>.), so perhaps I am more familiar than I thought. Most of DC vocabulary seems to be included here: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

Some of my thoughts:

- Up until now... I had been thinking it would be nice to leverage schema.org datasets within the project update, perhaps as the "optional list of files". But I don't think this would work as the "project update post" itself. Perhaps something from the schema.org Article/SocialMediaPosting/BlogPost hierarchy could be used for the post document itself.
- Both Schema.org and DC can be used across a wide range of domains, but schema.org seems to have many many more specific terms/properties. In some ways this is nice, many decisions have already been made, but it also feels like this could be constraining. This aspect of schema.org seems to be somewhat conflicting with the DC concept of Open World Design. I like the idea that we start with a simple foundation that can be expanded over time in a non-breaking way. But I also acknowledge that we have short-term needs and need to make some decisions now, and schema.org might just be "easier" right now.
- I generally agree that it would be nice to use DC and Darwin Core instead of schema.org, especially when there are solid conventions other domains are already using. Simple things like title, description, author, etc are certainly covered by DC. There are some very broad classes/types for Dataset, StillImage, Event (see here) but these don't have their own set of specific properties like their equivalents in schema.org. I'd like to see more examples of these DC classes and how they are used.
- That said, a nice side effect of schema.org is that it seems to be used for SEO and "web content" use-cases. This might make it easier to integrate project updates into other search feeds or embeddings on other web pages.
  - ...but I also feel like the schema.org metadata required for these use-cases could easily be generated from the lower level standard we decide on. We don't necessarily need to use the schema.org standard as the one that project update documents are stored in.
  - In fact, it might be better that the schema.org representation is a "wrapper" of some kind. I'm thinking about how a project update is anchored on-chain but actually accessed or viewed via other means. It could be shared via the Regen Marketplace or from a website/webapp specific to that project developer. The schema.org representation would differ for each one.
- Of course, we can combine parts of multiple standards. This is likely better than re-creating our own.

I'm starting to wonder... are project updates meant to be "web content" in their native form? Or are they really meant to be (semi) scientific observations, claims, datasets, etc? I may need a refresher on the scope/requirements for the Registry Web App. But in a general sense I think I'm leaning towards structuring or conceptualizing these as more "scientific" in their native form, and thus DC and DWC are interesting, but I would like to learn more/see more examples. I also may be associating schema.org too closely with only "web content" use-cases.

paul121 commented 8 months ago

Chatting today:

- Investigate integrating RDFa into the React app. Does this require SSR to actually be effective?
- @paul121 to work on example JSON-LD using both Schema.org and DC. We can use this to compare the standards and evaluate what might be missing from one or the other.
- Also consider adding a "root" location to the post, not only location on files.

paul121 commented 8 months ago

Examples in JSON-LD Playground:

Dublin Core (dcterms and dcmitype prefixes):

... Example"}, "dcterms:creator": ["Sally Jane", "regen1234address1234xyz"], "dcterms:contributor": ["Bob Smith", "regen1234address5678xyz"], "dcterms:source": " .... The long form text of this post."},
{"@id": "ex:private_post", "dcterms:type": "dcmitype:Text", "dcterms:format": "text/html", "dcterms:accessRights": "regen:privateAccess", "dcterms:title": "Private Post", "dcterms:description": "A private post for only project admins.", "dcterms:abstract": "A brief summary about the site visit.", "dcterms:created": "2023-01-01", "dcterms:creator": ["Sally jane", "regen1234address1234xyz"], "dcterms:source": " .... The long form text of this post."},
{"@id": "ex:private_file", "dcterms:type": "dcmitype:Image", "dcterms:format": "image/png", "dcterms:accessRights": "regen:privateAccess", "dcterms:title": "Private Image", "dcterms:description": "A private post for only project admins.", "dcterms:created": "2023-01-01", "dcterms:creator": ["Jane Doe", "regen1234address1234xyz"], "dcterms:source": "regen:1234.rdf"},
{"@id": "ex:public_event", "dcterms:type": "dcmitype:Event", "dcterms:accessRights": "regen:publicAccess", "dcterms:title": "Project Site Visit (Event)", "dcterms:description": "A proper Site Visit Event.", "dcterms:date": "2023-01-01", "dcterms:creator": ["Sally Jane", "regen1234address1234xyz"]}]}

context: {"@context": {"dcterms": "http://purl.org/dc/terms/", "dcmitype": "http://purl.org/dc/dcmitype/", "ex": "http://example.org/vocab#", "xsd": "http://www.w3.org/2001/XMLSchema#", "ex:contains": {"@type": "@id"}}}

Schema.org (largely based off CreativeWork):

... Example"}, "schema:hasPart": [{"@id": "ex:public_post"}, {"@id": "ex:private_post"}, {"@id": "ex:private_file"}]},
{"@id": "@ex:public_post", "@type": "schema:Article", "schema:name": "A simple post (ext)", "schema:description": "A simple post", "schema:datePublished": "2023-01-01", "schema:conditionsOfAccess": "regen:publicAccess", "schema:author": ["Sally Jane", "regen1234address1234xyz"], "schema:spatial": {"@id": "ex:post_location", "schema:conditionsOfAccess": "regen:privateAccess", "geo:hasGeometry": "POLYGON((-77.089005 38.913574,-77.029953 38.913574,-77.029953 38.886321,-77.089005 38.886321,-77.089005 38.913574))"}, "schema:text": ".... The longform text."},
{"@id": "@ex:private_post", "@type": "schema:Article", "schema:name": "A simple post (text)", "schema:description": "A simple post", "schema:datePublished": "2023-01-01", "schema:conditionsOfAccess": "regen:publicAccess", "schema:author": ["Sally Jane", "regen1234address1234xyz"], "schema:spatial": {"@id": "ex:post_location", "schema:conditionsOfAccess": "regen:publicAccess", "geo:hasGeometry": "POLYGON((-77.089005 38.913574,-77.029953 38.913574,-77.029953 38.886321,-77.089005 38.886321,-77.089005 38.913574))"}, "schema:text": ".... The longform text."},
{"@id": "@ex:private_file", "@type": "schema:ImageObject", "schema:name": "A private image", "schema:description": "A private image", "schema:datePublished": "2023-01-01", "schema:conditionsOfAccess": "regen:privateAccess", "schema:author": ["Sally Jane", "regen1234address1234xyz"], "schema:spatial": {"@id": "ex:post_location", "schema:conditionsOfAccess": "regen:privateAccess", "geo:hasGeometry": "POLYGON((-77.089005 38.913574,-77.029953 38.913574,-77.029953 38.886321,-77.089005 38.886321,-77.089005 38.913574))"}, "schema:text": ".... The longform text."}]}

Both examples should be roughly equivalent. In general I tried to model as follows:

- Both standards have a concept of a Collection. I'm using this for what would be a "Post" that contains multiple files. The collection references files via dcterms:references or schema:hasPart.
- Both examples could let individual files be "Posts" themselves (a single-file post). Or we could choose to always use a "Collection" wrapper.
- The "Collection" and sub-file items can have an access restriction (dcterms:accessRights vs schema:conditionsOfAccess with value regen:public/privateAccess) as well as a location (both standards use spatial to denote location).
- I'm trying to use WKT literals here. The encoding might be slightly wrong in both examples, but it still gets the point across. Each location object could use the same access restriction property to denote more granular access restrictions.

Some initial thoughts:

- Schema.org schema:conditionsOfAccess is designed to be a literal text value. But we probably want to reference some kind of enum that denotes access levels. That could be plain text but I think it is proper to make it under the regen: prefix? I don't see any other good "access" properties that are not explicitly licenses or rights in schema.org. We could make our own too. Meanwhile, dcterms:accessRights is described as being for this use-case, and is generalized to be text or a subject reference.
- I do like the generalized aspect of DC. It feels simpler and it was easier to choose what properties would be relevant. Schema.org has so much to sift through. I don't know if a simple Text post would be a schema.org Article, Blog, Social Media, and then all the sub properties...
- But maybe Schema.org is a better choice for this reason: there is simply more semantic meaning attached to terms/classes in that standard. With more time we can better evaluate these.

aaronc commented 8 months ago

I think the post would generally be the top level element, and then the file would be some collection that is associated with it.

The access rights I believe would be stored outside of the post in the database so we probably don't want to include that here. Likely ditto for the author.

I think it would be helpful to narrow this down to the existing JSON elements that we already have. @blushi do you have a sample JSON blob of what a post would look like (without any special RDF schema) given what we have already defined?

paul121 commented 7 months ago

Re: post as top level, yes I agree. I think I was getting a little hung up on how to use collections. The collection could be a simple sub-element on the post that then references files. But unless we have additional properties to assign to the collection (like a location or access rights), it might just be easier to reference files directly from the post.

Re: author, I see why this wouldn't need to be included, especially if only used for access control. I'm just holding some thought to how this same post schema could be used elsewhere (we would like to reuse for SeaTrees) where the author could be a more useful property. But easy enough for others to add an author as needed.

More generally re: access rights, I agree this should be stored outside the post. Although this makes me wonder how parts of the access logic will be implemented and how it impacts the schema design. Specifically, how do we ensure private data is not returned via API? Has this been decided?

- Will each post be a single JSON-LD document that is parsed to potentially redact private information (the entire post, files, or locations) when requested via API? This would be easiest with general permissions eg: allow all/no files, not allow only some files.
- OR could we have separate documents for each part/access level of the post and delegate access logic to the "data resolver" level: given an IRI either return the entire document, or return 403. (iirc existing access logic is implemented using postgraphile, but I'm not sure about the status of the IRI/data resolver endpoint)
- Also, how will the individual files be referenced/stored? Will each file get an IRI with content hash anchored on-chain, separate from the post?

It seems there could be some elegance in creating separate documents and maintaining a single, relatively simple implementation for access logic where each IRI has its access logic/owner/etc stored in the database. This could be reused for future use-cases of anchored data too and seems to be in line with the larger vision of a use-case for data resolvers to implement access control. But it could also make the schema a little more complex eg: requiring two documents for a public file with a private location.
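The separate-documents idea can be sketched roughly as follows: every IRI maps to one document plus access metadata, and the resolver either returns the whole document or refuses. Everything here is hypothetical (the in-memory `STORE`, the IRIs, the `owner`/`private` fields, and the `Forbidden` exception standing in for an HTTP 403); it is not how any existing data resolver is implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class Forbidden(Exception):
    """Stands in for an HTTP 403 response."""


@dataclass
class AnchoredDoc:
    content: dict   # the anchored document itself
    owner: str      # access metadata lives next to the document, not inside it
    private: bool


# Hypothetical store: a public file whose location is a separate, private document.
STORE: Dict[str, AnchoredDoc] = {
    "regen:file1.rdf": AnchoredDoc(
        content={"title": "herding.png", "location": {"@id": "regen:file1-location.rdf"}},
        owner="regen1admin",
        private=False,
    ),
    "regen:file1-location.rdf": AnchoredDoc(
        content={"location": "POINT(1 2)"},
        owner="regen1admin",
        private=True,
    ),
}


def resolve(iri: str, requester: Optional[str]) -> dict:
    """Return the whole document for `iri`, or raise Forbidden."""
    doc = STORE[iri]
    if doc.private and requester != doc.owner:
        raise Forbidden(iri)
    return doc.content
```

The example store also shows the cost mentioned above: a public file with a private location needs two documents, and a client has to make a second `resolve` call (which may fail) to follow the location reference.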

A simplified structure could be:

- Post
  - Type - dcmitype:Text / schema:CreativeWork
  - Title
  - Description
  - Date
  - Author
  - Location (perhaps a separate document)
  - Collection (single reference) OR Files (multiple reference), both using dcterms:references / schema:hasPart

- Collection 
  - Type - dcmitype:Collection / schema:Collection
  - Files (multiple reference via dcterms:references / schema:hasPart)

- File
  - Type - dcmitype:Image / schema:ImageObject
  - Title
  - Description
  - Credit
  - Location (perhaps a separate document)
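As a rough illustration of the Dublin Core variant of this structure, the sketch below builds a post document that references file documents directly through dcterms:references. The helper names and the example IRIs are made up; the prefix URLs are the ones used in the playground example. Author, credit, and location are left out because the thread has not settled on how to model them.

```python
import json

# Prefixes as used in the Dublin Core playground example.
CONTEXT = {
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
}


def file_doc(iri, title, description=None):
    """Build a File node (hypothetical helper)."""
    doc = {"@id": iri, "dcterms:type": "dcmitype:Image", "dcterms:title": title}
    if description is not None:
        doc["dcterms:description"] = description
    return doc


def post_doc(iri, title, description, date, file_iris):
    """Build a Post node that points at its files by IRI only."""
    return {
        "@context": CONTEXT,
        "@id": iri,
        "dcterms:type": "dcmitype:Text",
        "dcterms:title": title,
        "dcterms:description": description,
        "dcterms:date": date,
        "dcterms:references": [{"@id": file_iri} for file_iri in file_iris],
    }


image = file_doc("regen:1111.png", "herding.png", "Image description")
post = post_doc(
    "regen:post1.rdf",  # hypothetical IRI
    "Post Title",
    "Short comment about the post",
    "2023-01-01",
    [image["@id"]],
)
print(json.dumps(post, indent=2))
```

Inserting a Collection node between the post and its files would only add one more level of dcterms:references, which is why it seems worth it only if the collection carries properties of its own.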

blushi commented 7 months ago

Will each post be a single JSON-LD document that is parsed to potentially redact private information (the entire post, files, or locations) when requested via API? This would be easiest with general permissions eg: allow all/no files, not allow only some files.

Yes see current implementation of that: https://github.com/regen-network/regen-server/blob/4f12a5b25b1593ffb5dadd36b2005ad76428d0eb/server/routes/posts.ts#L315

Author and privacy settings are indeed currently stored as separate database columns, see https://github.com/regen-network/regen-server/blob/4f12a5b25b1593ffb5dadd36b2005ad76428d0eb/migrations/committed/000047.sql
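For readers who do not want to open the linked route, the general shape of the redaction approach might look like the sketch below. This is not the regen-server code: the flag names, the `is_admin` parameter, and the `files`/`location` keys are assumptions chosen to match the three privacy levels in the issue description.

```python
import copy
from typing import Optional


def redact_post(
    post: dict,
    *,
    post_private: bool,
    files_private: bool,
    locations_private: bool,
    is_admin: bool,
) -> Optional[dict]:
    """Return the view of `post` a requester may see, or None if it is hidden."""
    if is_admin:
        return copy.deepcopy(post)   # admins see everything
    if post_private:
        return None                  # the entire post is private
    visible = copy.deepcopy(post)    # never mutate the stored document
    if files_private:
        visible.pop("files", None)   # hide all files
    elif locations_private:
        for f in visible.get("files", []):
            f.pop("location", None)  # keep files, hide only their locations
    return visible
```

Because the privacy flags arrive as arguments rather than living in the post, this matches the point that access settings are stored outside the post contents.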

Also, how will the individual files be referenced/stored? Will each file get an IRI with content hash anchored on-chain, separate from the post?

Yes this is what I was thinking about.

We don't need to store a location for a post itself, only for the individual files.

I think it would be helpful to narrow this down to the existing JSON elements that we already have. @blushi do you have a sample JSON blob of what a post would look like (without any special RDF schema) given what we have already defined?

I had something like this in mind for the post json contents:

title
comment
list of files:
- iri (either using @id or a standalone property TBD)
- name
- description
- location as WKT
- credit (for photos)
- location type: file geolocation, no specific location (i.e. the file is associated with the project location), or a specific location (ref: https://www.figma.com/file/Bksz1JeDYxQVIXdI46EgPT/Project-Posts?type=design&node-id=1410-76798&mode=design&t=CijEQEobERpuxGR4-0). This could also be retrieved programmatically, so I'm not sure it should be stored in the post contents, though storing it will be useful when we support editing posts.

paul121 commented 7 months ago

Here is a simple JSON. Includes a file for each type that is listed in the figma design: "Supported file types include text, spreadsheets, images and video files"

{
    "title": "Post Title",
    "comment": "Short comment about the post",
    "files": [
        {
            "iri": "regen:1111.png",
            "name": "herding.png",
            "description": "Image description",
            "location": "POINT(1 2)",
            "credit": "Photographer name"
        },
        {
            "iri": "regen:2222.mp4",
            "name": "herding.mp4",
            "description": "Video description",
            "location": "POINT(1 2)",
            "credit": "Photographer name"
        },
        {
            "iri": "regen:3333.txt",
            "name": "textfile.txt",
            "description": "Text description"
        },
        {
            "iri": "regen:4444.csv",
            "name": "spreadsheet.csv",
            "description": "Spreadsheet description"
        }
    ]
}

paul121 commented 7 months ago

location type: file geolocation, no specific location (i.e. the file is associated with the project location), or a specific location (ref: https://www.figma.com/file/Bksz1JeDYxQVIXdI46EgPT/Project-Posts?type=design&node-id=1410-76798&mode=design&t=CijEQEobERpuxGR4-0). This could also be retrieved programmatically, so I'm not sure it should be stored in the post contents, though storing it will be useful when we support editing posts.

Yeah this is interesting. It could be retrieved programmatically, but storing it on the post would make future indexing by location much easier, and the location would only need to be extracted from the image once, when creating the post/file.

Seeing the above json, a couple ideas:

- Should there be any timestamp included in this post?
- The file type can be derived from the IRI's raw media type. But in JSON-LD it would be convenient to have an explicit Type + Encoding attribute, especially if the schema will vary by file type (eg: only images/videos(?) have "credit").

These things might not be as necessary for this initial implementation of project updates backed by regen-server, but considering this could be a standard for project updates more generally, these are small things that would go a long ways towards making project updates more standardized.
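A small sketch of the "explicit Type + Encoding" idea: derive both from the IRI's extension once, then store them on the file entry. The lookup table, the `type`/`format` key names, and the choice of DCMI types (in particular treating a CSV as dcmitype:Dataset) are assumptions for illustration, covering only the four extensions in the sample JSON.

```python
# Extension -> (DCMI type, media type). Only the sample's file kinds are covered.
EXTENSION_TYPES = {
    "png": ("dcmitype:StillImage", "image/png"),
    "mp4": ("dcmitype:MovingImage", "video/mp4"),
    "txt": ("dcmitype:Text", "text/plain"),
    "csv": ("dcmitype:Dataset", "text/csv"),
}

# Assumption: only images and videos carry a "credit".
CREDITABLE_TYPES = {"dcmitype:StillImage", "dcmitype:MovingImage"}


def describe_file(file: dict) -> dict:
    """Return a copy of `file` with explicit `type` and `format` keys added."""
    ext = file["iri"].rsplit(".", 1)[-1].lower()
    if ext not in EXTENSION_TYPES:
        raise ValueError(f"unsupported file extension in IRI: {file['iri']}")
    dc_type, media_type = EXTENSION_TYPES[ext]
    if "credit" in file and dc_type not in CREDITABLE_TYPES:
        raise ValueError(f"credit is not expected on {dc_type}")
    return dict(file, type=dc_type, format=media_type)


described = describe_file({"iri": "regen:1111.png", "name": "herding.png", "credit": "Photographer name"})
```

With the type stored explicitly, a consumer can pick the right per-type schema without parsing the IRI again.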

aaronc commented 7 months ago

So if we used Dublin Core, we could do the following mappings:

- name, title -> title
- comment, description -> description
- credit -> maybe creator?
- iri -> identifier or make it the subject
- location -> maybe coverage or spatial?
- files -> not finding a mapping

Seems like schema.org also has a pretty similar set of items. I still feel like I'm lacking a good understanding of what either of these frameworks would really get us, to the point where I'm almost inclined to just define our own properties in the regen schema namespace.
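To make the tentative mapping concrete, here is a sketch that renames the plain JSON keys to Dublin Core terms. It is a hand-rolled key rename, not a JSON-LD processor. The `credit` and `location` choices are the "maybe" options from the list, `iri` is treated as the subject (`@id`), and `files` uses dcterms:references as in the earlier Dublin Core example; none of these are settled.

```python
# Plain key -> Dublin Core term (tentative, see the list above).
DC_TERMS = {
    "title": "dcterms:title",
    "name": "dcterms:title",
    "comment": "dcterms:description",
    "description": "dcterms:description",
    "credit": "dcterms:creator",      # "maybe creator?"
    "location": "dcterms:spatial",    # "maybe coverage or spatial?"
    "iri": "@id",                     # make the IRI the subject
    "files": "dcterms:references",    # as used in the earlier DC example
}


def to_dc(node: dict) -> dict:
    """Recursively rename known keys; unknown keys pass through unchanged."""
    out = {}
    for key, value in node.items():
        if isinstance(value, list):
            value = [to_dc(v) if isinstance(v, dict) else v for v in value]
        out[DC_TERMS.get(key, key)] = value
    return out


plain = {
    "title": "Post Title",
    "comment": "Short comment about the post",
    "files": [
        {"iri": "regen:1111.png", "name": "herding.png", "credit": "Photographer name"},
    ],
}
dc = to_dc(plain)
```

Since every plain key maps to exactly one term, the same table could presumably be expressed as a JSON-LD `@context`, which would let the plain JSON stay as the stored form.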

paul121 commented 7 months ago

location -> maybe coverage or spatial?

It looks like spatial is recommended. Although I'm curious to see if there is a common convention for how to include WKT within geospatial/geosparql contexts.

files -> not finding a mapping

Above I used dcterms:references and schema:hasPart for this.

paul121 commented 7 months ago

location -> maybe coverage or spatial?

It looks like spatial is recommended. Although I'm curious to see if there is a common convention for how to include WKT within geospatial/geosparql contexts.

So GeoSPARQL suggests that ontologies specifically import the geo:Geometry class to describe geometries rather than use other simple encoding schemes. This is described with various examples in the rationale for the Geometry extension.

Interestingly, they also include an annex providing alignments of GeoSPARQL to other ontologies. This includes an alignment to schema.org and dublin core.

I think the TLDR is that wherever we want to include a "location" we should use a geo:hasGeometry property to reference a geo:Geometry class with a geo:asWKT property asserting the WKT serialization of a given geometry. This is the equivalent of dcterms:spatial. They provide a nice demo dataset that actually uses other dublin core properties, too: https://github.com/opengeospatial/ogc-geosparql/blob/f98b6e4b3bd9de62afe5c2a2ffd81639917d79ac/examples/demo-dataset.ttl#L256-L278

They also provide an example query to find features with a geo:asWKT serialization within a bounding box. This would map quite well to files under the project post, just consider that feature == file. https://opengeospatial.github.io/ogc-geosparql/geosparql11/spec.html#C.2.2.2
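A sketch of what this could look like for a file, plus a plain-Python stand-in for the bounding-box filter. The helper names are made up, the WKT parser only understands `POINT(x y)`, and the bounding-box check is ordinary Python rather than a GeoSPARQL query.

```python
import re

GEO_CONTEXT = {"geo": "http://www.opengis.net/ont/geosparql#"}

# Matches only simple 2D points such as "POINT(1 2)".
_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def file_with_geometry(iri: str, wkt: str) -> dict:
    """Treat the file as the feature and attach a geo:Geometry with geo:asWKT."""
    return {
        "@context": GEO_CONTEXT,
        "@id": iri,
        "geo:hasGeometry": {
            "@type": "geo:Geometry",
            "geo:asWKT": {"@type": "geo:wktLiteral", "@value": wkt},
        },
    }


def parse_point(wkt: str):
    match = _POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def in_bbox(wkt: str, min_x: float, min_y: float, max_x: float, max_y: float) -> bool:
    """True if the WKT point lies inside (or on the edge of) the box."""
    x, y = parse_point(wkt)
    return min_x <= x <= max_x and min_y <= y <= max_y


doc = file_with_geometry("regen:1111.png", "POINT(1 2)")
wkt_value = doc["geo:hasGeometry"]["geo:asWKT"]["@value"]
```

Here `in_bbox(wkt_value, 0, 0, 5, 5)` is true and `in_bbox(wkt_value, 2, 2, 5, 5)` is false, which is the kind of filter the example query performs over features.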

aaronc commented 7 months ago

Should we do a vote on Dublin core vs schema.org vs neither?

aaronc commented 7 months ago

Also what will our strategy be for ordered lists? An order property or an actual RDF list?

paul121 commented 6 months ago

Here is a pass at using LinkML to model the schema for project posts w/ some explanation of the approach I took: https://gist.github.com/paul121/1d83c0d4dcdf06c3bcff44a4c42cffd7

Should we do a vote on Dublin core vs schema.org vs neither?

I would vote for DC, primarily because I continue seeing it used in various places (semantic OGC standards, FAIR data), and it allows us to leverage a standard without the scope-creep and additional meaning that may come with schema.org. This project post use-case is so simple it's hard to argue that any vocabulary will "give us much" right now. But eventually when we do have Regen/ecological domain specific concepts it will likely be better to create our own terms for those specific things rather than try to make schema.org fit. Ideally DC can be a framework to help build out these domain specific concepts.

Also what will our strategy be for ordered lists? An order property or an actual RDF list?

I'm curious how important the order is for semantics. Can we depend on the data resolver to return the JSON-LD document exactly as it was anchored, or is that too fragile? As I describe in the gist, it's quite elegant to just reference Regen IRIs as subjects + objects without the need for additional blank/list nodes. But we could add a simple order property as well.

blushi commented 6 months ago

Here is a pass at using LinkML to model the schema for project posts w/ some explanation of the approach I took: https://gist.github.com/paul121/1d83c0d4dcdf06c3bcff44a4c42cffd7

Thanks @paul121 looks great!

Should we do a vote on Dublin core vs schema.org vs neither?

I would vote for DC, primarily because I continue seeing it used in various places (semantic OGC standards, FAIR data), and it allows us to leverage a standard without the scope-creep and additional meaning that may come with schema.org. This project post use-case is so simple it's hard to argue that any vocabulary will "give us much" right now. But eventually when we do have Regen/ecological domain specific concepts it will likely be better to create our own terms for those specific things rather than try to make schema.org fit. Ideally DC can be a framework to help build out these domain specific concepts.

Agreed

Also what will our strategy be for ordered lists? An order property or an actual RDF list?

I'm curious how important the order is for semantics. Can we depend on the data resolver to return the JSON-LD document exactly as it was anchored, or is that too fragile? As I describe in the gist, it's quite elegant to just reference Regen IRIs as subjects + objects without the need for additional blank/list nodes. But we could add a simple order property as well.

I believe having some order property would be safer.
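A minimal sketch of the order-property option. In JSON-LD a plain array is an unordered set unless it is wrapped in `@list`, so an explicit index on each file lets a consumer restore the intended sequence however the document comes back. The property name `order` and both helper names are hypothetical.

```python
def with_order(files: list) -> list:
    """Stamp each file with its position at write time."""
    return [dict(f, order=i) for i, f in enumerate(files)]


def in_order(files: list) -> list:
    """Restore the intended sequence at read time."""
    return sorted(files, key=lambda f: f["order"])


original = [
    {"iri": "regen:1111.png"},
    {"iri": "regen:2222.mp4"},
    {"iri": "regen:3333.txt"},
]
stored = with_order(original)
scrambled = list(reversed(stored))  # simulate a store that does not keep array order
restored = [f["iri"] for f in in_order(scrambled)]
```

After the round trip, `restored` lists the three IRIs in their original order, without needing RDF list nodes.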

regen-network / regen-registry-standards

Define schema for project update posts #82