Disclaimer

Participation by NIST in the creation of the documentation of mentioned software is not intended to imply a recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that any specific software is necessarily the best available for the purpose.

Question

In UCO, what rdf:types should be assigned to an ObservableObject that is a downloadable file?

This need arises from at least two directions:

Dataset distribution: While posting reference data, download sites will often host images for delivery over HTTP(S). At least one RDF-based model, DCAT, encourages storing a reference to the downloadable URL as an rdfs:Resource IRI, and treating that IRI as a file: just a file that hasn't been downloaded yet. See property dcat:downloadURL.

Software supply chain: In software supply chain representation, metadata about software packages frequently includes a download URL and hashes corresponding to that URL. See, for example, this metadata manifest for case-utils' recent release, retrieved and manually trimmed from this API endpoint:

{
    "urls": [
        {
            "digests": {
                "blake2b_256": "d4f928260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9",
                "md5": "ade3eae9b5a5ef0fedfbc065abf79ae7",
                "sha256": "daf617d96b1dc74b2953f82067365b1858cbe0e9d4a9d2659091f23951129bc1"
            },
            "filename": "case_utils-0.10.0-py3-none-any.whl",
            "md5_digest": "ade3eae9b5a5ef0fedfbc065abf79ae7",
            "size": 537812,
            "upload_time_iso_8601": "2023-03-31T16:12:49.931860Z",
            "url": "https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl"
        }
    ]
}
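
To make concrete what the digests in this manifest assert, here is a minimal Python sketch (standard library only) that checks already-downloaded bytes against a `digests` dictionary shaped like the one above. The function name `check_digests` is hypothetical, and reading `blake2b_256` as BLAKE2b with a 32-byte digest is an assumption about the key's meaning, not something the manifest itself states.

```python
import hashlib


def check_digests(data: bytes, digests: dict[str, str]) -> dict[str, bool]:
    """Return, per digest name, whether `data` hashes to the listed value."""
    # Assumption: "blake2b_256" means BLAKE2b with a 32-byte (256-bit) digest.
    hashers = {
        "blake2b_256": lambda b: hashlib.blake2b(b, digest_size=32),
        "md5": hashlib.md5,
        "sha256": hashlib.sha256,
    }
    results = {}
    for name, expected in digests.items():
        if name not in hashers:
            continue  # Unknown method: skip rather than guess.
        results[name] = hashers[name](data).hexdigest() == expected.lower()
    return results
```

Note that the check takes only bytes and digests; nothing in it refers to the URL the bytes came from.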

It seems this is something UCO is designed to be able to represent, but the classes and properties that look like the best candidates for doing so have not been exercised much. Some of them are not documented, and some have fairly lax constraints left over from the prototyping days pre-dating the ObservableObject subclass hierarchy.

First, because of the representation suggested by DCAT, which is not distinct to DCAT, I am specifically interested in how to represent the url resource in that JSON dictionary as an IRI, without being wholly reliant on duck-typing:

<https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl>
    a uco-core:UcoObject ;
    .

(As an aside, this also disregards the UCO guidance that IRIs end with UUIDs. In at least the DCAT application, this is just going to have to be the case. We could consider this an exercise of UCO as an enricher of existing knowledge bases. The implementation for UCO Issue 430, where the UUID requirement was introduced, specifically allowed for this use case.)

From CASE/UCO duck-typing, I know I want this to have a URLFacet, FileFacet, and ContentDataFacet. So, here is how I would add Facets to say that that UcoObject behaves like an observable:File, observable:URL, and observable:ContentData. (This next block, and code blocks further in this post, should be read as additive on top of what is written in preceding blocks.)

<https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl>
    uco-core:hasFacet
        kb:ContentDataFacet-2e1a9cee-1353-471d-b318-92fc9da7280b ,
        kb:FileFacet-82fd5577-bed0-4f7f-ba3f-08d3583c2efb ,
        kb:URLFacet-a78e2688-44b8-4eb9-b474-33c5e2b3c32a
        ;
    .

kb:ContentDataFacet-2e1a9cee-1353-471d-b318-92fc9da7280b
    a uco-observable:ContentDataFacet ;
    uco-observable:hash
        kb:Hash-90ae9698-8259-4a48-a74c-c9c149b447df ,
        kb:Hash-bb8e8808-e7f6-4759-81b5-2df00865d2bc ,
        kb:Hash-cb51e845-086c-43a7-99ef-6d44569e2143
        ;
    uco-observable:sizeInBytes 537812 ;
    .

kb:FileFacet-82fd5577-bed0-4f7f-ba3f-08d3583c2efb
    a uco-observable:FileFacet ;
    uco-observable:fileName "case_utils-0.10.0-py3-none-any.whl" ;
    uco-observable:sizeInBytes 537812 ;
    .

kb:Hash-90ae9698-8259-4a48-a74c-c9c149b447df
    a uco-types:Hash ;
    rdfs:comment "Beside the point of the illustration - this Hash uses a method not currently encoded in UCO, but this is consistent with UCO's semi-open vocabulary pattern."@en ;
    uco-types:hashMethod "blake2b_256" ;
    uco-types:hashValue "d4f928260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9"^^xsd:hexBinary ;
    .

kb:Hash-bb8e8808-e7f6-4759-81b5-2df00865d2bc
    a uco-types:Hash ;
    uco-types:hashMethod "MD5"^^uco-vocabulary:HashNameVocab ;
    uco-types:hashValue "ade3eae9b5a5ef0fedfbc065abf79ae7"^^xsd:hexBinary ;
    .

kb:Hash-cb51e845-086c-43a7-99ef-6d44569e2143
    a uco-types:Hash ;
    uco-types:hashMethod "SHA256"^^uco-vocabulary:HashNameVocab ;
    uco-types:hashValue "daf617d96b1dc74b2953f82067365b1858cbe0e9d4a9d2659091f23951129bc1"^^xsd:hexBinary ;
    .

kb:URLFacet-a78e2688-44b8-4eb9-b474-33c5e2b3c32a
    a uco-observable:URLFacet ;
    uco-observable:fullValue "https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl" ;
    .
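
The Hash nodes above were written by hand. A minimal Python sketch of deriving them from the manifest's `digests` dictionary follows. The function name `hashes_to_turtle` and the lookup table are hypothetical; the table contains only the two vocabulary spellings used in this post (MD5, SHA256), and any other method name falls back to a plain string literal, as the blake2b_256 Hash above does. Prefix declarations are assumed to exist elsewhere in the graph file.

```python
import uuid

# Only the two HashNameVocab spellings that appear in this post.
VOCAB_NAMES = {"md5": "MD5", "sha256": "SHA256"}


def hashes_to_turtle(digests: dict[str, str]) -> str:
    """Render one uco-types:Hash node per digest, as Turtle text."""
    blocks = []
    for name, value in sorted(digests.items()):
        if name in VOCAB_NAMES:
            method = f'"{VOCAB_NAMES[name]}"^^uco-vocabulary:HashNameVocab'
        else:
            # Semi-open vocabulary: keep the source's name as a plain string.
            method = f'"{name}"'
        blocks.append(
            f"kb:Hash-{uuid.uuid4()}\n"
            f"    a uco-types:Hash ;\n"
            f"    uco-types:hashMethod {method} ;\n"
            f'    uco-types:hashValue "{value}"^^xsd:hexBinary ;\n'
            f"    .\n"
        )
    return "\n".join(blocks)
```

Each call mints fresh random UUIDs, so the node IRIs will differ from the ones shown in this post.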

Prior practice within UCO suggests that the rdf:type of the UcoObject should be at least uco-observable:ObservableObject, though nothing in the classes and properties used above is encoded to require this:

<https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl>
    a uco-observable:ObservableObject ;
    .

Now, it's not clear whether these types would also be correct to assert. (As noted in prior Issues, UCO's usage of duck typing is not inferentially bidirectional. Being a File implies having a FileFacet. Behaving like a File, i.e. having a FileFacet, does not imply that the UcoObject is a File. An Issue coming soon will propose encoding this.)

<https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl>
    a
        uco-observable:ContentData ,
        uco-observable:File ,
        uco-observable:URL
        ;
    .

By UCO's current encoding, those three OWL Classes would be fine to use concurrently on the same node, because UCO does not define anything about them as disjoint from one another. (UCO's only current disjointness axiom separates uco-core:UcoObject from uco-core:UcoInherentCharacterizationThing, the superclass of uco-core:Facet.) But, is this a practice UCO should encourage users to adopt, or discourage users from adopting?

There are at least the following non-trivialities with using the three classes concurrently.

Confusion with rdfs:Resource

One point that makes handling this not obvious is that UCO's observable:URL class in this case becomes a bit confused with the RDF foundational class rdfs:Resource.

Potential decision on inherence of files

Another point is that it is possible UCO could go the route of defining File as inherent to FileSystem. (This would take a healthy amount of discussion, as there would be significant pros and cons on this. A proposal on this isn't coming today.) If this decision were adopted, could a URL also be a File? Or do we need a Relationship defined to represent that a URL is (or was at some time), say, a projection of, or access channel to, a file on a file system, or an object in an S3 bucket (as done by Digital Corpora)?

Difference in occurrence of ContentData and URL

Another point is that observable:ContentData doesn't have any relationship (that is, subclass-based or predicate-based) with other ObservableObjects encoded in the ontology. Does this snippet of the JSON above ...

{
    "digests": {
        "blake2b_256": "d4f928260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9",
        "md5": "ade3eae9b5a5ef0fedfbc065abf79ae7",
        "sha256": "daf617d96b1dc74b2953f82067365b1858cbe0e9d4a9d2659091f23951129bc1"
    }
}

... describe the URL? Or, does it describe some more abstract content-signature pattern? If the latter, how does this relate to the URL? Would a File relate the same way?

kb:ContentDataPattern-8bd128e1-f096-4e6f-8c68-0af88d4df52e
    a
        uco-observable:ContentData,
        uco-pattern:Pattern
        ;
    # uco-core:hasFacet links to a ContentDataFacet, linking to the Hashes ...
    .

(The UCO Pattern namespace has not, to date, been demonstrated in any public CASE or UCO examples.)

Payload reference URL

Separately but relatedly, there is a property observable:dataPayloadReferenceURL. It lacks a definition (rdfs:comment) in the ontology, and only constrains its range to observable:ObservableObject. It has not been demonstrated on the CASE website. It has been demonstrated a few times in CASE-Examples:

Oresteia.json, showing where to download an attachment (which is only typed as observable:ContentData) from an external website.
message.json, showing where to download attachments (each of which is only typed as observable:ContentData) from an external website.
network_connection.json, showing where a file is stored in a locally-mounted file system. Though, this specific example portion has an inlined design question.

CASE-Corpora currently does not demonstrate observable:dataPayloadReferenceURL, but I'm considering adding a shape (scoped only to CASE-Corpora for now) that tailors its usage for "Downloadable files," a class where each member of the class reflexively treats its own IRI as its URL fullValue and its content-data dataPayloadReferenceURL. Within CASE-Corpora, DCAT influences this decision. This Issue is filed in part to affirm or dissuade that class design.
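
As a rough illustration of that reflexive design, the following Python sketch renders the Turtle such a node might carry. Everything about it is an assumption for discussion: the function name `downloadable_file_turtle` is hypothetical, the facet IRIs are freshly generated, and placing observable:dataPayloadReferenceURL on the ContentDataFacet is a guess about intended usage, since the property has no definition in the ontology. The node is typed only as observable:ObservableObject, leaving the question of further types open.

```python
import uuid


def downloadable_file_turtle(iri: str) -> str:
    """Render a node whose own IRI is its URL fullValue and payload reference."""
    content_facet = f"kb:ContentDataFacet-{uuid.uuid4()}"
    url_facet = f"kb:URLFacet-{uuid.uuid4()}"
    return (
        f"<{iri}>\n"
        f"    a uco-observable:ObservableObject ;\n"
        f"    uco-core:hasFacet\n"
        f"        {content_facet} ,\n"
        f"        {url_facet}\n"
        f"        ;\n"
        f"    .\n"
        f"\n"
        f"{content_facet}\n"
        f"    a uco-observable:ContentDataFacet ;\n"
        f"    # Assumption: the payload reference points back at the node itself.\n"
        f"    uco-observable:dataPayloadReferenceURL <{iri}> ;\n"
        f"    .\n"
        f"\n"
        f"{url_facet}\n"
        f"    a uco-observable:URLFacet ;\n"
        f'    uco-observable:fullValue "{iri}" ;\n'
        f"    .\n"
    )
```

The sketch does no escaping, so it is only suitable for IRIs that are already valid inside Turtle angle brackets and string literals.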

Summary

UCO seems like it has all the pieces available to express that "I know of large file X, durably archived at this URL, and it has these hashes." But demonstration is needed to test UCO's class design, stepping past the relaxed model permitted by duck typing.

The above lays out questions that I believe will be essential in UCO's efforts towards software supply chain analysis and certain steps in cross-organization-boundary data sharing. I look forward to the opportunity to discuss these approaches and clarify these class and property interactions.

How does one represent a downloadable file in UCO? #534