ucoProject / UCO

This repository is for development of the Unified Cyber Ontology.
Apache License 2.0

How does one represent a downloadable file in UCO? #534

Open ajnelson-nist opened 1 year ago

ajnelson-nist commented 1 year ago

Disclaimer

Participation by NIST in the creation of the documentation of mentioned software is not intended to imply a recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that any specific software is necessarily the best available for the purpose.

Question

In UCO, what rdf:types should be assigned to an ObservableObject that is a downloadable file?

This need arises from at least two directions:

Dataset distribution: When posting reference data, download sites will often host images for delivery over HTTP(S). At least one RDF-based model, DCAT, encourages storing a reference to the downloadable URL as an rdfs:Resource IRI, and treating that IRI as a file (just a file that hasn't been downloaded yet). See property dcat:downloadURL.

Software supply chain: In software supply chain representation, metadata about software packages frequently includes a download URL and hashes corresponding to that URL. See for example this metadata manifest for case-utils' recent release, retrieved and manually trimmed from this API endpoint:

{
    "urls": [
        {
            "digests": {
                "blake2b_256": "d4f928260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9",
                "md5": "ade3eae9b5a5ef0fedfbc065abf79ae7",
                "sha256": "daf617d96b1dc74b2953f82067365b1858cbe0e9d4a9d2659091f23951129bc1"
            },
            "filename": "case_utils-0.10.0-py3-none-any.whl",
            "md5_digest": "ade3eae9b5a5ef0fedfbc065abf79ae7",
            "size": 537812,
            "upload_time_iso_8601": "2023-03-31T16:12:49.931860Z",
            "url": "https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl"
        }
    ]
}

It seems this is something UCO is designed to be able to represent, but the classes and properties that look like the best candidates for doing so have not been exercised much. Some of them are not documented, and some have fairly lax constraints left over from the prototyping days pre-dating the ObservableObject subclass hierarchy.

First, because of the representation suggested by DCAT, which is not distinct to DCAT, I am specifically interested in how to represent the url resource in that JSON dictionary as an IRI, without being wholly reliant on duck-typing:

<https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl>
    a uco-core:UcoObject ;
    .

(As an aside, this also disregards the UCO guidance that IRIs end with UUIDs. In at least the DCAT application, this is just going to have to be the case. We could consider this an exercise of UCO as an enricher of existing knowledge bases. The implementation for UCO Issue 430, where the UUID requirement was introduced, specifically allowed for this use case.)

From CASE/UCO duck-typing, I know I want this to have a URLFacet, FileFacet, and ContentDataFacet. So, here is how I would add Facets to say that that UcoObject behaves like an observable:File, observable:URL, and observable:ContentData. (This next block, and code blocks further in this post, should be read as additive on top of what is written in preceding blocks.)

<https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl>
    uco-core:hasFacet
        kb:ContentDataFacet-2e1a9cee-1353-471d-b318-92fc9da7280b ,
        kb:FileFacet-82fd5577-bed0-4f7f-ba3f-08d3583c2efb ,
        kb:URLFacet-a78e2688-44b8-4eb9-b474-33c5e2b3c32a
        ;
    .

kb:ContentDataFacet-2e1a9cee-1353-471d-b318-92fc9da7280b
    a uco-observable:ContentDataFacet ;
    uco-observable:hash
        kb:Hash-90ae9698-8259-4a48-a74c-c9c149b447df ,
        kb:Hash-bb8e8808-e7f6-4759-81b5-2df00865d2bc ,
        kb:Hash-cb51e845-086c-43a7-99ef-6d44569e2143
        ;
    uco-observable:sizeInBytes 537812 ;
    .

kb:FileFacet-82fd5577-bed0-4f7f-ba3f-08d3583c2efb
    a uco-observable:FileFacet ;
    uco-observable:fileName "case_utils-0.10.0-py3-none-any.whl" ;
    uco-observable:sizeInBytes 537812 ;
    .

kb:Hash-90ae9698-8259-4a48-a74c-c9c149b447df
    a uco-types:Hash ;
    rdfs:comment "Beside the point of the illustration - this Hash uses a method not currently encoded in UCO, but this is consistent with UCO's semi-open vocabulary pattern."@en ;
    uco-types:hashMethod "blake2b_256" ;
    uco-types:hashValue "d4f928260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9"^^xsd:hexBinary ;
    .

kb:Hash-bb8e8808-e7f6-4759-81b5-2df00865d2bc
    a uco-types:Hash ;
    uco-types:hashMethod "MD5"^^uco-vocabulary:HashNameVocab ;
    uco-types:hashValue "ade3eae9b5a5ef0fedfbc065abf79ae7"^^xsd:hexBinary ;
    .

kb:Hash-cb51e845-086c-43a7-99ef-6d44569e2143
    a uco-types:Hash ;
    uco-types:hashMethod "SHA256"^^uco-vocabulary:HashNameVocab ;
    uco-types:hashValue "daf617d96b1dc74b2953f82067365b1858cbe0e9d4a9d2659091f23951129bc1"^^xsd:hexBinary ;
    .

kb:URLFacet-a78e2688-44b8-4eb9-b474-33c5e2b3c32a
    a uco-observable:URLFacet ;
    uco-observable:fullValue "https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl" ;
    .
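As a rough illustration of how the Hash nodes above relate to the "digests" dictionary in the JSON manifest, here is a minimal Python sketch. The function names are hypothetical, and the mapping table covers only the two spellings that appear in the listing above ("MD5" and "SHA256"); any other method name is emitted as a plain string, following the blake2b_256 example.

```python
# Sketch: turn a PyPI-style "digests" dictionary into Turtle text for uco-types:Hash nodes.
import uuid

# Assumption: only the two vocabulary spellings shown in the listing above are mapped.
KNOWN_METHODS = {"md5": "MD5", "sha256": "SHA256"}


def hash_method_literal(pypi_name: str) -> str:
    """Return the Turtle literal to use as the object of uco-types:hashMethod."""
    if pypi_name in KNOWN_METHODS:
        return f'"{KNOWN_METHODS[pypi_name]}"^^uco-vocabulary:HashNameVocab'
    # Names outside the mapping stay plain strings, as in the blake2b_256 node above.
    return f'"{pypi_name}"'


def hash_node_turtle(pypi_name: str, hex_digest: str) -> tuple[str, str]:
    """Return (node IRI, Turtle text) for one Hash node identified by a fresh UUID."""
    iri = f"kb:Hash-{uuid.uuid4()}"
    text = (
        f"{iri}\n"
        f"    a uco-types:Hash ;\n"
        f"    uco-types:hashMethod {hash_method_literal(pypi_name)} ;\n"
        f'    uco-types:hashValue "{hex_digest}"^^xsd:hexBinary ;\n'
        f"    .\n"
    )
    return iri, text


if __name__ == "__main__":
    digests = {
        "md5": "ade3eae9b5a5ef0fedfbc065abf79ae7",
        "sha256": "daf617d96b1dc74b2953f82067365b1858cbe0e9d4a9d2659091f23951129bc1",
    }
    for name, value in digests.items():
        print(hash_node_turtle(name, value)[1])
```

The returned IRIs are what a ContentDataFacet would list under uco-observable:hash.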

Prior practice within UCO suggests that the rdf:type of the UcoObject should be at least uco-observable:ObservableObject, though nothing in the classes and properties used above is encoded to require this:

<https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl>
    a uco-observable:ObservableObject ;
    .

Now, it's not clear whether these types would also be correct to assert. (As noted in prior Issues, UCO's usage of duck typing is not inferentially bidirectional. Being a File implies having a FileFacet. Behaving like a File, i.e. having a FileFacet, does not imply that the UcoObject is a File. An Issue coming soon will propose encoding this.)

<https://files.pythonhosted.org/packages/d4/f9/28260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9/case_utils-0.10.0-py3-none-any.whl>
    a
        uco-observable:ContentData ,
        uco-observable:File ,
        uco-observable:URL
        ;
    .
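The one-directional reading described in the parenthetical above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not UCO's encoding: the lookup table and both function names are hypothetical, and the check treats the graph as closed-world for simplicity.

```python
# Sketch: asserted classes lead a reader to expect Facets; Facets do not imply classes.

# Assumption: a small subset of class-to-Facet pairs, using names from this post.
IMPLIED_FACET = {
    "uco-observable:File": "uco-observable:FileFacet",
    "uco-observable:URL": "uco-observable:URLFacet",
    "uco-observable:ContentData": "uco-observable:ContentDataFacet",
}


def missing_facets(types: set[str], facets: set[str]) -> set[str]:
    """Facet types that the asserted classes would imply but that are not attached."""
    expected = {IMPLIED_FACET[t] for t in types if t in IMPLIED_FACET}
    return expected - facets


def types_inferred_from_facets(facets: set[str]) -> set[str]:
    """Under the current reading, attached Facets license no class inference at all."""
    return set()


if __name__ == "__main__":
    facets = {
        "uco-observable:ContentDataFacet",
        "uco-observable:FileFacet",
        "uco-observable:URLFacet",
    }
    # A node typed only as ObservableObject, carrying all three Facets.
    print(missing_facets({"uco-observable:ObservableObject"}, facets))
    print(types_inferred_from_facets(facets))
```

Both calls return an empty set for the wheel node: nothing is missing, and nothing about File, URL, or ContentData membership follows from the Facets alone.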

By UCO's current encoding, those three OWL Classes would be fine to use concurrently on the same node, because UCO does not define anything about them as disjoint from one another. (UCO's only current disjointness axiom separates uco-core:UcoObject from uco-core:UcoInherentCharacterizationThing, the superclass of uco-core:Facet.) But, is this a practice UCO should encourage users to adopt, or discourage users from adopting?

There are at least the following non-trivialities with using the three classes concurrently.

Confusion with rdfs:Resource

One point that makes handling this not obvious is that UCO's observable:URL class in this case becomes a bit confused with the RDF foundational class rdfs:Resource.

Potential decision on inherence of files

Another point is that it is possible UCO could go the route of defining File as inherent to FileSystem. (This would take a healthy amount of discussion, as there would be significant pros and cons on this. A proposal on this isn't coming today.) If this decision were adopted, could a URL also be a File? Or do we need a Relationship defined to represent that a URL is (or was at some time), say, a projection of, or access channel to, a file on a file system, or an object in an S3 bucket (as done by Digital Corpora)?

Difference in occurrence of ContentData and URL

Another point is that observable:ContentData doesn't have any relationship (that is, subclass-based or predicate-based) with other ObservableObjects encoded in the ontology. Does this snippet of the JSON above ...

{
    "digests": {
        "blake2b_256": "d4f928260b3e9335605ac2093779e9780acaaba2c0794a47a53822a0c98e52d9",
        "md5": "ade3eae9b5a5ef0fedfbc065abf79ae7",
        "sha256": "daf617d96b1dc74b2953f82067365b1858cbe0e9d4a9d2659091f23951129bc1"
    }
}

... describe the URL? Or, does it describe some more abstract content-signature pattern? If the latter, how does this relate to the URL? Would a File relate the same way?

kb:ContentDataPattern-8bd128e1-f096-4e6f-8c68-0af88d4df52e
    a
        uco-observable:ContentData,
        uco-pattern:Pattern
        ;
    # uco-core:hasFacet links to a ContentDataFacet, linking to the Hashes ...
    .

(The UCO Pattern namespace has not, to date, been demonstrated in any public CASE or UCO examples.)

Payload reference URL

Separately but relatedly, there is a property observable:dataPayloadReferenceURL. It lacks a definition (rdfs:comment) in the ontology, and only constrains its range to observable:ObservableObject. It has not been demonstrated on the CASE website. It has been demonstrated a few times in CASE-Examples:

CASE-Corpora currently does not demonstrate observable:dataPayloadReferenceURL, but I'm considering adding a shape (scoped only to CASE-Corpora for now) that tailors its usage for "Downloadable files," a class where each member of the class reflexively treats its own IRI as its URL fullValue and its content-data dataPayloadReferenceURL. Within CASE-Corpora, DCAT influences this decision. This Issue is filed in part to affirm or dissuade that class design.

Summary

UCO seems like it has all the pieces available to express that "I know of large file X, durably archived at this URL, and it has these hashes." But demonstration is needed to test UCO's class design, stepping past the relaxed model permitted by duck typing.
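To make the "it has these hashes" part of that statement concrete, here is a minimal Python sketch that recomputes the three digest types from the manifest and reports any that disagree with the recorded values. The function names are hypothetical, and the sketch assumes that the manifest's "blake2b_256" key denotes BLAKE2b with a 32-byte digest.

```python
# Sketch: compare recorded digests against bytes that were actually downloaded.
import hashlib


def compute_digests(content: bytes) -> dict[str, str]:
    """Compute the digest types that appear in the manifest."""
    return {
        # Assumption: "blake2b_256" means BLAKE2b truncated to 32 bytes (256 bits).
        "blake2b_256": hashlib.blake2b(content, digest_size=32).hexdigest(),
        "md5": hashlib.md5(content).hexdigest(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def mismatched_digests(content: bytes, recorded: dict[str, str]) -> list[str]:
    """Return the names of recorded digests that do not match the content.

    Digest names this sketch cannot compute are also reported as mismatches.
    """
    actual = compute_digests(content)
    return sorted(
        name
        for name, value in recorded.items()
        if actual.get(name) != value.lower()
    )


if __name__ == "__main__":
    data = b"stand-in bytes for a downloaded file"
    recorded = compute_digests(data)
    print(mismatched_digests(data, recorded))         # nothing to report
    print(mismatched_digests(data + b"!", recorded))  # every digest differs
```

An empty result means the downloaded bytes agree with every recorded digest; it says nothing about which UCO class the node should carry.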

The points above lay out questions that I believe will be essential in UCO's efforts towards software supply chain analysis and certain steps in cross-organization-boundary data sharing. I look forward to the opportunity to discuss these approaches and clarify these class and property interactions.

sbarnum commented 1 year ago

The design intent for UCO was not for a single ObservableObject to conflate File and URL together. The intended way to describe a "downloadable file" is to convey a File object, a separate URL object and a Relationship object with source=, target=, and kindOfRelationship="downloadable_from". I would assert that this is the most appropriate, correct, clear and fault-tolerant way to do this. If you wished to convey content details of the file such as hashes and such you could take one of two approaches: 1) include both a FileFacet and a ContentDataFacet on the File object, 2) include a separate ContentData ObservableObject and a Relationship object with source=, target=, and kindOfRelationship="contains".

Separating the File and the URL and associating them with a relationship avoids the complexities explicitly or implicitly identified in the writeup above. It also yields a much more effective graph where the same file can be downloadable from multiple URLs, the download URL may change over time, etc.
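A small Python sketch of that separated shape, with triples held as plain tuples, may help show why one file can have several download URLs. The node names and function names are hypothetical, and the choice of the File as source and the URL as target is an assumption, since the comment above does not fill in those two slots.

```python
# Sketch: one File node, several URL nodes, one Relationship node per pairing.
import uuid

Triple = tuple[str, str, str]


def downloadable_from(file_iri: str, url_iri: str) -> list[Triple]:
    """Build one Relationship node linking a File to a URL it can be fetched from."""
    rel = f"kb:Relationship-{uuid.uuid4()}"
    return [
        (rel, "rdf:type", "uco-core:Relationship"),
        # Assumption: the File is the source and the URL is the target.
        (rel, "uco-core:source", file_iri),
        (rel, "uco-core:target", url_iri),
        (rel, "uco-core:kindOfRelationship", "downloadable_from"),
    ]


def download_urls(triples: list[Triple], file_iri: str) -> list[str]:
    """List every URL node attached to the File by a downloadable_from Relationship."""
    from_file = {s for s, p, o in triples if p == "uco-core:source" and o == file_iri}
    right_kind = {
        s
        for s, p, o in triples
        if p == "uco-core:kindOfRelationship" and o == "downloadable_from"
    }
    wanted = from_file & right_kind
    return sorted(o for s, p, o in triples if p == "uco-core:target" and s in wanted)


if __name__ == "__main__":
    # Hypothetical node names, used only for illustration.
    graph = downloadable_from("kb:File-1", "kb:URL-primary")
    graph += downloadable_from("kb:File-1", "kb:URL-mirror")
    print(download_urls(graph, "kb:File-1"))
```

Adding or retiring a download location only adds or removes a Relationship node; the File node and its Facets stay untouched.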

If a formal disjoint axiom between File and URL classes is believed necessary to avoid anyone getting confused and conflating them onto the same object for this use case then I would support such an action.

It is pointed out above that the observable:dataPayloadReferenceURL is missing a definition. This looks like an unfortunate oversight. The intended purpose for this property is to provide a link to where the actual content of a ContentDataFacet (on a ContentData, File, Memory, etc. object) could be stored. It applies when the content should be available but is either too large to share encoded in the observable:dataPayload property or not desired to be expressed directly in the UCO object.
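The choice between the two properties can be sketched as a simple size check. This is only an illustration of the intent described above: the function name and the size cut-off are hypothetical (the thread names no limit), and base64 as the inline encoding is an assumption.

```python
# Sketch: pick an inline payload or a reference, depending on content size.
import base64

# Hypothetical cut-off; the thread does not name a size limit.
INLINE_LIMIT_BYTES = 1024


def payload_properties(
    content: bytes,
    reference_iri: str,
    inline_limit: int = INLINE_LIMIT_BYTES,
) -> dict[str, str]:
    """Return the single ContentDataFacet property that conveys the content."""
    if len(content) <= inline_limit:
        # Assumption: inline content is carried as base64 text.
        encoded = base64.b64encode(content).decode("ascii")
        return {"uco-observable:dataPayload": encoded}
    # Otherwise point at the node describing where the content is stored.
    return {"uco-observable:dataPayloadReferenceURL": reference_iri}


if __name__ == "__main__":
    print(payload_properties(b"hi", "kb:URL-1"))
    print(payload_properties(b"\x00" * 5000, "kb:URL-1"))
```

The second case in the description above, content that is simply not desired inline, would skip the size test and always return the reference.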