spdx / spdx-3-model

Other
67 stars 42 forks source link

Cyclic dependency of definitions in the Model #35

Closed maxhbr closed 4 months ago

maxhbr commented 1 year ago

in https://github.com/spdx/spdx-3-model/commit/92c25fcb420934c5b32f7809cd46fc12d3d6182d one has the following cyclic dependency which is annoying when implementing:

This causes a cyclic dependency between implementing modules which is really bad in some programming languages and at least annoying in others.

iamwillbar commented 1 year ago

Yeah, I need to go back through the minutes. I don't think Actor is meant to inherit from Element.

maxhbr commented 1 year ago

2023-01-24_19-13-45

zvr commented 1 year ago

The arrow from CreationInformation goes to Identity, but it should go to Entity.

But this is minor, since we decided that it's a pointer and not an embedded Element.

maxhbr commented 1 year ago

The issue is, that the definition reads like

  1. a Creation Info points to a createdBy of type Entity
  2. an Element contains a Creation Info
  3. a Entity is something that extends an Element

So, Entity is used before it is defined. This is the essence of the cycle and make implementing and working very hard.

task: if you think there is no cycle, write the definitions without using something before it is defined

iamwillbar commented 1 year ago

In the createdBy of an entity it can refer to itself, for example, this pseudocode is valid:

{
  "SPDXID": "urn:william",
  "creationInformation": {
    "createdBy": ["urn:william"],
    ...
  },
  ...
}
seabass-labrax commented 1 year ago

I believe there are two options that can be taken:

  1. The Element Creator field contains an 'anonymous' class (that is, contains a 'struct')
  2. The Element Creator field contains a pointer to an Element (that is, contains the IRI of an Element)

Both of these options have been discussed in SPDX Tech Team calls, and have been reflected at different times in diagrams committed to this repository.

Option 2, pointer to an Element, means that there will, logically speaking, always be at least one cycle in the supergraph of all SPDX Elements. In practice, it is likely that every SPDX data producer will at first create an SPDX Identity Element describing themselves, causing a self-referential cycle for each SPDX data producer. This is a different kind of cyclic dependency than the one that @maxhbr describes at the beginning of this issue thread: it's about the SPDX Elements in a graph, not the SPDX model definitions.

davaya commented 1 year ago

Entity is not "used before it is defined". Software that creates an element (e.g., a Python dict) has two fields with the same value.

In contrast to William's example, the model is not modeling recursive values:

{
  "SPDXID": "urn:william",
  "creationInformation": {
    "createdBy": [
      {
        "SPDXID": "urn:william",
        "creationInformation": {
          "creatdBy": [
            {
              "SPDXID": "urn:william",
              "creationInformation": {
                "creatdBy": [
                   ...

ERROR 506: Out of Memory
maxhbr commented 1 year ago

In the createdBy of an entity it can refer to itself, for example, this pseudocode is valid:

This pseudocode is talking about a different problem.

maxhbr commented 1 year ago

Entity is not "used before it is defined". Software that creates an element (e.g., a Python dict) has two fields with the same value.

but the term Entity is used before it is defined.

I am not talking about self reference or the circle that is created by being its own creator. I am talking about the problem in https://github.com/spdx/spdx-3-model/issues/35#issuecomment-1402394929

davaya commented 1 year ago

A schema is declarative, not procedural. All the words go onto the paper at the same time, or typed in in any order. A person can type or a program can create:

Entity = {...} Element = {...} CreationInfo = {...}

in any order. Undefined references are detected at runtime when the complete model is used to generate or validate data instances, not while editing the model (except perhaps as error flags in a smart editor that had better disappear before editing is complete).

maxhbr commented 1 year ago

Yes, you are right, one has the power to write and define such structures. That is true, since also the diagram is able to define it.

But whether it is a good or not is a different question. And this issue here talks about the negative implications of this complexity introduced by it.

davaya commented 1 year ago

My vote would be that:

  1. "goodness" is a property of implementations
  2. defining the model in a way that supports references (as in William's example) is good, prohibiting them is bad
  3. prohibiting dependencies between model definitions is an effect of code design, not programming language *
  4. therefore bending the model to accommodate implementation limitations is bad

* Proof: JSON schema supports cyclic references and is supported in many programming languages, compiled and interpreted, weakly and strongly typed.

As in my example, container associations must form a DAG. But prohibiting cycles doesn't mean the model must prohibit containers, it just means the model uses links where cycles could occur (as in self-created identity elements).

Back in the days of punch cards I do remember languages where forward references were an error, but I don't think any of them are in use today :-).

maxhbr commented 1 year ago
  1. "goodness" is a property of implementations

If the data model is good to understand and good to implement and does not make it hard to describe is a property of the model.

  • Proof: JSON schema supports cyclic references and is supported in many programming languages, compiled and interpreted, weakly and strongly typed.

This argues that it is theoretically possible to handle it and you thus agree with that there is the described cycle. So it is not a proof that there is no cycle, which I asked for.

Back in the days of punch cards I do remember languages where forward references were an error, but I don't think any of them are in use today :-).

I think forward references within a file are supported in many places. But the initial argument was, that if you would implement it in multiple modules (or in any modular way), these modules would then have a cyclic dependency on each other. And that is discouraged and unsupported in many places.

So, after we agreed that this property described here is true, lets focus on whether it is desirable in its current form.

maxhbr commented 1 year ago

This is just a random thought, not a proposal:

I know, that we probably would not be able to agree on it but just to throw a way of resolving this issue into the round: 2023-01-25_07-41-56

davaya commented 1 year ago

So it is not a proof that there is no cycle, which I asked for.

I'm afraid I'm lost - are you talking about a class inheritance cycle or a data instance cycle? There are no class cycles in the SPDX model, just as there are no class cycles in the genealogy model. There can certainly be data cycles, including nonsense graphs like Bob is his own grandfather, but the model itself is not cyclic.

But the initial argument was, that if you would implement it in multiple modules (or in any modular way), these modules would then have a cyclic dependency on each other.

Is it your opinion that there is no modular way to do genealogy using this model? If you believe that there is not, then I say the implementation of "modularity" is incorrect, not the model.

spdx-person1

maxhbr commented 1 year ago

Is it your opinion that there is no modular way to do genealogy using this model?

Your example does not contain such a cycle. You would create that problem if you would reverse the arrow between Element and Marriage

maxhbr commented 1 year ago

I'm afraid I'm lost - are you talking about a class inheritance cycle or a data instance cycle?

It contains the inheritance, but that is not the only relation that is in the cycle.

Let me write it again:

  1. to define what an Element is, you need to understand what CreationInformation is, since it has a property of Type CreationInformation
  2. to define the CreationInformation, you need to understand the type Entity, since it contains or has a reference or a pointer to something of type Entity
  3. to define Entity, you need to understand Element, since Entity inherits from Element
  4. to define what an Element is, you need to understand what CreationInformation is
  5. ...

(define is here not just "define in code" but also relevant when writing a specification) (I know that there are technical solutions to implement arbitrary data models and the argument here is not that it is impossible)

Correctness of modularization

If you believe that there is not, then I say the implementation of "modularity" is incorrect, not the model.

I think it would be fairly reasonable to define Element and Entity in different modules. With the cycle above, the element module needs to know the entity module and the entity module needs to know the element module.

davaya commented 1 year ago

CreationInformation is shown as a struct, but it is a group of properties that have been given a name. To simplify the discussion, assume that Element is defined as it previously was, with the properties listed directly in Element.

Do you agree that the name "CreationInformation" is a convenience to make the picture of Element more readable, and that factoring out those properties into a struct has no effect on the existence of model cycles?

(Jeff already raised a concern about what default CreationInformation means: is it all or nothing, or does each property have its own default? For serialization to work it must be possible for each creation property to have an independent value, and if the only way to achieve that is to show those properties directly in Element, then we need to do so even if it makes the picture of Element bigger.)

In your top drawing with orange lines, substitute "createdBy" for "creationInfo" in Element. There is then no drawing of a cycle. Entity and Identity and Person have all the properties of Element including createdBy, but there is no cycle.

Saying Element has a property that is an Element is the same as my example saying Person has a property that is a Person (ignore my Element and Marriage, they're not relevant for the cycle discussion),

Thought experiment:

Which of these models has cycles? None of them have type definition cycles, but all of them can model data instances with creation cycles.

My position is that software that thinks some of them are cyclic but others are not is not a good modular implementation. Either all 4 models have "cycles" or none of them do, depending on whether one is talking about type definition cycles or data instance cycles.

spdx-elements

maxhbr commented 11 months ago

This was brought up again by the Lite profile team on the SPDX Mini Summit:

2023-09-18_15-09-25

I think we have not resolved all concerns here.

davaya commented 11 months ago

Pretend that CreationInfo did not exist, and that CreatedBy were a direct property of Element. In that case Agent would have a self-reference cycle. What is the problem with that? Why is it shown in red? An Agent instance can be created by a different Agent, and an Agent instance can be created by the same Agent.

In data instances it is just data - the Actor id IRI and its creationInfo/createdBy IRI can be the same value.

In the real world producers can use their own private keys to sign their public keys - "self-signed certificates" are a real thing. A real-world person can produce an Actor element for himself and designate himself as its creator.

davaya commented 11 months ago

Back when Sebastian was leading the Canonicalization focus group, I submitted this serialization diagram that distinguishes between datatypes that have no @id and classtypes that do have an @id. @zvr: The logical model has a directory called "Classes" and a directory called "Datatypes", but unlike William's diagram which correctly puts CreationInfo, NamespaceMap, ExternalMap, IntegrityMethod, PositiveIntegerRange, etc in the datatypes section, the model files put them in the "Classes" section. It should not, because CreationInfo does not have an @id.

In any case, the mini summit diagram also does not distinguish between classtypes and datatypes. Any arrow whose destination is a classtype (e.g., Sbom -> Element) should be depicted differently (e.g., dashed) from arrows whose destination is a datatype (e.g., Sbom -> CreationInfo): solid. The difference is that dashed references can be serialized as id values (SpdxId IRIs), while solid containers cannot. Cycles of container (or subclassOf) edges are an error, if Sbom, Actor, and CreationInfo were all datatypes, then the back link would be invalid. But cycles can be broken by making any single edge in the cycle a reference instead of a container.

Make Agent -> Element a solid grey, since that's what all other subclass edges are. Make all references such as CreationInfo -> Agent dashed black to show that they are references. And make all containers such as Agent -> CreationInfo solid black to show that they are datatypes. The only thing that is a red error is a cycle consisting solely of solid edges.

(The legend could show both "baz (datatype)" with a solid line and "baz (classtype)" with a dashed line as the range of the property, since the distinction is meaningful in graph analysis.)

image

maxhbr commented 11 months ago

@davaya : I think you are looking at a different thing, and we all accept that you can have cycles in the graph things can reference themself. The problem in this issue is a cycle of isInstanceOf or subClassOf arrows. Your example talks about Contain and Reference arrows, which is something different. This ticket is talking about a cycle in the model diagram and not a cycle in data.

davaya commented 11 months ago

OK, but I don't see it. Element is the root class, Sbom and Agent are a subclasses of Element, and CreationInfo is a property of Element. The rightmost red edge is either "CreationInfo is a property of Element" (a correct data relationship) or "Element is a subclass of CreationInfo", which is a nonsense subclassing relationship.

The diagram needs to make the alleged subclassOf relationships clear. There cannot be subclassOf cycles, but AFAIK the model files don't have any. Sbom has a CreationInfo property, Agent has a CreationInfo property, and so does every other element in the model. Nothing wrong with that.

goneall commented 10 months ago

I suggest this be resolved in documentation. I agree with @davaya that this isn't really an issue, but since it does come up from time to time, we need to document what an Agent can have creationInfo with properties pointing to itself.

This could be done in the Agent class description, the CreationInfo class description and the createdBy property description.

rnjudge commented 4 months ago

4/9/2024 - Max agreed we can close this issue. We can re-open if implementers run into this issue.

maxhbr commented 4 months ago

@rnjudge : I fixed the status and closed it as "not planned" instead of "completed", this does better match the outcome