Model collections - Githubissues

apdavison commented 3 years ago

A common scenario in modelling is that we have a large number of similar/related single neuron models. Each model needs to have a separate representation in the KG, as they can be used individually, and may have different validation reference data, etc. However, we don't want to flood the KG Search results with hundreds of such models; rather, the user should retrieve a model collection, with links to the individual component models.

Currently we achieve this by keeping the individual models in the Model Catalog, and releasing a uniminds ModelInstance to represent the collection.

This is a hack, and for openMINDS / KG v3 I'd like to do things more cleanly.

What I propose is to use the "hasSupplementVersion" property of ModelVersion to hold the links from the ModelVersion representing the collection to the list of component single neuron models.

The question remains: how to hide the individual component models in the KG Search? If this is the only use case for "hasSupplementVersion", then I guess the KG UI logic could exclude models which are "supplements" to another from the search results, or have a checkbox to allow the option of including such models. An alternative approach would be to add a ModelCollection schema to openMINDS.

apdavison commented 3 years ago

@olinux @lzehl what do you think?

lzehl commented 3 years ago

Dear @apdavison this is an important use case and similar for datasets (although with much larger collections, I suppose). We received multiple requests by users and reviewers to "hide" versions in the KG Search and only present them on direct request.

I had something like the following in mind for this, but did not yet fully discuss this with @olinux or the development team (I think it matches what you suggest, but maybe not completely?):

the KG Search predominantly visualizes only the conceptual research products (Dataset, Model, Software, MetaDataModel)
the research product versions are listed there according to their relation:
- supplementing versions together (with a "download all" option, because they supplement each other to provide the full research product)
- alternative versions listed next to each other (maybe stating the major difference between them, e.g. contentType of data)
- sequential versions listed in order (latest at the top/coming first)
individual research product versions should still be represented in the KG Search, but only on direct request through the conceptual research product

If this would not work for your use case could you maybe give a more concrete example? I like the checkbox you suggested for defining for the supplement versions how they are displayed.

@olinux your thoughts on this?

apdavison commented 3 years ago

@lzehl This is close to the model collection use case; the main difference is that each of the component single neuron models has both a Model and at least one ModelVersion, so we would still need to be able to hide (or group) the Model products in the KG Search.

lzehl commented 3 years ago

@apdavison I think I do not understand the structure of this completely... Let me ask a couple of questions (to see where I might have the wrong assumption):

1) There is a main Model with a set of ModelVersions?
2) These main ModelVersions are defined by some code, but that code also integrates various other (sub)Models, and either all their versions or actually one version for each of these (sub)Models?
3) Can those (sub)Models be used independently of the main Model or not? Meaning, should they also be usable by other main Models?

apdavison commented 3 years ago

@lzehl Let me try to restate the problem:

There is a collection of related/similar Models. Each Model has at least one ModelVersion.
The size of the collection is somewhere between 10 and 1000.
Each Model can be used independently.

The problem is that because there are so many they would dominate the KG Search results, so we would like a single entry per collection in the results, which gives access to all the Models/ModelVersions in the collection. (Note that we have several such collections).

Possible solutions:

1. Add a ModelCollection and ModelCollectionVersion schema to openMINDS. In this scenario, it should be easy to exclude the members of the collection from the search results, but then we are adding yet another schema to openMINDS.

2. Create a Model (with one or more associated ModelVersions) to represent the collection. In this scenario, the logic to exclude the members of the collection from the search results becomes more complex, but the advantage is we're reusing existing schemas.

lzehl commented 3 years ago

Thanks for explaining again @apdavison. Here my thoughts:

Normally I would say that for grouping related models typically a Project should be used. But that of course does not solve the problem that you do not want to flood the KG Search Results.

For solution 1: I still have some questions here: why is a ModelCollectionVersion needed? Would it not be sufficient to group all Models (and with that all their versions) into one collection? Is there any other metadata you would like to capture for a ModelCollection, besides using it to group related Models?

For solution 2: Assuming there is only one Collection and no CollectionVersion is needed, I would maybe redefine the ResearchProduct / ResearchProductVersion schemas by moving the "hasSupplementVersion" to the ResearchProduct (that would work for datasets as well, and I suppose for software too @jagru20 ?). I would then define one Model for the whole collection, define one ModelVersion for each related model, and list them in hasSupplementVersion. I would leave the hasVersions in the Model blank. For each version of the related models I would again define ModelVersions and connect those via hasNewVersion/hasAlternativeVersion with the corresponding ones in hasSupplementVersion. Note: the "hasSupplementVersion" could also be renamed to something else if needed.

apdavison commented 3 years ago

> why is there a ModelCollectionVersion needed? would it be not sufficient to group all Models (and with that all their versions) into one collection?

for example, more models might be added to the collection.

lzehl commented 3 years ago

@apdavison I see.

That case could then be covered in solution 2 via a Project: each ModelCollectionVersion would be one Model (the rest is the same as stated above) and these Models are grouped into a Project.

Would that work?

Solution 3 (similar to 2 but from a different angle): Leave the schemas as they are, but add an optional property "isPartOfCollection" to a ResearchProduct that can link to another ResearchProduct of the same type, in your case a Model. The referenced Model(s) (entered in "isPartOfCollection") represent(s) the CollectionVersion(s), which can be grouped into one Project. Each Model that has listed another Model in "isPartOfCollection" does not need to be directly visualized in the KG Search (including its ModelVersions); only the ones that are referenced in "isPartOfCollection" are grouped in a Project.

lzehl commented 3 years ago

I need to think this through more...

@apdavison could you let me know in which use case a collection version is really needed (e.g., should they always get a DOI?) I understood at first that the feature you're missing is mainly for visualization purposes, and not that this structure is needed for referencing. I'm just asking again because the versioning makes this problem much more difficult to solve cleanly... Or, asking differently, is it necessary that the overall collection is citable (meaning that it gets a DOI)?

apdavison commented 3 years ago

Yes, the collection needs to be citable. It is less important that the individual members be citable.

If we add the property "isPartOfCollection" to ResearchProductVersion rather than to ResearchProduct, I think that solves the versioning problem. Then any ResearchProductVersion for which "isPartOfCollection" is not empty should be "hidden", and any ResearchProduct for which all its versions are hidden should also be hidden.
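The hiding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Version` and `Product` classes and their field names are hypothetical stand-ins, not openMINDS schemas, and the choice to keep a product without any versions visible is an assumption that the rule above does not settle.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    """Hypothetical stand-in for a ResearchProductVersion."""
    name: str
    # Names of the collection versions this version belongs to.
    is_part_of_collection: list = field(default_factory=list)


@dataclass
class Product:
    """Hypothetical stand-in for a ResearchProduct."""
    name: str
    versions: list = field(default_factory=list)


def version_hidden(version):
    # A version is hidden when "isPartOfCollection" is not empty.
    return bool(version.is_part_of_collection)


def product_hidden(product):
    # A product is hidden when all of its versions are hidden.
    # Assumption: a product without versions stays visible.
    return bool(product.versions) and all(
        version_hidden(v) for v in product.versions
    )


# One collection member and the collection itself.
cell_1 = Product("cell_1", [Version("cell_1 v1", ["collection v1"])])
collection = Product("collection", [Version("collection v1")])

visible = [p.name for p in (cell_1, collection) if not product_hidden(p)]
print(visible)
```

With this rule, a model that has one version inside a collection and another version outside it would remain visible, since not all of its versions are hidden.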

lzehl commented 3 years ago

[Image: collectionIssue]

lzehl commented 3 years ago

@apdavison sorry for all the spam on this issue today. Would the previous sketch of the model satisfy your use case? I'm not yet sure how to solve / define the list of supplements for each collection version in that case. This might eventually have to be its own schema, which is maybe even connected to the ResearchProduct as main collection again.

apdavison commented 3 years ago

@lzehl any possibility you could share the diagram with me in an editable form? then I can show what I have in mind

lzehl commented 3 years ago

@apdavison Of course, I'll send you an email.

jagru20 commented 3 years ago

Hi all, I am not sure if I can give much input to the discussion, but to answer the question on the location of hasSupplementVersion: I am not sure if that property makes sense in the researchProduct, because it would mean that one software entity can link to a softwareVersion. In my opinion, this possibility should be - for software - located at researchProductVersion, as it is theoretically possible that different versions of the same software have different supplementVersions (i.e. the former components).

However, if hasSupplementVersion needs to be moved, it might be sensible to amend softwareVersion such that it holds a property "hasComponents" which also would be a bit more straightforward for software.

lzehl commented 3 years ago

@jagru20 yes. I thought that you might mention the "hasComponents". Let's wait for @apdavison's feedback on the sketch. Please continue following this discussion so that we can solve this issue sufficiently for all research products (considering every adoption of the concept of an additional grouping for the research products and/or research product versions).

apdavison commented 3 years ago

[Image: Copy of collection issue]

lzehl commented 3 years ago

@apdavison looks good I think. The ResearchProducts on the right side and the ResearchProductVersions of the different model versions in the middle should not show up in the KG Search, correct?

Some points / questions:
1) Could we rename "isSupplementedBy" to "hasComponents"? (cf. comment by @jagru20 and next point)
2) For datasets with cohort releases (first 20 subjects, second 30 subjects, third 50 subjects, -> total 100 subjects) I would like to still have the "isSupplementVersionOf" option (or something similar in name, could be also dataset specific if there is no need for models and software) between ResearchProductVersions. Or do you have a better idea?
3) We would still need to have a tag on the ResearchProducts and ResearchProductVersions that identifies them for not showing up in the KG Search, correct?
4) Do we need to formally identify the type of being a "collection" on the product cards? (I don't see a need for it)

apdavison commented 3 years ago

1) yes (or we could invert the connection, have "isComponentOf")
2) makes sense to me
3) that's a question for @olinux. In principle the query could exclude items that are in a "hasComponents" list (or which have an "isComponentOf" link)
4) no, I don't think so

olinux commented 3 years ago

Hi, here are my thoughts about this: I see the point in having a structure like this - we should make sure though that it's well understood what it describes, for both the producer of the structure and the consumer. Therefore I would suggest calling a ResearchProduct(Version) a "composite" when it consists of multiple other ResearchProduct(Version)s, which should be documented explicitly as part of the openMINDS documentation. Please object if you have a better term for this.

IMHO, we should (as discussed above):

replace "isSupplementedBy" with "hasComponents" (although technically equivalent, I would keep the direction of the connection from the higher to the lower level, since it feels more natural to describe this top-down and might simplify the way multi-reuse can be displayed - but this is a subjective feeling and I could live with the inverse too).
"hasComponents" can exist on both "ResearchProduct" and "ResearchProductVersion", although a "ResearchProduct" can only point to another "ResearchProduct" whilst a "ResearchProductVersion" can point to another "ResearchProduct" or a "ResearchProductVersion" (this is mostly to relax the requirement to be too specific about the version of a dependent resource, e.g. of software components). For the cohorts (if I understood it correctly), I'm wondering if we shouldn't represent each cohort as an individual "ResearchProduct" which can be grouped by this composite mechanism (so the individual cohorts can be versioned themselves, e.g. if there have been some internal changes / improvements / formatting ...)
We need to define if we want to make it explicit every time that a "ResearchProduct" is a component of another "ResearchProduct" (by specifying it in the "hasComponents" link) or if we want to infer this (e.g. by defining that "ResearchProduct A" automatically becomes a component of "ResearchProduct B" as soon as one of its versions is connected to a version of "B").
- If we do want to infer it, we need to decide if we want to "materialize" this inferred link (e.g. by having automated scripts adding the links for those cases) or if we want to leave it to the client to do the appropriate interpretation.
- If we don't want to infer it, we might want to think about having an automation pipeline suggesting the link to be added by the user.
The KG Search would present "cards" / entry points for root-level ResearchProducts only (which are not components themselves) and integrate the information of the components within this view, similar to versions (the actual design still has to be worked out). Here, we also need to answer the question of whether we want to handle composites of composites or if we restrict this (at least on the interpretation level) to a single composite layer only.
We should have a discussion about DOIs for these kinds of structures: should components of a bigger composite get their own conceptual and/or version DOIs?
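The inference option and the root-level filter mentioned above could look roughly like the following Python sketch. The function names and the plain dictionaries used to hold the links are assumptions made for illustration; they do not reflect the KG query API or any existing automation script.

```python
def infer_product_components(version_components, version_to_product):
    """Derive product-level "hasComponents" links from version-level links.

    Product A becomes a component of product B as soon as one version of A
    is listed as a component of a version of B.
    """
    inferred = {}
    for parent_version, child_versions in version_components.items():
        parent = version_to_product[parent_version]
        for child_version in child_versions:
            child = version_to_product[child_version]
            if child != parent:
                inferred.setdefault(parent, set()).add(child)
    return inferred


def root_products(all_products, product_components):
    """Return the products that are not a component of any other product."""
    components = set().union(*product_components.values())
    return [p for p in all_products if p not in components]


# Hypothetical collection with two versions; the second adds model "b".
version_to_product = {
    "coll v1": "coll", "coll v2": "coll", "a v1": "a", "b v1": "b",
}
version_components = {"coll v1": ["a v1"], "coll v2": ["a v1", "b v1"]}

inferred = infer_product_components(version_components, version_to_product)
cards = root_products(["coll", "a", "b"], inferred)
print(inferred, cards)
```

Whether the inferred links are written back to the graph ("materialized") or computed by the client on each query is exactly the open decision listed above; the sketch only shows the computation itself.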

lzehl commented 3 years ago

Based on @apdavison's example structure and @olinux's comments, I'd like you to have a look at the following drawing: [Image: collection issue]

lzehl commented 3 years ago

I picked up the following aspects from the suggested approaches:

ResearchProduct and ResearchProductVersion both have the property "hasComponents"
- for ResearchProduct this property can connect to other ResearchProducts
- for ResearchProductVersion this property can connect to other ResearchProductVersions
ResearchProduct in addition can have the property "hasVersions" which can connect to ResearchProductVersions
a ResearchProductVersion in addition can state if it "isNewVersionOf" or "isAlternativeVersionOf" another ResearchProductVersion of the same ResearchProduct (if grouped under "hasVersions")
within the system it should be possible to "hide" ResearchProducts which reflect the conceptual component / cohort and ResearchProductVersions that reflect the versioned component / cohort (all potential elements to "hide" are marked in different gray shades)

Although quite complex in structure this seems to be the most consistent way of capturing such cases in the graph database.

USE CASE ONE: Model collections (@apdavison does this still fit?)
USE CASE TWO: Datasets consisting of different subject cohort releases (@UlrikeS91 could you double check if that fits as well?)
USE CASE THREE: Software with different components (@jagru20 would that fit?)

@olinux does this still fit with your thoughts as well?

olinux commented 3 years ago

Hi Lyuba, It does with the minor additional comment that imho "ResearchProductVersions" should be able to point to both, "ResearchProduct" and "ResearchProductVersion" by the "hasComponent" since it would be quite tough to represent (and maintain) the dependency graphs of Software if you can only connect versions with each other. This is especially true if you're thinking about widely used libraries which would need the tracking of all potentially used versions - unless we decide that we register major versions only -> what do you think @jagru20 ?

lzehl commented 3 years ago

@olinux & @jagru20 for "hasComponents", to be honest, it does not really make sense to me to allow pointing from a version to a concept. Would it not be sufficient to allow "only" the registration of the conceptual collection (white shade) with its ResearchProduct components (gray shaded) in the metadata model depicted above? Meaning that for such a case the colored collection versions could be left out if it does not make sense to capture them explicitly. @olinux & @jagru20 would that cover the mentioned use case?

olinux commented 3 years ago

@lzehl this is actually not the same use-case:

What I had in mind for software was that you're registering your software - let's say "Knowledge Graph" with its version "v3" -> now "Knowledge Graph v3" depends on a component called "ArangoDB". So what I would do is to register "ArangoDB" as another Software. There's plenty of different versions for ArangoDB and the "Knowledge Graph" is trying to upgrade regularly to them.

The question is now which granularity you would like to track. You could state:

"Knowledge Graph v3 has the component ArangoDB"
"Knowledge Graph v3 has the component ArangoDB 3.x"
"Knowledge Graph v3 has the component ArangoDB 3.6.11"

The first is obviously the most generic but also the one which needs the least maintenance. Here, you would need to point to the "concept" for "ArangoDB" which is version independent.

The second is the most practical if we want to improve granularity to a version level (and therefore disallow to link ResearchProductVersions to ResearchProduct) since at least the metadata wouldn't need to be updated for every minor version -> nevertheless, if we decide to migrate to ArangoDB 4 (which is possible without changing the Knowledge Graph version number since it's an internal dependency), our meta-data entry would need to be updated though. Here the question appears who is actually doing it and how the software team is going to be notified about the upgrade.

The third approach would only be realistic if we would automatically ingest dependency trees (based on the existing mechanisms like Maven / Gradle / npm / ... ) which - imho - is not the purpose of openMINDS. It would definitely not be possible to manage it manually.
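The maintenance trade-off between the three granularities can be illustrated with a small Python sketch. The `still_valid` helper, the reference notation (`None` for the version-independent concept, `"3.x"` for a major series, a full string for an exact version) and the upgrade sequence are all made up for illustration; they are not part of openMINDS or of the Knowledge Graph.

```python
def still_valid(reference, installed):
    """Check whether a registered component reference still describes
    the version that is actually in use.

    reference: None (concept only), "3.x" (major series) or "3.6.11" (exact).
    """
    if reference is None:
        return True  # concept-level link never goes stale
    if reference.endswith(".x"):
        return installed.split(".")[0] == reference[:-2]
    return installed == reference


# Hypothetical sequence of internal dependency upgrades.
upgrades = ["3.6.11", "3.6.12", "4.0.0"]

references = {"concept": None, "major": "3.x", "exact": "3.6.11"}
results = {
    label: [still_valid(ref, version) for version in upgrades]
    for label, ref in references.items()
}
for label, flags in results.items():
    print(label, flags)
```

Every `False` marks an upgrade after which the metadata entry would have to be edited: never for the concept-level link, once for the major-series link, and after every release for the exact link.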

lzehl commented 3 years ago

@olinux thanks for providing this hands-on example. It helps a lot to organize my thoughts better.

I think the key point for software is that there we are talking about dependencies of a software product which has frequent sub-releases that might not all be captured in the KG. The components in such a software were (most likely) not built to serve that software but were produced as independent products, similar to the models in a collection. The difference to the model collection: all software dependencies are needed in order for the main software to work, while in a model collection a single model could also be left out without affecting the overall functionality of the model collection (in most cases I guess).

My question here is clearly: should software dependencies on that level really be captured within the graph database, or is it not sufficient or even better to document such dependencies within the software repository in the version-specific software specifications? That does not mean that we may not want to capture the dependencies directly in a few cases, but for those I would still think the coarse level you suggest would be sufficient.

I'm asking this for two reasons: on the one hand that level of detail seems to me more on tier-3 level or even beyond (since changes might happen frequently) on the other hand I think we do not aim to register all software out there within the KG in order to cover all possible dependencies of all software products. From your comment above I think you argue in the same direction, correct?

What could be done for software to "outsource" this issue is to allow to point to a "dependency file" for a specific registered software version and to better capture that the repository link of a software product does point to the overall repository and not necessarily the registered version (e.g. the official release of that version).

@jagru20 & @olinux let me know what you think.

jagru20 commented 3 years ago

As far as I understood, the purpose of the current components attribute in software was not to capture all possible dependencies of a software, but rather to point to other neuroscience-related software that this software uses as a component to function. What we considered as neuroscience-related until now is software that either already is part of the KG or is to be integrated into it (i.e., no commonly known libraries or services, but other specialized software or libraries).

I am not totally sure, but I think this is also a question about what information we want to deliver in defining another software as a component. Do we want to

just give a hint to software that is related to the software the user is currently looking at, or
map neuroscience-related dependencies?

In the first case, the first of @olinux's granularity examples is totally sufficient IMO. In the second case, in my understanding, the SoftwareProductVersion should carry the information on which version of its component it relies (like the green shade in the drawing). Unfortunately, I don't know enough about software development to be able to assess when such a dependency can change without the version number having to change, and have so far assumed that examples like @olinux's above don't happen. But in that case I would refer to the softwareProduct, in the sense that it could theoretically be the latest version. However, I don't think that would really help as it is too unspecific.

Maybe @bweyers could briefly explain the initial intention behind the Components entry?

lzehl commented 3 years ago

@apdavison , @olinux , @jagru20 , @UlrikeS91 , @skoehnen , @bweyers

I've made the following changes now (within the PR #168):
1) All individual ResearchProduct schemas (Dataset, Model, MetaDataModel, Software) have now a property "hasComponent" that can link to another ResearchProduct of the same type.
2) All individual ResearchProductVersion schemas (DatasetVersion, ModelVersion, MetaDataModelVersion, SoftwareVersion) have now a property "hasComponent" that can link to another ResearchProductVersion of the same type.
3) The property "hasSupplementVersion" does not exist anymore in the individual ResearchProductVersion schemas.
4) The property "hasAlternativeVersion" was changed to "isAlternativeVersionOf" in all individual ResearchProductVersion schemas.

All properties discussed above are of course not required.
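The same-type constraint on "hasComponent" described in the changes above can be sketched as a small check. This is a minimal illustration and not part of PR #168: the dictionary layout (plain "@id", "@type" and "hasComponent" keys holding ID strings) is a simplification assumed here and is not the actual openMINDS serialization, and `check_has_component` is a hypothetical helper.

```python
def check_has_component(instances):
    """Return messages for "hasComponent" links that violate the rule
    that a component must have the same type as its parent."""
    by_id = {inst["@id"]: inst for inst in instances}
    errors = []
    for inst in instances:
        # "hasComponent" is optional, so a missing key means no components.
        for target_id in inst.get("hasComponent", []):
            target = by_id.get(target_id)
            if target is None:
                errors.append(f"{inst['@id']}: unknown component {target_id}")
            elif target["@type"] != inst["@type"]:
                errors.append(
                    f"{inst['@id']} ({inst['@type']}) cannot have "
                    f"{target_id} ({target['@type']}) as component"
                )
    return errors


# Hypothetical instances: a collection version with one valid and one
# invalid component link.
instances = [
    {"@id": "coll-v1", "@type": "ModelVersion",
     "hasComponent": ["cell1-v1", "cell2"]},
    {"@id": "cell1-v1", "@type": "ModelVersion"},
    {"@id": "cell2", "@type": "Model"},
]

errors = check_has_component(instances)
print(errors)
```

Here the link from the collection's ModelVersion to a ModelVersion passes, while the link to a Model is reported, which matches the decision to keep version-to-concept links out of the schemas for now.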

lzehl commented 3 years ago

This issue seems to be solved for now, therefore I am closing it. Let's see if it holds up in the use cases.

openMetadataInitiative / openMINDS_core

Model collections #163