softwarepub / hermes

Implementation of the HERMES workflow
https://docs.software-metadata.pub

Enhance definition of scopes for steps #75

Open poikilotherm opened 1 year ago

poikilotherm commented 1 year ago

To make sure that all of us, and our users, are on the same page about what happens where, let's properly document what is meant to happen in each step.

Example: the process/validate label for a (combined) step for processing data into a unified data model, and the validation of that status. In this case, "process" is not about validation of semantics or syntax (which might need a human in the loop), but instead about consistency of metadata. "validate", however, is exactly about the semantic conflicts within the unified data model (and needs that human).

Example: "curation" vs. "conflict resolution". We have talked about conflict resolution in the past, which is actually "validation" as in the example above. "Curation" is the step of "signing off" on a potential deposit; it may or may not include some part of validation, as well as additional validation, e.g., as described in #68.

There might be other issues with our implicit definition of steps which we should be more explicit about.

sdruskat commented 1 year ago

One question that follows from this is:

poikilotherm commented 1 year ago

After discussing this further with @sdruskat and @poikilotherm, we came to the conclusion that we urgently need to include @led02 here. We know what we want to do on a meta level with processing, validation, curation, humans, etc., but we are not yet clear enough about how to structure this so that it can be turned into executable code.

sdruskat commented 1 year ago

As a basis for discussion, I propose the following terminology (backed up by the naive mixed-type diagram below):

There are different perspectives:

  1. The high-level perspective, which defines parts of the workflow as nouns: preparation, collation, curation, publication, post-processing. This perspective is for communicating what the workflow does, not how it does it.
  2. The user perspective, in which the user is the person starting the workflow, i.e., the source code and metadata "owner".
  3. The implementation perspective, which defines steps of the workflow as verbs: harvest, process, and so on.
  4. The curator perspective, in which the curator is the person deciding if an artifact is published or not.
  5. (A publication repository perspective.)

Of those, we need to distinguish mainly between the high-level and the implementation perspective. Using the terminology proposed above can help navigate between parts and steps.

[Image "grafik": the mixed-type diagram referred to above]

sdruskat commented 1 year ago

Additionally, the graphic proposes a specific modularization of steps in the implementation perspective. Again, as basis for discussion, I suggest the following steps (in simplified terms):

  1. harvest: Metadata is collected and put into the data model.
  2. process: An attempt is made to consolidate the data model. This includes:
    • Deduplication: getting rid of unambiguous duplicates, e.g., two person entities in the same metadata fields in different sources (simplified example: codemeta.authors.name: Stephan Druskat and cff.authors.name: Stephan Druskat)
    • Recognition of conflicting values: for example, when different sources define the same field with different values (simplified example: codemeta.version: 0.9.3-rc1 and cff.version: 1.0.0). Note: The detection of semantic conflicts, e.g., "are John Kennedy and John F. Kennedy the same person?", is not in the scope of this step. Note: Conflict recognition can be configured for this step, and this configuration may also cover semantic conflicts. Examples: disambiguation of people with aliases (e.g., JFK, John Kennedy, John F. Kennedy), mail mapping, and a source hierarchy for a specific field or in general (e.g., when there is both CodeMeta and CFF, always take all values/authors from CFF).
  3. validate: Recognition of semantic conflicts (beyond those that were resolved through configuration in the process step). This includes, e.g., assertion of basic metadata (#68), person disambiguation ("are John Kennedy and John F. Kennedy the same person?"), etc.
  4. report: Collects the consolidated metadata and the recognized conflicts from process and validate into a "report". This report can take the form of a simple data model dump, or of the metadata output with conflict markers in a specific format for consumption by other actors (HTML/MD/PDF report files, JSON/YAML/TOML files for input in a curation UI, a pre-formatted email, etc.).
  5. curate: Very similarly to report, this step collects the consolidated metadata (conflict-free or with conflicts remaining) into a report for curation and signing off. Additionally, it may prepare the report for curation through advanced things like diffing across runs (discussed recently, and part of future work).
  6. prepare: Prepares the deposit by requesting metadata requirements from the target publication repo, and configuring the workflow for the respective mapping.
  7. map: Maps the metadata to the target schema of the publication repository.
  8. deposit: Submits the prepared deposit to the target publication repo.
  9. post-process: Runs any post-processing substeps, e.g., updating metadata, alerting co-authors, etc.
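
To make the proposed scope of the process step more concrete, here is a minimal sketch of deduplication and conflict recognition, using the simplified examples from the list above. It is not taken from the hermes code base: the `process` function, the `(source, field, value)` tuples, and the `source_priority` option are all assumptions for illustration, and each field is treated as single-valued per source, which real author lists are not.

```python
from collections import defaultdict

# Hypothetical harvested values as (source, field, value) tuples,
# mirroring the simplified examples given for the process step.
harvested = [
    ("codemeta", "authors.name", "Stephan Druskat"),
    ("cff", "authors.name", "Stephan Druskat"),
    ("codemeta", "version", "0.9.3-rc1"),
    ("cff", "version", "1.0.0"),
]


def process(entries, source_priority=None):
    """Deduplicate identical values and flag fields whose sources disagree.

    source_priority is an assumed configuration option that maps a field
    name to the source whose value should win (a per-field source hierarchy).
    """
    source_priority = source_priority or {}

    # Group values by field, remembering which source provided each value.
    by_field = defaultdict(dict)
    for source, field, value in entries:
        by_field[field][source] = value

    consolidated, conflicts = {}, {}
    for field, values in by_field.items():
        distinct = set(values.values())
        if len(distinct) == 1:
            # Unambiguous duplicate: all sources agree, keep a single value.
            consolidated[field] = distinct.pop()
        elif source_priority.get(field) in values:
            # Conflict resolved through configuration.
            consolidated[field] = values[source_priority[field]]
        else:
            # Unresolved conflict: pass it on for reporting and curation.
            conflicts[field] = dict(values)
    return consolidated, conflicts
```

Called without configuration, `process(harvested)` keeps the duplicated author name once and reports `version` as a conflict; calling it with `source_priority={"version": "cff"}` resolves that conflict in favour of CFF. Semantic questions such as "are John Kennedy and John F. Kennedy the same person?" are deliberately left out, since they would fall to validate under this proposal.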

sdruskat commented 1 year ago

(One could map those steps straight into the implementation as extension points.)
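
As a basis for discussion only, mapping the steps to extension points could look roughly like the sketch below. The step names come from the list proposed in this thread; the registry, the `extension` decorator, `run_workflow`, and the two example plugins are hypothetical and do not describe how hermes actually loads plugins.

```python
from typing import Callable, Dict, List

# Step names as proposed in this thread, in execution order.
STEP_ORDER = [
    "harvest", "process", "validate", "report", "curate",
    "prepare", "map", "deposit", "post-process",
]

# A plugin takes the shared workflow context and returns it (possibly modified).
Plugin = Callable[[dict], dict]

# Hypothetical registry: one list of plugins per step (extension point).
_registry: Dict[str, List[Plugin]] = {name: [] for name in STEP_ORDER}


def extension(step: str):
    """Register the decorated function as a plugin for the given step."""
    if step not in _registry:
        raise ValueError(f"unknown step: {step}")

    def decorator(func: Plugin) -> Plugin:
        _registry[step].append(func)
        return func

    return decorator


def run_workflow(context: dict) -> dict:
    """Run all registered plugins, step by step, in the proposed order."""
    for step in STEP_ORDER:
        for plugin in _registry[step]:
            context = plugin(context)
    return context


@extension("harvest")
def harvest_example(context: dict) -> dict:
    # Illustrative only: pretend a version was harvested from some source.
    context.setdefault("metadata", {})["version"] = "1.0.0"
    return context


@extension("report")
def report_dump(context: dict) -> dict:
    # The simplest report form mentioned above: a plain data model dump.
    context["report"] = dict(context["metadata"])
    return context
```

With these two plugins registered, `run_workflow({})` returns a context whose `report` entry is a copy of the harvested metadata. Steps without registered plugins are simply skipped, so each step can be filled in independently.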