pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0
305 stars 444 forks source link

Record the provenance of a submission when it is added to the system #8389

Open asmecher opened 1 year ago

asmecher commented 1 year ago

Describe the bug At the moment, submissions can be created via 3 mechanisms:

Later, it's sometimes necessary to try to distinguish between these sources using heuristics. For example:

Recommendation:

Suggested vocabulary for the column:

ajnyga commented 1 year ago

How do we treat submissions that were created with Quicksubmit and sent to production and were then worked on in the production stage and published later through the normal workflow? Just a thought, not sure if this needs to affect the issue at all...

NateWr commented 1 year ago

I like it. A few more use cases to consider:

  1. Removing all of the incomplete submissions that get created by normal use of the QuickSubmit plugin. This use case is describe in #7224, and is related to #6528 and #4540.
  2. Tracking imports by SWORD integration.
  3. I think there is a CSV import plugin out there?

Some recommendations:

  1. Change workflow to submitted, since they are being created by the submission wizard not the editorial workflow.
  2. Change quickSubmit-heuristic to quickSubmit-guess just because heuristic is kind of jargon-y.
  3. Instead of unknown-upgrade, it should be set to null. We are going to run into cases where submissions are created through third-party scripts or tools, so allowing the field to be null is a good way to indicate "we don't know".
  4. Let's stick to camelCase, quickSubmitHeuristic, or dasherized, quick-submit-heuristic, but not mix the two.

Finally, while we're doing this, let's add a created_at column to the submissions table. This is different from date_submitted and date_published and could be useful in conjunction with the source column.

NateWr commented 1 year ago

On second thought, what about just quickSubmit-old or quickSubmit-legacy to indicate that they represent historical data?

jonasraoni commented 1 year ago

How do we treat submissions that were created with Quicksubmit and sent to production and were then worked on in the production stage and published later through the normal workflow?

@ajnyga AFAICS it will fail to be recognized as an imported submission only if you unpublish it, then change the publication date to somewhere in the future (after the submitted date), if you're creating new versions it should work fine.

  1. Instead of unknown-upgrade, it should be set to null

Agreed.

  1. Let's stick to camelCase or dasherized

If we're going to write custom keys, I vote for whichever format we're using the most (I personally prefer dasherized for database keys).

Alternative:

Finally, while we're doing this, let's add a created_at

Agreed, I've created a discussion for this here: https://github.com/pkp/pkp-lib/discussions/7977

In the absence of historic data/audit/backups, solving unexpected bugs/issues becomes much easier when you have access to some timestamps.

asmecher commented 1 year ago

@defstat made some suggestions to me on Mattermost that involved creating a better machine-readable audit trail that covered the submission's lifespan in general, not just where it was created. In my opinion that risks becoming much bigger than the use cases documented here, but one way of achieving it without creating a new mechanism would be to use the existing event log toolset. It is already machine-readable-ish, though we do need to improve our current practice around numeric constants for event_type -- it does not grow well e.g. across upgrades and with plugins.

If we choose the event log, resolving this issue could be done by just making sure there are appropriate event log entries for where the submission is created that are different for each source (SWORD, QuickSubmit, XML import/export, etc).

This has downstream benefits of potentially letting us capture the provenance of a submission even if it moves between systems -- for example, a QuickSubmit submission that was later exported and imported via XML.

jonasraoni commented 1 year ago

I think event logs should be used mostly for auditing generic actions (if I wipe out this table, it shouldn't cause much harm to the system). As this information is somehow part of the submission state, and will be used at the statistics page, I think it's better to have a dedicated field for it instead of looking for data in the trunk, but that's just my opinion =]

NateWr commented 1 year ago

I think every action should be recorded in the activity log, including an import, quick submit, "normal" submission, etc. If we need data of the submission state, it should be recorded as last_source or something, to indicate that it can change and doesn't represent immutable data.

defstat commented 1 year ago

My proposal for that is to extend the "audit recordings" (activity log/event log) as needed and then use them as the base/source to populate the (or any related) new column with, for example, the indicator to whether a submission should be taken into account for the statistics. The statistical data will be retrieved the way they are now, no changes into that.

asmecher commented 1 year ago

@defstat's draft PRs:

A few quick thoughts:

The iSubmissionIntroducer interface doesn't add much, in my opinion, since anything that introduces a submission needs to pass itself into the Repository::add function could just as easily pass in a log entry. If the ::add function required a SubmissionIntroducerEventEntry parameter, then it could let the calling code instantiate it. The SubmissionIntroducerEventEntry could have a

final public function getEventType(): int
{
    return SUBMISSION_LOG_CREATED;
}

...and possibly even something like...

public function getSource(): string
{
    return $this->getData('source');
}

public function getParams(): array
{
    return array_merge($this->getData('params'), [
        'source' => $this->getSource(),
    ]
}

This way calling code will need to instantiate and configure a SubmissionEventLogEntry before calling Repository::add. It doesn't build in any requirement that SubmissionEventLogEntry be subclassed, which is good IMO because we can expect plugins to use this facility and their subclasses will come and go.

Down the line I think it'll be sensible to expect the event log DAO to instantiate objects of the right class, but we don't need to tackle that right now.

Feel free to experiment with any of this!

asmecher commented 1 year ago

@defstat, I'm deferring this -- we likely won't have time to get it merged before RC1.

defstat commented 1 year ago

@asmecher New PRs with requested changes

OJS: pkp/ojs#3905 PKP-LIB: #8996 QuickSubmitPlugin: pkp/quickSubmit#72

asmecher commented 1 year ago

@defstat, I'm going to hold off on reviewing this until @Vitaliy-1 has gotten https://github.com/pkp/pkp-lib/issues/8933 merged; it'll probably mean some changes are required here too.

defstat commented 1 year ago

@asmecher I just rebased and forced pushed those PRs.

nils-stefan-weiher commented 2 months ago

Wonderful idea! Thanks for working on this!