qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0

bug(collection): `CommitCount`, `RunID`, `RunStatus`, `RunDuration` not tracked properly #1903

Closed ramfox closed 3 years ago

ramfox commented 3 years ago

What does Collection track, and how should it track it?

The collection currently uses VersionInfo to track information about each dataset a user has in their (metaphorical) collection.

Here is each field and if/how it should be tracked in the collection:

InitID
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

Username
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

ProfileID
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

Name
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook
- event.ETDatasetRename emitted by logbook

Path
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

Published
- NOT TRACKED

Foreign
- NOT TRACKED

MetaTitle
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

ThemeList
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

BodySize
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

BodyRows
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

BodyFormat
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

NumErrors
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

CommitTime
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

CommitTitle
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

CommitMessage
- event.ETDatasetNameInit emitted by logbook
- event.ETDatasetCommitChange emitted by logbook

WorkflowID
- event.ETWorkflowCreated emitted by workflow
- event.ETWorkflowRemoved emitted by workflow

WorkflowTriggerDescription
- NOT TRACKED. TBH not sure what this is. I'm assuming trigger type?

RunID
- event.ETTransformStart emitted by transform
- event.ETDatasetCommit emitted by logbook
- event.ETTransformWriteRun emitted by logbook

RunStatus
- event.ETTransformStart emitted by transform
- event.ETDatasetCommit emitted by logbook
- event.ETTransformWriteRun emitted by logbook

RunDuration
- event.ETDatasetCommit emitted by logbook
- event.ETTransformWriteRun emitted by logbook

RunCount
- event.ETTransformStart emitted by transform

CommitCount
- event.ETDatasetCommitChange emitted by logbook

FollowerCount
- event.ETRemoteDatasetFollowed emitted by cloud
- event.ETRemoteDatasetUnfollowed emitted by cloud

OpenIssueCount
- event.ETRemoteDatasetIssueOpened emitted by cloud
- event.ETRemoteDatasetIssueClosed emitted by cloud

DownloadCount
- event.ETDatasetDownload emitted by api

Events that we listen to that do not change a particular field

- event.ETDatasetPulled emitted by remote: this adds a dataset to a user's collection
- event.ETDatasetPushed emitted by remote: currently this does nothing
- event.ETRegistryProfileCreated emitted by registry: this updates a username across all relevant datasets in every user's collection

Current bugs

CommitCount is not properly tallied in logbook

Currently we return the Size, i.e. the length of the ops in the given log. However, we only want the number of ops of type CommitModel. This is an easy fix: potentially add a method or function called Commits that narrows the ops down to only the CommitModel ops.
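A minimal sketch of that idea follows. The `Op` and `OpModel` types here are simplified, hypothetical stand-ins and not the actual logbook types; only the names `CommitModel`, `RunModel`, and `Commits` come from this issue. The sketch also does not account for ops that remove commits, if the log contains any.

```go
package main

import "fmt"

// OpModel is a hypothetical stand-in for the model identifier on a logbook op.
type OpModel int

const (
	CommitModel OpModel = iota
	RunModel
	PushModel // hypothetical extra model, only here to show filtering
)

// Op is a simplified, hypothetical logbook operation.
type Op struct {
	Model OpModel
	Ref   string
}

// Commits narrows ops down to only those of type CommitModel.
func Commits(ops []Op) []Op {
	commits := make([]Op, 0, len(ops))
	for _, op := range ops {
		if op.Model == CommitModel {
			commits = append(commits, op)
		}
	}
	return commits
}

func main() {
	ops := []Op{
		{Model: CommitModel, Ref: "v1"},
		{Model: RunModel, Ref: "run1"},
		{Model: CommitModel, Ref: "v2"},
		{Model: PushModel, Ref: "remote"},
	}
	// The current behavior corresponds to len(ops); the desired
	// CommitCount corresponds to len(Commits(ops)).
	fmt.Println("all ops:", len(ops))
	fmt.Println("commit ops:", len(Commits(ops)))
}
```

With this shape, CommitCount would be derived from `len(Commits(ops))` instead of the full op count.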

Run field bugs

1) RunID is only captured for runs triggered by the workflow
2) RunStatus is only captured for runs triggered by the workflow
3) RunDuration is not captured
4) "no changes" is never captured

How to fix: Instead of relying on workflow, we should rely on logbook events. Logbook already tracks this information. It is our source of truth. We already have a precedent of using events emitted by logbook to update the collection.

In the future it may make more sense to rely on workflow events. Unfortunately, with the way things are set up right now, we would miss certain runs. This is because of how the save path and deploy path are structured: they compensate both for not being able to save without a structure or body and for the fact that we call apply inside of save. When we can route all runs through the automation subsystem, and when we can save without a body or structure, workflow events may be the way to go. For now, however, the only place that has reliable information (and is already keeping track of this data) is logbook.

Tracking successful runs that result in a version

We will start relying on event.ETDatasetCommitChange to send us the run information as well as the version information. This event is emitted in WriteVersionSave, which has access to the version info as well as the run state of that version (if there was a run). We just need to ensure that if that version did not contain a run, we do not write over any run information when we add to the collection.
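The "do not write over any run information" rule could look roughly like the sketch below. The `VersionInfo` struct is a trimmed-down, hypothetical copy with only a few of the fields listed above, `MergeCommit` is a made-up helper name, and treating an empty RunID as "this version had no run" is an assumption, not something the codebase is confirmed to do.

```go
package main

import "fmt"

// VersionInfo is a trimmed-down, hypothetical copy of the record the
// collection keeps per dataset. Field types are assumptions.
type VersionInfo struct {
	InitID      string
	Path        string
	CommitTitle string
	RunID       string
	RunStatus   string
	RunDuration int64
}

// MergeCommit applies the payload of a commit-change event to the stored
// record. Version fields always come from the incoming payload. Run fields
// are kept from the stored record when the incoming version carries no run
// (assumed here to mean an empty RunID).
func MergeCommit(stored, incoming VersionInfo) VersionInfo {
	merged := incoming
	if incoming.RunID == "" {
		merged.RunID = stored.RunID
		merged.RunStatus = stored.RunStatus
		merged.RunDuration = stored.RunDuration
	}
	return merged
}

func main() {
	stored := VersionInfo{
		InitID: "abc", Path: "/ipfs/v1", CommitTitle: "first",
		RunID: "run-1", RunStatus: "succeeded", RunDuration: 1200,
	}
	// A manual save: new version, no run attached.
	manual := VersionInfo{InitID: "abc", Path: "/ipfs/v2", CommitTitle: "manual edit"}

	merged := MergeCommit(stored, manual)
	fmt.Printf("%+v\n", merged)
}
```

In this sketch a manual save moves Path and CommitTitle forward while the last known run stays attached to the dataset.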

Tracking successful runs that do not result in a version

We can only know a run has resulted in "no changes" after we have attempted to save the dataset. We call logbook.WriteTransformRun, and the logbook writes a RunModel op. We need to emit an event here, ETTransformWriteRun, that has a version info as its payload.

Tracking a run that ends in error

We can use the above ETTransformWriteRun to track runs that end in error.
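As a rough sketch of how the collection might consume an ETTransformWriteRun payload for both the "no changes" case and the error case: only the run fields are updated, and the version fields of the stored record are left alone because no new version was written. The `VersionInfo` struct, the `ApplyWriteRun` helper, and the status strings "unchanged" and "failed" are all hypothetical and used only for illustration.

```go
package main

import "fmt"

// VersionInfo is a trimmed-down, hypothetical copy of the collection record.
type VersionInfo struct {
	InitID      string
	Path        string
	RunID       string
	RunStatus   string
	RunDuration int64
}

// ApplyWriteRun updates only the run fields of the stored record from the
// payload of a write-run event. A payload for a different dataset (InitID
// mismatch) is ignored. Path and other version fields are never touched,
// since a run that failed or produced no changes wrote no new version.
func ApplyWriteRun(stored, run VersionInfo) VersionInfo {
	if stored.InitID != run.InitID {
		return stored
	}
	stored.RunID = run.RunID
	stored.RunStatus = run.RunStatus
	stored.RunDuration = run.RunDuration
	return stored
}

func main() {
	stored := VersionInfo{InitID: "abc", Path: "/ipfs/v1", RunID: "run-1", RunStatus: "succeeded", RunDuration: 900}

	// Hypothetical payloads: one run with no changes, one that errored.
	noChanges := VersionInfo{InitID: "abc", RunID: "run-2", RunStatus: "unchanged", RunDuration: 300}
	failed := VersionInfo{InitID: "abc", RunID: "run-3", RunStatus: "failed", RunDuration: 120}

	stored = ApplyWriteRun(stored, noChanges)
	fmt.Printf("%+v\n", stored)
	stored = ApplyWriteRun(stored, failed)
	fmt.Printf("%+v\n", stored)
}
```

Because the same handler covers both outcomes, a single event type would be enough for every run that does not produce a version.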

Tracking RunID

We still need to follow ETTransformStart events to capture the run id as soon as possible.