discussion: dataset history and cloud metadata & its storage

qri-io / qri

you're invited to a data party!

https://qri.io

GNU General Public License v3.0


discussion: dataset history and cloud metadata & its storage #1880

Closed ramfox closed 3 years ago

ramfox commented 3 years ago

type Uberdata struct {
  InitID string
  WorkflowID string
  Title string // title at HEAD
  RunCount int // number of runs
  HeadSize int64
  CommitCount int
  DownloadCount int
  FollowerCount int
}

Other field candidates (based on what we expect on the dataset page)

Automated boolean (can be inferred if there is a workflowID)
Private boolean (privacy always starts a sprawling discussion since we have many definitions of what "private" will mean at some point in time)

1) Does this belong in the collection? Our easiest bet would be to stick this struct in the collection. My main concern is that, for the most part, the collection is a cache for information we already store in other places. If the collection is only intended to be a cache, then we need to make sure all of these other pieces of information are safely stored elsewhere. This isn't a problem for fields that exist in other stores, but it is potentially an issue for the cloud-related fields.

2) Which leads me to a question about the cloud-related fields: what do these mean in a local context? Are they zero? FollowerCount and DownloadCount may not make sense locally, but for a remote (non-cloud remote), these might be things we can keep track of. If so, we need systems that track and store these as well. Is that just the Uberdata store, or should we expect a flurry of remote systems that note when someone follows or downloads a dataset?

3) What are we naming this thing? Candidates so far: AggregateData, Uberdata, DatasetHistoryData, DatasetAggData

4) Another cloud-related feature: does something like Issue Count also belong here once issues are implemented?

Arqu commented 3 years ago

Here's my take on it.

Think Automated is just a nice to have for future API users and should be a computed field not something we hand roll.
For Private, I'd skip including it for now. We don't have a clear use for it yet and will probably start using it in like 5 subtly different contexts once we get closer to that. Currently all data is public, and that's the end of it until we have the full infrastructure for it.
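A minimal sketch of Automated as a computed value rather than a hand-rolled field, assuming the Uberdata struct from the first comment. The `Automated` method name is hypothetical and not part of any existing qri API.

```go
package main

import "fmt"

// Uberdata mirrors the struct proposed at the top of this thread,
// trimmed to the fields this sketch needs.
type Uberdata struct {
	InitID     string
	WorkflowID string
}

// Automated is a hypothetical helper. It derives the flag from WorkflowID
// instead of storing a separate boolean that could drift out of sync.
func (u Uberdata) Automated() bool {
	return u.WorkflowID != ""
}

func main() {
	manual := Uberdata{InitID: "a1"}
	scheduled := Uberdata{InitID: "b2", WorkflowID: "wf-1"}
	fmt.Println(manual.Automated(), scheduled.Automated())
}
```

If API users need the flag in JSON output, it could be filled in at serialization time from the same check.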

1) Honestly, I think the main discussion point IS "what is collection". In the past 2 weeks we've started leaning A LOT on it. Personally I'd say VersionInfo +/- a small bit would be the best "core" structure for collection, with the rest populated from other systems. I wouldn't cache it: stats should live in stats, and if they are cached there, great; if not, that's a stats problem, not a collection problem. As such, I think all the other semi-UX-related fields should go into a subsection and be plug and play. Issues, following, metrics, etc. should all be plug'n'play features of qri, and if the implementation is not provided, it just doesn't appear further in the code. BUT there is a strong need to surface them if they are present; a decent chunk of the cloud UI is tacked on in the wrong places to fill that demand. DatasetPreviews are the ones most bent to that shape, which is wrong. Collections are a good spot to present it, but I'd keep it as a lower-tier item in the struct.

2) Think 1) kinda answers it: basically those should not be provided/used/populated if the implementation is not provided.
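One way to picture the "plug and play" idea is an optional provider that fills a lower-tier section of the item only when it is configured. This is a sketch under assumptions: `StatsProvider`, `CloudStats`, `Item`, and `Populate` are all hypothetical names invented for illustration.

```go
package main

import "fmt"

// CloudStats holds fields only a cloud or remote subsystem can supply.
type CloudStats struct {
	FollowerCount int
	DownloadCount int
}

// StatsProvider is a hypothetical plug-in interface. A node with no
// follow/download tracking simply has no provider.
type StatsProvider interface {
	Stats(initID string) (CloudStats, bool)
}

// Item is a hypothetical collection entry: core fields first, optional
// extras in a lower tier that stays nil when nothing provides them.
type Item struct {
	InitID string
	Name   string
	Cloud  *CloudStats
}

// Populate fills the optional tier only when a provider is configured
// and has data for the item.
func Populate(it Item, p StatsProvider) Item {
	if p == nil {
		return it
	}
	if s, ok := p.Stats(it.InitID); ok {
		it.Cloud = &s
	}
	return it
}

// mapProvider is an in-memory provider used only for this demo.
type mapProvider map[string]CloudStats

func (m mapProvider) Stats(initID string) (CloudStats, bool) {
	s, ok := m[initID]
	return s, ok
}

func main() {
	it := Item{InitID: "a1", Name: "demo"}
	local := Populate(it, nil)
	cloud := Populate(it, mapProvider{"a1": {FollowerCount: 4, DownloadCount: 12}})
	fmt.Println(local.Cloud == nil, cloud.Cloud.FollowerCount)
}
```

A nil pointer lets callers tell "not tracked on this node" apart from "tracked, and the count is zero", which plain int fields cannot express.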

3) No preference, also no idea. Maybe collection -> horde, item -> orc and just have fun with it :P

4) same as 1 & 2

chriswhong commented 3 years ago

Not much to add on the implementation, but I do think this is also where we surface issueCount and meta.description in case we ever need it. Will add more fields here as I come across needs in the wild.

Right now in list we have the runStatus and runID, but I am finding that I also need the runTime and the runDuration. (Note: for the moment we are using commitTime as the time a run started in /activity, which is kind of confusing since there may not be a commit.)

ramfox commented 3 years ago

@Arqu I like the idea of having a core implementation and then space/flexibility for subsystems to track other information. Right now, collection functions by listening and responding to events. As long as we know what fields are associated with what events, we can set up a system that won't break if the event doesn't ever fire. This would require coordination and expectations from cloud.

Just to lay it out, these are the fields that don't currently exist in VersionInfo but are expected for AggregateData:

`RunCount`
`CommitCount`
`DownloadCount`
`FollowerCount`

Side note: currently the collection is where we get lists of dataset information; the only endpoint/fetching we have is List. This would need to change to satisfy the frontend, which needs to get the AggregateData via the ref or InitID. Not a huge problem, but this is a shift in expected behavior. It would also point toward having a separate section for this AggregateData information.

b5 commented 3 years ago

The reason collection is the best candidate for this: it's both a cache and inherently oriented toward dataset HEAD versions.

One option: we could have this new struct embed a VersionInfo, and have collection track that struct instead:

type DatasetInfo struct {
  dsref.VersionInfo

  RunCount int
  CommitCount int
  DownloadCount int
  FollowerCount int
  OpenIssueCount int
}

Upsides of this approach:

much easier to keep fields de-duplicated
current list operation can just strip off the outer struct
very easy to just swap the data model for list if that ends up making more sense
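To illustrate the embedding approach and the "strip off the outer struct" point, here is a small sketch. `VersionInfo` is a trimmed stand-in for `dsref.VersionInfo`, and `ToVersionInfos` is a hypothetical helper, not an existing qri function.

```go
package main

import "fmt"

// VersionInfo is a stand-in for dsref.VersionInfo with just a few fields.
type VersionInfo struct {
	InitID   string
	Username string
	Name     string
}

// DatasetInfo embeds VersionInfo, so its fields are promoted to the
// outer struct and never need to be duplicated.
type DatasetInfo struct {
	VersionInfo

	RunCount    int
	CommitCount int
}

// ToVersionInfos strips the outer struct, which is all a list operation
// that still returns VersionInfos would need to do.
func ToVersionInfos(infos []DatasetInfo) []VersionInfo {
	out := make([]VersionInfo, 0, len(infos))
	for _, di := range infos {
		out = append(out, di.VersionInfo)
	}
	return out
}

func main() {
	infos := []DatasetInfo{
		{VersionInfo: VersionInfo{InitID: "a1", Username: "ramfox", Name: "demo"}, CommitCount: 3},
	}
	// Promoted field access: infos[0].Name reads VersionInfo.Name.
	fmt.Println(infos[0].Name, infos[0].CommitCount)
	fmt.Println(ToVersionInfos(infos)[0].InitID)
}
```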

ramfox commented 3 years ago

Proposed new events to track DatasetInfo fields

// ETDatasetDownload indicates that a dataset has been downloaded
// payload is an `InitID` string
ETDatasetDownload = Type("dataset:Download")

// ETRemoteDatasetFollowed indicates that the dataset has been followed by a user
// payload is an `InitID` string
ETRemoteDatasetFollowed = Type("remote:DatasetFollowed")

// ETRemoteDatasetUnfollowed indicates that the dataset has been unfollowed by a user
// payload is an `InitID` string
ETRemoteDatasetUnfollowed = Type("remote:DatasetUnfollowed")

// ETRemoteDatasetIssueOpened indicates that an issue has been opened for this dataset
// payload is an `InitID` string
ETRemoteDatasetIssueOpened = Type("remote:DatasetIssueOpened")

// ETRemoteDatasetIssueClosed indicates that an issue has been closed for this dataset
// payload is an `InitID` string
ETRemoteDatasetIssueClosed = Type("remote:DatasetIssueClosed")

The RunCount field should listen to ETTransformStart events. This, however, will count both apply runs and save runs, so until we settle on a way to distinguish the two, the count may be inaccurate (unless we decide RunCount applies to both sorts of runs, in which case we are good). Further, does a run have to be successful to be counted? If so, ETTransformStop events would be a better fit.

CommitCount should listen to ETDatasetCommitChange events.

This is predicated on the idea that cloud or a remote would be able to publish an event that collection can respond to.
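As a rough sketch of how a collection-side handler could turn the proposed events into counts: the event type names come from the proposal above, while `Tracker`, `Counts`, `Handle`, and `Get` are hypothetical stand-ins and do not reflect qri's actual event bus or collection API.

```go
package main

import "fmt"

// Type is an event type, following the proposal above.
type Type string

const (
	ETDatasetDownload          = Type("dataset:Download")
	ETRemoteDatasetFollowed    = Type("remote:DatasetFollowed")
	ETRemoteDatasetUnfollowed  = Type("remote:DatasetUnfollowed")
	ETRemoteDatasetIssueOpened = Type("remote:DatasetIssueOpened")
	ETRemoteDatasetIssueClosed = Type("remote:DatasetIssueClosed")
)

// Counts holds the aggregate fields tracked per dataset.
type Counts struct {
	DownloadCount  int
	FollowerCount  int
	OpenIssueCount int
}

// Tracker is a hypothetical stand-in for the collection's event handler.
type Tracker struct {
	byInitID map[string]*Counts
}

// NewTracker returns an empty Tracker.
func NewTracker() *Tracker {
	return &Tracker{byInitID: map[string]*Counts{}}
}

// Handle applies one event whose payload is an InitID. Event types it
// does not know are ignored, and counts never drop below zero.
func (t *Tracker) Handle(et Type, initID string) {
	c, ok := t.byInitID[initID]
	if !ok {
		c = &Counts{}
		t.byInitID[initID] = c
	}
	switch et {
	case ETDatasetDownload:
		c.DownloadCount++
	case ETRemoteDatasetFollowed:
		c.FollowerCount++
	case ETRemoteDatasetUnfollowed:
		if c.FollowerCount > 0 {
			c.FollowerCount--
		}
	case ETRemoteDatasetIssueOpened:
		c.OpenIssueCount++
	case ETRemoteDatasetIssueClosed:
		if c.OpenIssueCount > 0 {
			c.OpenIssueCount--
		}
	}
}

// Get returns a copy of the counts for a dataset, or zero values if no
// event for it has ever fired.
func (t *Tracker) Get(initID string) Counts {
	if c, ok := t.byInitID[initID]; ok {
		return *c
	}
	return Counts{}
}

func main() {
	tr := NewTracker()
	tr.Handle(ETRemoteDatasetFollowed, "a1")
	tr.Handle(ETRemoteDatasetFollowed, "a1")
	tr.Handle(ETRemoteDatasetUnfollowed, "a1")
	tr.Handle(ETDatasetDownload, "a1")
	fmt.Printf("%+v\n", tr.Get("a1"))
}
```

Because a dataset that never receives an event simply reports zeros, a node where cloud or a remote never publishes these events keeps working unchanged.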

Arqu commented 3 years ago

ETDatasetDownload would be handled on the core side, as the download request is just piped into dataset.get with format=zip.
The rest is pretty easy, I already have some "cloud" events which would be great to move out of cloud and down into core.

ramfox commented 3 years ago

Still on the proverbial board for future collection refactors:

1) Refactor LocalSet to de-normalize usernames (and perhaps dataset names?).

2) Switch from dealing in dsref.VersionInfos to DatasetInfos, as detailed by this discussion.

3) If there are any "custom" events that we feel belong in cloud but are not implemented by core, cloud can use the same underlying collection.Set to record information. If this refactor occurs, we may want to expand DatasetInfo to include a custom map[string]interface{} field that cloud can record to.
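A sketch of what a custom field on DatasetInfo might look like, assuming JSON storage. The field name `Custom`, its JSON key, the `starCount` key, and the `RoundTrip` helper are all made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DatasetInfo is a trimmed, hypothetical version of the struct discussed
// above, with an open-ended Custom map that cloud could record to.
type DatasetInfo struct {
	InitID      string                 `json:"initID"`
	CommitCount int                    `json:"commitCount,omitempty"`
	Custom      map[string]interface{} `json:"custom,omitempty"`
}

// RoundTrip encodes and decodes a DatasetInfo, the way a JSON-backed
// store would when saving and loading it.
func RoundTrip(di DatasetInfo) (DatasetInfo, error) {
	var back DatasetInfo
	data, err := json.Marshal(di)
	if err != nil {
		return back, err
	}
	err = json.Unmarshal(data, &back)
	return back, err
}

func main() {
	di := DatasetInfo{
		InitID: "a1",
		Custom: map[string]interface{}{"starCount": 7}, // hypothetical cloud-only key
	}
	back, err := RoundTrip(di)
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	// encoding/json decodes numbers inside interface{} values as float64.
	n, ok := back.Custom["starCount"].(float64)
	fmt.Println(n, ok)
}
```

The trade-off is type safety: an int written into the map comes back as a float64 after a JSON round trip, so consumers of the custom field have to convert values themselves.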

ramfox commented 3 years ago

Change of plans after convo with @b5

We do not have much user feedback about what fields or other bits of info users may want us to keep track of. Since this is our first iteration of adding this aggregate information, let's keep things fast and simple for now.

We are going to expand VersionInfo to include these "aggregate fields", with the knowledge that when things are settled, these fields will be pulled out into (perhaps) their own data structure, store, or subsystem. These additional fields will only be expected to be filled inside the collection package. (We must make sure to tag these fields with json:",omitempty".) We will have TODOs about moving these fields once we have more solid information about how/if they will be used.

We will be adding an endpoint /collection/get that returns a single version info based on the Ref or InitID. This is the endpoint frontend will hit in order to get this aggregate data.
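A minimal sketch of such an endpoint using net/http. Everything except the `/collection/get` path is an assumption: the query parameter names `initID` and `ref`, the GET method, the in-memory `Store`, and the `request` helper are hypothetical and may not match how qri's API actually accepts parameters.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// VersionInfo is a trimmed stand-in for dsref.VersionInfo.
type VersionInfo struct {
	InitID      string `json:"initID"`
	Username    string `json:"username"`
	Name        string `json:"name"`
	CommitCount int    `json:"commitCount,omitempty"`
}

// Ref returns the "username/name" form of the dataset reference.
func (vi VersionInfo) Ref() string { return vi.Username + "/" + vi.Name }

// Store is a hypothetical in-memory collection.
type Store []VersionInfo

// Get finds one item by InitID when it is given, otherwise by ref.
func (s Store) Get(initID, ref string) (VersionInfo, bool) {
	for _, vi := range s {
		if initID != "" && vi.InitID == initID {
			return vi, true
		}
		if initID == "" && ref != "" && vi.Ref() == ref {
			return vi, true
		}
	}
	return VersionInfo{}, false
}

// GetHandler serves a single VersionInfo as JSON, or a 404 if the
// dataset is not in the collection.
func GetHandler(s Store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query()
		vi, ok := s.Get(q.Get("initID"), q.Get("ref"))
		if !ok {
			http.Error(w, "dataset not found in collection", http.StatusNotFound)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(vi)
	}
}

// request runs the handler in-process and returns the status and body.
func request(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	h := GetHandler(Store{{InitID: "a1", Username: "ramfox", Name: "demo", CommitCount: 3}})
	fmt.Println(request(h, "/collection/get?ref=ramfox/demo"))
	fmt.Println(request(h, "/collection/get?initID=missing"))
}
```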

type VersionInfo struct {
        ...
        //
        //
        // Aggregate Fields
        // TODO (ramfox): These fields are only temporarily living on `VersionInfo`.
        // They are needed by the frontend to display "details" about the head
        // of the dataset. When we get more user feedback and settle what info
        // users want about their datasets, these fields may move to a new struct,
        // store, or subsystem.
        // These fields are not derived from any `dataset.Dataset` fields.
        // These fields should only be used in the `collection` package.
        //
        // RunCount is the number of times this dataset's transform has been run
        RunCount int `json:"runCount,omitempty"`
        // CommitCount is the number of commits in this dataset's history
        CommitCount int `json:"commitCount,omitempty"`
        // DownloadCount is the number of times this dataset has been directly
        // downloaded from this Qri node    
        DownloadCount int `json:"downloadCount,omitempty"`
        // FollowerCount is the number of followers this dataset has on this Qri node
        FollowerCount int `json:"followerCount,omitempty"` 
        // OpenIssueCount is the number of open issues this dataset has on this
        // Qri node
        OpenIssueCount int `json:"openIssueCount,omitempty"`
}
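To show what the omitempty tags buy us, here is a small sketch using a trimmed copy of the struct. Only the aggregate fields and their tags are taken from above; the `InitID` tag and the `encode` helper are assumptions for the demo.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VersionInfo is a trimmed copy holding only an ID and the aggregate fields.
type VersionInfo struct {
	InitID         string `json:"initID,omitempty"`
	RunCount       int    `json:"runCount,omitempty"`
	CommitCount    int    `json:"commitCount,omitempty"`
	DownloadCount  int    `json:"downloadCount,omitempty"`
	FollowerCount  int    `json:"followerCount,omitempty"`
	OpenIssueCount int    `json:"openIssueCount,omitempty"`
}

// encode returns the JSON form of vi, or the error text if encoding fails.
func encode(vi VersionInfo) string {
	data, err := json.Marshal(vi)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(data)
}

func main() {
	// Outside the collection package the aggregate fields stay zero,
	// so they vanish from the output entirely.
	fmt.Println(encode(VersionInfo{InitID: "a1"}))
	// prints {"initID":"a1"}

	// Inside the collection package, populated counts are serialized.
	fmt.Println(encode(VersionInfo{InitID: "a1", CommitCount: 3, FollowerCount: 2}))
	// prints {"initID":"a1","commitCount":3,"followerCount":2}
}
```

One consequence worth keeping in mind: with omitempty, a real count of zero is indistinguishable from "not tracked" in the JSON, so the frontend has to treat a missing field as zero.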