Dataservice query on update

mattions commented 6 years ago

Once a new release is made, the client of the data coordinator will come to the data service to collect the data.

Would it be possible to have an API that returns the differences in a study, by file? Something like:

GET studies/{kf_id}/difference?from={revision}&to={revision}&by=file

where the revision is whatever revision system we decide to adopt according to https://github.com/kids-first/kf-api-release-coordinator/issues/39

What would be good to get, at least from the Cavatica point of view, is a response that carries this kind of information:

[{"new": [file1, file2, file3]},
{"updated" : [file4, file5, file6]},
{"deleted" : [file7, file8]}
]

where file1 and so forth are the URLs of the files, from which all the other metadata can be extracted and normalized.
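To make the requested response shape concrete, here is a minimal sketch of how such a diff could be computed from two snapshots of a study's files. The function name `diff_files`, the snapshot layout (a dict mapping file URL to a content hash), and the example URLs are all assumptions for illustration; none of them exist in the dataservice.

```python
from typing import Dict, List


def diff_files(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots mapping file URL -> content hash (hypothetical layout)."""
    return {
        # Present only in the newer snapshot.
        "new": sorted(url for url in new if url not in old),
        # Present in both snapshots, but with a different hash.
        "updated": sorted(url for url in new if url in old and new[url] != old[url]),
        # Present only in the older snapshot.
        "deleted": sorted(url for url in old if url not in new),
    }


# Hypothetical example data: URLs and hashes are placeholders.
old_release = {"/genomic-files/file4": "aaa", "/genomic-files/file7": "bbb"}
new_release = {"/genomic-files/file1": "ccc", "/genomic-files/file4": "ddd"}
diff = diff_files(old_release, new_release)
```

A single object with three keys is used here instead of a list of one-key objects, purely to keep the lookup simple on the client side.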

dankolbman commented 6 years ago

We don't plan on exposing historic versions of data directly inside the dataservice at the moment. It's more intended to be the hot version of the data, and releases will be cut as snapshots, so the dataservice itself won't really have a concept of versions.

It may be possible, though, to compile diffs like this via event logs, which we do not yet have implemented.

nikolamir commented 6 years ago

Hi @dankolbman, we are not asking for this because we need access to historic versions of the data. Instead, we are thinking in terms of being able to update to the latest version (release) of the dataset as efficiently as possible. Is it possible for you to send us (as part of the task) the list of files (Gen3 UUIDs) that were added, removed, or changed? Maybe something like the Release Notes that are available on the GDC Data Portal for each new Data Release? With the solution you currently propose, we would have to check the files that are part of the updated study (this information is not maintained on our side, so it is not efficient to query), then find out which files were added, which were removed, and what metadata changed for existing files. Re-importing the complete dataset and switching to that latest version looks like the easier thing to do, but we think it would be most efficient if you could provide us with a list of changes to apply to the current version of the dataset.

One more question: does changing the metadata of an existing file mean that the file gets a new UUID, or can a specific file have different metadata keys/values over time while keeping the same UUID? In other words, what changes are possible for a single file? Besides adding a file to or removing it from a dataset/study, is updating an existing file possible (and, if yes, which changes are possible: only changes to the metadata, or changes to the content of the file as well)?

znatty22 commented 6 years ago

hey @nikolamir, when you say metadata for the file, are you talking about just the file attributes like file name, file size, hashes, etc. or are you also talking about the associated clinical metadata for the file like participant attributes (proband, demographics, etc) biospecimen attributes, etc?

mattions commented 6 years ago

@znatty22 we are talking about all of them.

dankolbman commented 6 years ago

We see what you're getting at with being able to compute diffs. We could implement filters by created_at and modified_at times to determine what has changed since a release. That would be the first step in computing the diff. We won't be able to provide the diff of the specific metadata tied to a file. Implementing that operation is likely far more complex than simply reloading all of the metadata related to the file. The case of updated metadata for a file that has not itself changed (only its clinical info has changed, say) will be equally difficult to handle. It seems best to simply reload all of the files for the first round of development, and we can tweak as necessary.
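As a rough illustration of that first step, the sketch below splits records into created versus modified since a release, using only the created_at and modified_at timestamps. The function name `changed_since`, the record layout, and the kf_id values are hypothetical. Note that deletions cannot be detected this way, because a deleted record is simply absent.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def changed_since(records: Iterable[dict], release_time: datetime) -> Dict[str, List[str]]:
    """Classify records by timestamp relative to release_time.

    Each record is assumed to carry kf_id, created_at, and modified_at.
    """
    result: Dict[str, List[str]] = {"new": [], "updated": []}
    for rec in records:
        if rec["created_at"] > release_time:
            # Created after the release, so the client has never seen it.
            result["new"].append(rec["kf_id"])
        elif rec["modified_at"] > release_time:
            # Existed at release time but was touched afterwards.
            result["updated"].append(rec["kf_id"])
    return result


# Hypothetical example data.
release = datetime(2018, 5, 1, tzinfo=timezone.utc)
records = [
    {"kf_id": "GF_00000001",
     "created_at": datetime(2018, 4, 1, tzinfo=timezone.utc),
     "modified_at": datetime(2018, 4, 1, tzinfo=timezone.utc)},
    {"kf_id": "GF_00000002",
     "created_at": datetime(2018, 4, 1, tzinfo=timezone.utc),
     "modified_at": datetime(2018, 5, 10, tzinfo=timezone.utc)},
    {"kf_id": "GF_00000003",
     "created_at": datetime(2018, 5, 20, tzinfo=timezone.utc),
     "modified_at": datetime(2018, 5, 20, tzinfo=timezone.utc)},
]
changes = changed_since(records, release)
```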

The UUID and kf_id of the files and entities themselves should never change for the same entity. We may expect that any attribute of the file may be changed, file_name, size, urls, etc. and that the contents of the actual file may also be changed. The metadata, meaning any related entities to that file, may also change, though the only way to tell that would be to join across the metadata of interest and inspect the timestamps on them.

allisonheath commented 6 years ago

Chatted with @mattions a bit about this - the challenge here is that, unlike with the portal, on Cavatica users already have a handle on a file and have possibly done some processing on it. So if something gets updated with a previously imported file, there needs to be the ability to know (and thus indicate to the user):

1) If the file itself is no longer accessible or has been replaced with a new file
2) If the metadata on the file (which in this case can be any of the entities and their properties associated with the file) has been changed

For 1) I believe the principle is that in Gen3 all generated UUIDs for files are persisted, so the fact that the file once existed should always be available, even if the specific file may no longer be available. Then we can use the ability to find the latest version of the file and provide that as part of the release. If for some reason the file is no longer available in any form, we should have the ability to provide that and a reason.

For 2) I think we should be able to provide a list of entities (and their IDs) that changed from the last release, but at the moment we need to push the use cases of determining how that maps to specific files and which specific fields get updated to the client side.
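On the client side, mapping a list of changed entity IDs back to files could look roughly like the sketch below. It assumes the client keeps an index from each file ID to the IDs of its associated entities; that index, the function name `files_affected`, and all IDs shown are hypothetical.

```python
from typing import Dict, Iterable, Set


def files_affected(changed_entities: Iterable[str],
                   file_links: Dict[str, Set[str]]) -> Set[str]:
    """Return the file IDs linked to at least one changed entity.

    file_links maps a file ID to the IDs of the entities associated with it
    (participants, biospecimens, and so on), as tracked by the client.
    """
    changed = set(changed_entities)
    # A file is affected when its linked entities overlap the changed set.
    return {file_id for file_id, linked in file_links.items() if linked & changed}


# Hypothetical example data.
file_links = {
    "GF_00000001": {"PT_00000001", "BS_00000001"},
    "GF_00000002": {"PT_00000002", "BS_00000002"},
}
affected = files_affected(["BS_00000002"], file_links)
```

This only identifies which files need their metadata reloaded; it does not say which specific fields changed.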

allisonheath commented 6 years ago

Per the discussion during the tech call today - the first version of the dataservice/coordinator will just support a full refresh at the per-study level. Once that is solid, we can revisit the ability to provide identifiers of what has changed between releases.

kids-first / kf-api-dataservice

Dataservice query on update #251