Global, persistent, unique identifier to available datasets

d70-t commented 3 years ago

As David, I want to have a persistent identifier onto a dataset which I can use in my scripts to directly refer to a dataset in a global, fault tolerant and performant manner. I want to do that because this allows to share data analysis scripts with others without the need for additional descriptions how to manually download the data. I.e. this kind of script should work on any computer anywhere on the world:

dataset = get_dataset("<persistent identifier>")
create_fancy_plot(dataset)

This case of script should also work if any single server (or better datacenter) becomes unavailable. There should also be a guarantee that the dataset doesn't change over time because otherwise my results aren't reproducible anymore.

David talked to Francesca who is amazed by this idea as that would have saved her from the current data mess she has to handle.

joerg-halo commented 3 years ago

Is this use case equivalent to #32?

d70-t commented 3 years ago

Is this use case equivalent to #32?

@joerg-halo no, this is quite different from #32. The result of a search by some standardized names will generally return a list of results. Also this list of results will usually change due to the addition of new datasets or the creation of new versions.

This use case is about accessing exactly one dataset and the exact contents of that dataset should never change, because that identifier might be used in other studies or publications and if the result of the get_dataset("<persistent identifier>") function would change over time, then the results of all studies referring to it would not be reproducible.

An additional note might be that the form of the identifier does not really matter and it might even be beneficial if the identifier itself does not tell anything about the dataset, which is quite contrary to the input of a dataset search query.

Note that this is also different from #7 and #15 in a subtle way. First, DOIs only require that the user is directed to a so called "landing page", which provides some HTML information about the Dataset and maybe links to the actual dataset in some unspecified way. Also DOIs do not ensure that the dataset might change and I have already seen some DOIs which are intended to point to a dataset, but the dataset changed substantially (i.e. a 50% change in dataset size). Third, while DOIs can be redirected from one server to another if the storage location has to be changed, it still points only to a single datacenter location and it still requires the single doi.org-Service to be up and running, which is not very fault tolerant.

bjoernbroetz commented 3 years ago

Question: How can a user trust that a received dataset contains the correct data? Answer: The PID (persistent identifier) may contain a (cryptographic) hash of the dataset. The user can then re-compute the hash from the received dataset and compare it with the PID. If they match, the dataset is correct.

Question: What happens if a dataset should only be used partially (i.e. one variable or a subset). Answer: This can be done if the hash is constructed from a Merkle-Tree (Hash-Tree) spanning the dataset. The tree should match the datasets logical structure. An example of such a tree structure would be the zarr storage specification (v2, draft-v3). netCDF is (/will soon) supporting zarr as a backend.

bjoernbroetz commented 3 years ago

Question: How can redundancy be achieved? Answer: As a client can verify the contents of the received dataset by using the hash, it doesn't matter who is sending the dataset to the client. Accordingly, datasets can just be copied to multiple servers and the client can accept data from any of those.

Question: How does a client find out which server can deliver which dataset? Answers:

There could be a central database (this is bad, as it creates a single point of failure)
There could be a list of servers and the client could just ask all of them for the PID, then choose any matching server. (this scales badly if there are many users and servers, but for a few it might work)
There could be a distributed hashtable mapping PIDs to servers / nodes (e.g. Kademlia protocol, implemented in IPFS (IPFS DHT doc) or in Dat's hypercore protocol) (this scales well to a large number of users and servers)

bjoernbroetz commented 3 years ago

Scope beachten:

Halo-DB
extern System von Identifier soll in der HALO-DB verlässlich gelten. Außerhalb können wir nur überzeugen, haben aber keine Kontrolle.

d70-t commented 3 years ago

Question: How can we use PIDs but still be compatible to datasets from other servers which we might want to include in HALO-DBs search index or visualization / browsing tools? Answer: Accessing datasets could be implemented using a universal function which is able to handle different protocols. That way, if a dataset is referenced by a location based reference (like a HTTP-link), the function can use this protocol, but if it is referenced by a PID, the function can do dataset discovery and verification. To a user, this could look like the following:

maybe_the_same_dataset = get_dataset("http://some reference...")
exactly_the_same_dataset = get_dataset("pid://some pid...")

Note that here pid only stands for one possible protocol.

From within the HALO-DB dataset index, we could handle either reference as just a plain string, so it should be possible to handle both in a similar manner throughout the whole project.

Note that only the pid-Variant will guarantee that the result will always be the same and provides cross-datacenter redundancy and load balancing, which is a great advantage above the http-based solution. This might inspire other dataset providers to follow this idea.

bjoernbroetz commented 3 years ago

Use Case: Als User möchte ich sicher sein, dass die Daten, die ich von der HALO-DB zurück bekomme exakt die sind, die ich hineingeladen habe.

bjoernbroetz commented 3 years ago

Archiv Aspekt: PID für alle Daten zur uploadzeit Such Aspekt: Liste von PID als Ergebnis einer Suche, oder als Liste in einem Paper, oder auf anderen beliebigen Wegen

bjoernbroetz commented 3 years ago

We discussed that storing data and getting exactly the same data back at any later time is the most essential requirement to a data archive. This is the reason for adding the essential tag.

Achieving this goal is technically complex, which we also wanted to point out as a tag. However, for all necessary parts we could think of, there is at least one solution which has been proven to be working. So it is also doable.

We don't think that it is possible to reduce this use case into smaller parts, that is the reason why we have removed the big label.

Some more notes:

It is technical feasible.
A user should be able to verify data integrity upon retrieval.
- Obtaining subsets of the data and verifying that data is possible using Merkle Trees
A hash based solution comes with strong advantages regarding performance and availability. As a user is able to verify the dataset, it doesn't matter where it comes from, so setting up backup servers or local caches becomes about trivial.
It is possible to keep the scope on HALO-DB: we can provide PIDs for datasets on HALO-DB to others and accept weaker references by others (see above) i.e. to support browsing and plotting. We can also demonstrate the benefits of real PIDs to others and probably increase the amount of PID supplying data archives.

d70-t commented 3 years ago

If we have a persistent identifier (PID) to a dataset, it should be really simple to "upgrade" it to a DOI. By prefixing https://doi.org/10..../ in front of the identifier, a unique DOI could be created. Then if a users accesses the DOI, the user will be redirected to a landing page (which could be generated automatically from the dataset contents). If the user in stead accesses the persistent identifier directly, the user can access the dataset directly.

Notes:

having both (DOI + direct PID) is a good thing, as depending on the use case, a user might want to see a landing page or might to work on the dataset directly
DOIs should not contain semantic information. One could argue that putting something which has a different meaning (i.e. PID) into a DOI is already attaching semantic information. However the PID itself would likely not contain any further semantic information. I'd argue that a PID based DOI is actually a very good implementation to create DOIs.

d70-t commented 3 years ago

Question: How could another database / client which is unaware of the dataset discovery mechanism obtain datasets from the HALO-DB? Answer: It would be possible to create a gateway which makes PID based datasets available via HTTP (e.g. like the IPFS gateway functionality). The downside of this approach is of course that some of the advantages which are possible by PIDs will not be available to such users.

d70-t commented 3 years ago

Consent from Zoom Meeting on this topic:

The database shall ensure the integrity of delivered data.
The user shall be able to independently verify the integrity of received data.

We also noted that a possible implementation to verify a citable dataset all the way from author to user may be possible by including a cryptographic hash of the datasets contents into the DOI of the dataset. This is intentionally not a recommendation as we want to give this freedom to possible implementors.

d70-t commented 3 years ago

Probably W3C's distributed identifiers could become a good baseline for the proposed unique identifiers.

halo-db / storymap

Global, persistent, unique identifier to available datasets #20