HALO DB searches on other databases

joerg-halo commented 3 years ago

As a scientific user, I have already uploaded my data to another database, because all data from a multi-platform campaign are stored there. I want to make this dataset accessible through the HALO database without uploading it twice. One benefit would be that I do not need to do anything when new data versions appear, because both repositories will always show the same dataset.

d70-t commented 3 years ago

I think this is a really important one. It would be even better if the dataset could appear automatically in other databases. It would also be great if this worked bidirectionally, so that datasets which are in the HALO DB could easily appear in other databases.

joerg-halo commented 3 years ago

Protocol from tandem session with @rico-hengst and discussion within the database workgroup

Supporting protocols for metadata harvesting and implementing interface(s) that allow the HALO-DB to search other databases, and vice versa, is recommended.
Thus, datasets could be stored in a decentralized way and accessed via metadata harvesting.
Several metadata standards (Dublin Core, etc.) exist. To enable the HALO-DB to harvest metadata from third-party databases, supporting the most popular metadata standards should be sufficient. The protocols for metadata harvesting work with standardized metadata only.
Whether search results from other databases are displayed directly or the user is directed to the website of the other database depends on the database where the data are stored. If that database allows for it, a search entered in the HALO-DB can yield the dataset directly.
When data are stored outside the HALO-DB, it has to be ensured that they can be accessed for at least 10 years.
With this feature implemented, large datasets can remain in the data repository of the data provider's institute. Hence, this also touches use case #51 (Provide large datasets).
The effort to offer standardized metadata and to implement protocols for metadata harvesting is moderate, but clearly exceeds the capacity available for regular database work. Hence, additional funding for the implementation should be acquired.
The use of the CF-Conventions (use case #32) is required. Compliance can be checked when the data are indexed.
Support for metadata standards and for metadata harvesting protocols could be implemented on the current HALO-DB. Hence, this point does not touch the question of a possible remake of the HALO-DB.
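
To make the harvesting idea a bit more concrete, here is a minimal sketch of how harvested records could be read. It assumes the remote database exposes OAI-PMH with Dublin Core (`oai_dc`) records. The protocol above does not name a specific harvesting protocol, so OAI-PMH is only an assumption here. The sample response, the function name `parse_list_records` and the `example.org` identifiers are made up for illustration; a real harvester would fetch the XML over HTTP and handle paging.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written ListRecords response (not from a real repository).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:dataset-1</identifier>
        <datestamp>2020-02-05</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example temperature profile</dc:title>
          <dc:identifier>https://example.org/datasets/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_list_records(xml_text):
    """Turn a ListRecords response into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        header = record.find("oai:header", NS)
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if header is None or dc is None:
            continue  # e.g. deleted records, which carry no metadata
        records.append({
            "oai_identifier": header.findtext("oai:identifier", default="", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", default="", namespaces=NS),
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            # dc:identifier usually holds the link back to where the data live
            "links": [e.text for e in dc.findall("dc:identifier", NS) if e.text],
        })
    return records


for entry in parse_list_records(SAMPLE_RESPONSE):
    print(entry["title"], entry["links"])
```

In this sketch only the metadata are copied; the `links` entry points back to the database where the data are actually stored, which matches the idea of decentralized storage described above.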

joerg-halo commented 3 years ago

@d70-t We didn't quite understand what you mean by "appear automatically". In my understanding, I start a search and get results from the other database listed. New datasets will show up in my results when I search again. I am not sure whether you have another mechanism in mind.

d70-t commented 3 years ago

I didn't think about it for very long, and in particular I didn't think about it in terms of a technical solution. I'll try to write a bit more about what's in my head, hoping that this helps your discussion:

As a user, I'd expect that I can search for things like `campaign == EUREC4A && quantity == temperature` or `region ~ antarctica && thickness of ice && date older than 1900` or the like. I wouldn't want to care about where the dataset is stored, or whether I put my search into duckduckgo, ecosia, google or bing (those are meant as aliases for HALO-DB, PANGAEA, Aeris etc...). I also wouldn't want to care about what the search results and / or the data returned by the search engine look like (i.e. being redirected to different looking landing pages or to differently formatted datasets is not a good thing). And I wouldn't want to contact all the search engine manufacturers to include my shiny new data once I've uploaded it. Thus what I'd really want as a producer is to put my data somewhere, and someone else should be able to find the data from somewhere else without me knowing what somewhere else is, and vice-versa for the consumer.
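
A rough sketch of what such a search could look like from the outside, assuming every catalogue can hand out its entries in one common record shape. The names `DatasetRecord`, `matches` and `federated_search`, the fields, and the `doi:example/...` identifiers are all hypothetical and only serve to illustrate the idea that the user neither sees nor cares which catalogue a hit came from:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical common record shape shared by all catalogues."""
    identifier: str
    campaign: str
    quantity: str


def matches(record, **criteria):
    """True if every given field equals the requested value (case-insensitive)."""
    return all(
        getattr(record, field).lower() == str(value).lower()
        for field, value in criteria.items()
    )


def federated_search(catalogues, **criteria):
    """Search all catalogues; a dataset listed in several of them is returned once."""
    seen = set()
    hits = []
    for records in catalogues.values():
        for record in records:
            if record.identifier in seen:
                continue
            if matches(record, **criteria):
                seen.add(record.identifier)
                hits.append(record)
    return hits


# Made-up catalogue contents; the same dataset is listed in two places.
catalogues = {
    "HALO-DB": [
        DatasetRecord("doi:example/1", "EUREC4A", "temperature"),
        DatasetRecord("doi:example/2", "EUREC4A", "humidity"),
    ],
    "other-db": [
        DatasetRecord("doi:example/1", "EUREC4A", "temperature"),
        DatasetRecord("doi:example/3", "EUREC4A", "temperature"),
    ],
}

for hit in federated_search(catalogues, campaign="EUREC4A", quantity="temperature"):
    print(hit.identifier)
```

The sketch only covers exact matches on two fields; fuzzy criteria such as `region ~ antarctica` would need a richer query model.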

It might well be that this is just exactly what you discussed.

Some technical ideas:

I don't think that the thing which stores datasets and makes them available to others and the thing which provides search / index / data discovery features have to be the same thing. I increasingly think that it is bad design if they are strongly tied together. However, it might be that the HALO-DB is something which does both: store and index (1).

The first thing from a user perspective (relevant to something like the HALO-DB) which I assume will happen in some form is that a dataset creator will make a new dataset available to the store component. This may be through an upload form, but it could also be something else. We've talked from time to time about CF-Conventions (you mentioned them above as well) and netCDF datasets. I think that CF-Conventions are tied closely to the ideas behind netCDF and to quote the first line of the netCDF User's Guide:

The purpose of the Network Common Data Form (netCDF) interface is to allow you to create, access, and share array-oriented data in a form that is self-describing and portable.

Thus I assume that any dataset which we'd like to talk about and which should be retrieved from a store component should be self-describing and portable.

Accordingly, a reasonable implementation of an index component seems to be one that looks at a store component from time to time and automatically creates some form of index from all the information present in those stores. Also, it shouldn't matter where the dataset is actually stored, as the store is separate from the index anyway. Thus, if there is a clear separation between the storing and the indexing component, it just doesn't matter if the HALO-DB indexing component looks only at stuff which is stored in the HALO-DB storing component or if it also looks at different places. And if that is the case, data which is placed elsewhere will show up in HALO-DB, and data which shows up in HALO-DB will look the same, regardless of where it actually comes from.
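
A minimal sketch of this separation, just to illustrate the idea and not meant as a design proposal. The classes `InMemoryStore` and `Index` and the attribute names other than `Conventions` are hypothetical. The stores only hand out the global attributes of their self-describing datasets, and the index rebuilds itself from whatever stores it has been told about. The check on the `Conventions` attribute is a very crude stand-in for a real CF compliance check:

```python
class InMemoryStore:
    """Hypothetical store component: maps dataset IDs to their global attributes."""

    def __init__(self, name):
        self.name = name
        self._datasets = {}

    def put(self, dataset_id, attributes):
        self._datasets[dataset_id] = dict(attributes)

    def list_datasets(self):
        # Return a copy so that the index cannot modify the store.
        return dict(self._datasets)


class Index:
    """Hypothetical index component: knows some stores, owns no data itself."""

    def __init__(self, stores):
        self.stores = list(stores)
        self.entries = {}

    def refresh(self):
        """Rebuild the index from scratch by looking at every known store."""
        entries = {}
        for store in self.stores:
            for dataset_id, attrs in store.list_datasets().items():
                # Crude stand-in for a CF check: the attribute must mention CF.
                if "CF-" not in str(attrs.get("Conventions", "")):
                    continue
                entries[(store.name, dataset_id)] = attrs
        self.entries = entries

    def search(self, **criteria):
        """Return (store name, dataset ID) pairs whose attributes match exactly."""
        return sorted(
            key
            for key, attrs in self.entries.items()
            if all(attrs.get(field) == value for field, value in criteria.items())
        )


halo_store = InMemoryStore("halo-db")
other_store = InMemoryStore("elsewhere")
index = Index([halo_store, other_store])

halo_store.put("flight-01", {"Conventions": "CF-1.8", "campaign": "EUREC4A"})
index.refresh()
print(index.search(campaign="EUREC4A"))

# A dataset uploaded to a different store shows up after the next refresh,
# without anybody having to register it with the index.
other_store.put("ship-07", {"Conventions": "CF-1.8", "campaign": "EUREC4A"})
index.refresh()
print(index.search(campaign="EUREC4A"))
```

In this sketch, "from time to time" is simply a call to `refresh()`; how often that happens, and whether stores could notify the index instead, is left open.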

(1): I tend to think of a third separate thing, the display component (i.e. HTML output, maps, plots etc...) which could also be part of the HALO-DB.

HALO DB searches on other databases #53