mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/dev
GNU General Public License v3.0

Mirroring datasets on Source Cooperative? #186

Open cboettig opened 1 year ago

cboettig commented 1 year ago

Hi friends, have you seen the recently launched beta for Source Cooperative from the non-profit group RadiantEarth?

I believe it would be possible to mirror large public-domain spatial datasets that are currently available only from low-bandwidth providers (like Zenodo or various university servers) on Source Cooperative repositories, which are backed by AWS S3 buckets, with no charges for storage or egress. (Source Coop is led by Jed Sundwall, who until recently ran the public data program for AWS.) Mirroring tasks could potentially be automated as cron jobs and would open the door to other post-processing, like converting data to COG (Cloud-Optimized GeoTIFF); similar to what Planetary Computer does, but on an open, distributed platform.
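To make the idea concrete, here is a minimal sketch of such a mirroring job in R. The source URL and bucket are hypothetical, and it assumes GDAL >= 3.1 (for the COG driver) plus configured AWS credentials:

```r
# Hypothetical mirroring job: fetch an upstream file, convert it to a
# Cloud-Optimized GeoTIFF, and push it to a Source Cooperative bucket.
library(sf) # gdal_utils() wraps the GDAL command-line utilities

src <- "https://example.org/some_dataset.tif" # placeholder upstream URL
tmp <- tempfile(fileext = ".tif")
download.file(src, tmp, mode = "wb")

# Convert to COG (requires GDAL >= 3.1 for the COG driver)
cog <- tempfile(fileext = ".tif")
gdal_utils(
  util = "translate",
  source = tmp,
  destination = cog,
  options = c("-of", "COG", "-co", "COMPRESS=DEFLATE")
)

# Upload with the AWS CLI; credentials and bucket name are assumed
system2("aws", c("s3", "cp", cog, "s3://example-bucket/some_dataset.tif"))
```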

Mirroring datasets in this way may be way out of scope for the mapme initiative, but I thought it might be an interesting resource to be aware of if you hadn't seen it.

fBedecarrats commented 1 year ago

This is very interesting, thank you.

goergen95 commented 1 year ago

Thanks, Carl, for suggesting this. It could also benefit users behind firewalls if we were to compile a common data source. CC'ing @Jo-Schie, since we were discussing this. Maybe Source Cooperative is a viable solution? As mentioned, we would be restricted to public-domain datasets or other licenses that allow redistribution of the data. From a technical perspective, it could also be very interesting, because we could opt for cloud-native data formats, thus potentially making the get_resources() call optional, if not obsolete, in this scenario.
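A rough sketch of what that could look like, assuming the data were mirrored as COGs (the bucket path below is hypothetical):

```r
# Sketch of the "get_resources() becomes optional" idea: read a
# cloud-native raster lazily over HTTP instead of downloading it first.
library(terra)

r <- rast("/vsicurl/https://data.source.coop/example/treecover2000.tif")

# Only the blocks overlapping the area of interest are fetched
aoi <- ext(10, 11, 0, 1)
r_aoi <- crop(r, aoi)
```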

Jo-Schie commented 1 year ago

Hi all, and thanks for this information. Really great. I discussed it with @goergen95, and it seems to us that we could probably use the existing framework to create a data dump to the mentioned source fairly easily, basically by creating one large portfolio object (the world) and downloading all datasets regularly.

We would probably have to agree on a temporal resolution for the update periods, and also do some backend work to save the data in the required cloud-native formats (see the sketch below). Then, of course, one would also need to contact all the data providers and ask them for permission to mirror their data resources, probably with some usage agreement to be on the safe side.
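As a very rough sketch of the "world portfolio" idea (the resource constructor shown is indicative only; exact function signatures differ between package versions):

```r
# Rough sketch only: build a single global polygon as the portfolio and
# fetch resources for it. A scheduled job would loop over all supported
# resources and post-process them to cloud-native formats.
library(sf)
library(mapme.biodiversity)

world <- st_as_sf(st_as_sfc(st_bbox(
  c(xmin = -180, ymin = -90, xmax = 180, ymax = 90),
  crs = st_crs(4326)
)))

# One call per resource; get_gfw_treecover() stands in for whichever
# resource constructors the package exposes
world <- get_resources(world, get_gfw_treecover())
```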

As @goergen95 rightly mentioned, this would also ease the challenge of strict firewall rules in the corporate context. IT departments would basically need to unblock only one trusted source, and they would not face all the certificate issues and weird DNS sources that appear only partly trustworthy.

I am currently drafting the project proposal to extend the framework in 2024 and 2025, and I will include these aspects.

fBedecarrats commented 1 year ago

Great! Note that a major difficulty might be the legal aspect. Some sources might not authorize others to re-distribute their data. The project proposal should include a legal component to assess this aspect for every source.

Jo-Schie commented 1 year ago

> Great! Note that a major difficulty might be the legal aspect. Some sources might not authorize others to re-distribute their data. The project proposal should include a legal component to assess this aspect for every source.

Yeah. And the research needs to be done beforehand, in order to avoid investing in something that turns out not to be feasible in the end... for legal reasons... open data not so open after all. That would be a pity.

cboettig commented 1 year ago

Great discussion here, sounds very exciting. Definitely agree about the legal concerns. Many datasets are already clearly marked in the public domain or under reusable licenses, but many are not.

A natural way to document all available resources in mapme, including the license each resource is listed under, would be to use a simple STAC collection JSON file. Note that this JSON file could be placed on Source Cooperative regardless of whether the data itself is also uploaded there. (As you probably know, STAC files can use an href field to point to where the actual file assets live, which could be on Source Coop where permitted, on the website of an upstream provider where it is not, or both.)
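For illustration, here is a minimal STAC-style asset entry built in R; all field values are made up and do not describe an actual mapme resource:

```r
# Illustrative only: one resource entry with its license and an href
# that can point to a mirror or to the upstream provider.
library(jsonlite)

entry <- list(
  id = "gfw_treecover",
  license = "CC-BY-4.0", # license the resource is listed under
  assets = list(
    data = list(
      # href points to Source Coop where mirroring is permitted,
      # or to the upstream provider where it is not
      href = "https://data.source.coop/example/gfw_treecover.tif",
      type = "image/tiff; application=geotiff; profile=cloud-optimized"
    )
  )
)
cat(toJSON(entry, pretty = TRUE, auto_unbox = TRUE))
```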

At the risk of making this even more ambitious, it may be possible to work with some of the upstream providers and convince their teams to distribute data through Source Cooperative directly. While it would probably be a very new toolset for many of them, the process is also relatively simple and streamlined. I am speculating here, but Source Cooperative may be able to help navigate the process of onboarding such providers: its founder launched and ran the AWS open data platform for over a decade, so he has presumably done this kind of thing before.

fBedecarrats commented 1 year ago

@Jo-Schie what do you think? This seems very ambitious and beyond the scope of Mapme, but desirable anyway? Would it be appropriate to join forces with other partners to push this agenda? Besides the ideological/strategic arguments for opening data on such a cooperative platform, what would be the practical benefits of doing so?

Jo-Schie commented 1 year ago

Hi both. Yeah, this is a very valid point and I thought of it as well. Nevertheless, I am wary of the dependency on others and the loss of flexibility. Many data providers will probably not have enough time or resources to change their publication scheme; especially in public institutions, the situation with IT resources might be dire. So even if we could convince them, it would probably take a long time until it is really implemented.

I think, nevertheless, that the whole process of us contacting them to use their data (and explaining to them how and why) can already be a good starting point to sensitize the data providers. So maybe we can try a dual strategy: work with the data as it is in the short term, and convince providers in the long term to join the platform.

Btw, as with cloud-native formats, it would also be great if the data providers served their data to the public via OGC standards (WMS/WFS/WMTS). This would allow building useful webmaps in reports (e.g. with leaflet) that display the original input data to the user (see the sketch below). So far, of our resources, only WRI (Global Forest Watch) offers this. It's a similar topic.
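Something like this, with a placeholder WMS endpoint and layer name (not an actual GFW service):

```r
# Sketch of a report webmap pulling a WMS layer served by a provider.
library(leaflet)

leaflet() |>
  addTiles() |>
  addWMSTiles(
    baseUrl = "https://example.org/geoserver/ows", # placeholder endpoint
    layers = "workspace:treecover",                # placeholder layer
    options = WMSTileOptions(format = "image/png", transparent = TRUE)
  )
```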

goergen95 commented 4 months ago

Just wanted to bump this conversation because, with v0.8.0 released, this is now technically much more feasible than when the discussion started. We now also include an overview of the licenses of the resources we support in the README.

I also very much like the suggestion of referencing resources in a STAC catalog; however, doing so would yet again require re-working how we fetch resources (though with relatively low complexity). As a practical example of what that would look like, here is the rendered version of a catalog referencing datasets from diverse sources, with the JSON found here (thanks @cboettig, this is really an interesting resource!).
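Very roughly, resource fetching could then resolve asset URLs from the catalog instead of hard-coding them; a sketch with a placeholder catalog URL:

```r
# Sketch: read a static STAC catalog and collect the hrefs of its child
# links; a resource function could resolve the asset it needs from there.
library(jsonlite)

catalog <- fromJSON(
  "https://example.org/stac/catalog.json", # placeholder location
  simplifyVector = FALSE
)

children <- Filter(function(l) identical(l$rel, "child"), catalog$links)
hrefs <- vapply(children, function(l) l$href, character(1))
```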