microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License
2.35k stars 300 forks

Redistribute datasets and models on Hugging Face #1073

Open adamjstewart opened 1 year ago

adamjstewart commented 1 year ago

Summary

We should consider redistributing as many datasets and pre-trained models as we can on Hugging Face.

Rationale

Hugging Face provides a more reliable centralized repository for storing large binary files. It's a large company, so we don't have to worry about expired SSL certificates or servers going offline. We have full control over the files we upload, so we can make modifications (license permitting) to fix inconsistencies between model architectures.

It also provides significantly faster download speeds compared to similar sites. For example, for our ResNet-50 pre-trained weights (~100 MB):

For the EuroSAT dataset (~2 GB):

Implementation

First, we need to ensure that the dataset or model we are redistributing has a license that permits redistribution. If a license is missing or does not permit redistribution, we should reach out to the authors to see if a permissive license can be granted.

Once licensing is settled, we just need to upload the dataset or model to Hugging Face. The license chosen should match the original license. Any modifications from the original should be clearly documented, and a link should be added to the original source. This is required by many licenses, and is just a good idea to document in general.
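One way to keep the license, the link to the original source, and any modifications documented together is to generate the repository's README.md (the dataset or model card) from a small helper. The sketch below is only an illustration: `build_dataset_card`, the example name, and the example URL are hypothetical, and it assumes the card begins with a YAML metadata block carrying a `license:` field.

```python
def build_dataset_card(name, license_id, source_url, modifications):
    """Return README.md text for a redistributed dataset (hypothetical helper)."""
    lines = [
        "---",
        f"license: {license_id}",  # should match the original license
        "---",
        "",
        f"# {name}",
        "",
        f"Redistributed from the original source: {source_url}",
        "",
        "## Modifications",
        "",
    ]
    if modifications:
        # One bullet per documented change from the original release.
        lines.extend(f"- {m}" for m in modifications)
    else:
        lines.append("None. Files are identical to the original release.")
    return "\n".join(lines) + "\n"


# Placeholder values, not a real dataset.
card = build_dataset_card(
    name="Example Dataset",
    license_id="cc-by-4.0",
    source_url="https://example.com/original-dataset",
    modifications=["Repackaged as a single zip archive."],
)
print(card)
```

Writing the card from code makes it harder to forget the attribution or the list of changes when several people upload files.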

Finally, the URL (and possibly MD5) in TorchGeo should be updated to point to the new download location.
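As a rough sketch of this last step, the snippet below builds a direct download URL for a file hosted on Hugging Face and computes the MD5 of a local copy. The helper names (`hf_resolve_url`, `md5_of_file`) and the repository ID are hypothetical, and the helper only handles dataset and model repositories; it assumes the usual `resolve/<revision>/<filename>` URL layout.

```python
import hashlib


def hf_resolve_url(repo_id, filename, repo_type="dataset", revision="main"):
    """Build a direct download URL for a file in a Hugging Face repository."""
    # Dataset repositories live under a "datasets/" prefix; model repositories do not.
    prefix = "datasets/" if repo_type == "dataset" else ""
    return f"https://huggingface.co/{prefix}{repo_id}/resolve/{revision}/{filename}"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Placeholder repository and file name, not a real upload.
url = hf_resolve_url("example-org/example-dataset", "data.zip")
print(url)
```

Passing a commit hash as `revision` instead of `main` would keep the URL pointing at the exact file the checksum was computed for, even if the repository is updated later.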

Alternatives

We previously used Zenodo for this but download speeds were abysmal. A quick survey of UIUC AI PhD students found that everyone uses Hugging Face 🤗

Additional information

We already have quite a lot of datasets, and dataset authors are often unresponsive to these kinds of inquiries. It's likely unrealistic to expect that we'll be able to redistribute every dataset and model, so I won't start a checklist just yet. High priority datasets and models include:

Again, we have to check the license first. Many datasets cannot be downloaded automatically for legal reasons.

adamjstewart commented 1 year ago

Starting a work-in-progress list so that multiple people don't contact the same person.

Datasets

In-progress

| Source | License | Reason |
| --- | --- | --- |
| USAVars | Not sure yet | Slow and failing download |

Completed

| Source | License | Reason |
| --- | --- | --- |
| EuroSAT | EU Law | Expired SSL certificate |
| UC Merced | public domain | HTTP-only |

Models

In-progress

| Source | License | Reason |
| --- | --- | --- |

Completed

| Source | License | Reason |
| --- | --- | --- |
| Zhu Lab | CC-BY-4.0 | Required modifications |
| ServiceNow | Apache-2.0 | Required modifications |

calebrob6 commented 1 year ago

I think DynamicEarthNet is re-distributable (based on a conversation with @lukaskondmann)

lukaskondmann commented 1 year ago

This is correct. DynamicEarthNet is available under this license so redistribution is possible as long as attribution is given

calebrob6 commented 1 year ago

So2Sat is okay to be mirrored based on https://github.com/microsoft/torchgeo/issues/388.

adamjstewart commented 1 year ago

From email conversations, OSCD and HRSCD both have CCA licenses which freely allow redistribution.

ReforesTree may require permission from the authors. They have a shared data agreement with WWF. They were able to redistribute on Zenodo, but we should check back with them to see if we can redistribute on Hugging Face.

nilsleh commented 1 year ago

@calebrob6 I would like to redistribute the USAVars dataset if possible, because the download is super slow and has failed several times. However, I am not sure what the actual source of this dataset is, since it is only a reproduction. I saw that you had a repo about the paper, so I'm wondering if you know something about the source and license of the torchgeo USAVars dataset?

calebrob6 commented 1 year ago

Hey @nilsleh, yes, I helped create that dataset. We should definitely move it to HuggingFace. @estherrolf is soon going to make changes to the dataset so perhaps we can do that all together.

yeelauren commented 1 year ago

I would also like to +1 this. I've been having a ton of issues accessing model weights and files from Radiant Earth and I suspect they are no longer actively maintaining their endpoints.

adamjstewart commented 1 year ago

Hugging Face has a maximum individual file size of 50 GB 😢
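Before uploading, it may help to check which files in a local copy would exceed that limit. A minimal sketch, assuming the 50 GB figure above means decimal gigabytes; `files_over_limit` and `MAX_FILE_BYTES` are hypothetical names, not part of TorchGeo or any Hugging Face library:

```python
import os

MAX_FILE_BYTES = 50 * 10**9  # 50 GB limit quoted above; decimal gigabytes assumed


def files_over_limit(root, limit=MAX_FILE_BYTES):
    """Return a sorted list of (path, size_in_bytes) for files larger than limit."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    return sorted(oversized)
```

Any file this reports would need to be split or repackaged into smaller archives before upload, which counts as a modification that should be documented.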

kbgg commented 1 year ago

> I would also like to +1 this. I've been having a ton of issues accessing model weights and files from Radiant Earth and I suspect they are no longer actively maintaining their endpoints.

We're aware of these problems. They stem from a combination of causes, ranging from architectural limitations to issues with Azure blob storage that haven't been resolved yet. We're working on an updated version of MLHub that resolves them, and it will be available in the near future.

nilsleh commented 1 year ago

With #1240 merged, can we move the USAVars dataset to HF? Because at the moment the download keeps failing through torchgeo. I still have the dataset locally, so I could upload it to HF and open a PR to change the download links :) @calebrob6, @estherrolf

adamjstewart commented 4 months ago

USAVars is CC-BY-4.0, so yes, we can redistribute on HF if you want.