stark-t / PAI

Pollination_Artificial_Intelligence

Data release for full reproducibility #29

Closed by valentinitnelav 1 year ago

valentinitnelav commented 2 years ago

I would like to make our paper as reproducible as possible, and that also means offering easy access to the data. I am often frustrated when I see papers with code, but no data :D

There are two ways we can do this:

1) Store our data on CERN's ZENODO. Their General Policies state: "Volume and size limitations: Total files size limit per record is 50GB. Higher quotas can be requested and granted on a case-by-case basis." So that is enough space for us. The P1 dataset will be at most 12 GB when we include the Syrphid dataset for testing. I would also suggest providing, for each image, its download URL from GBIF and its license. I can take care of that.

2) Provide only the GBIF URLs that I used for downloading, in a file that links each image file name to the URL from which it can be downloaded. In this way, we might avoid any licensing issues around redistributing data, even though I tried to use only permissive Creative Commons licenses. However, I can never be 100% sure about such legal stuff. While writing this, I am starting to like this idea more. We can provide a script that downloads the images from each URL.

One risk with approach 2) is that image URLs might change, and then things break in the future. As far as I understand, GBIF is not responsible for image storage; they only store ecological metadata, including URLs to images. Images are stored by organizations like iNaturalist and Observation.org that upload curated metadata to GBIF. Each of them can decide to change how they store image data and therefore change or break URLs. Perhaps this is not a big problem and the URLs are "permanent"; just thinking out loud for now.
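
A minimal sketch of what the download script for approach 2) could look like, assuming an index CSV with two columns named `file_name` and `url` (both column names, and the function names `fetch_url` and `download_images`, are hypothetical and not taken from the repository). It records every URL that fails instead of stopping, so changed or broken URLs show up in a list at the end:

```python
import csv
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=30):
    """Return the raw bytes behind `url` using a plain HTTP GET."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(index_csv, out_dir, fetch=fetch_url):
    """Download every image listed in `index_csv` into `out_dir`.

    `index_csv` is assumed to have the (hypothetical) columns
    `file_name` and `url`. Returns a list of (file_name, url, error)
    tuples, one for each download that failed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    with open(index_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            target = out_dir / row["file_name"]
            if target.exists():
                continue  # already downloaded, so reruns only fetch what is missing
            try:
                data = fetch(row["url"])
            except Exception as error:  # e.g. broken or moved URL, timeout
                failed.append((row["file_name"], row["url"], str(error)))
                continue
            target.write_bytes(data)
    return failed
```

Calling `download_images("image_urls.csv", "images")` (file names made up here) would fill the `images` folder and return the entries that could not be fetched. The `fetch` argument exists only so the function can be tried out without network access.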

stark-t commented 2 years ago

@valentinitnelav I really like the idea of having the dataset directly available if that is possible :) I'm not sure how concerned we should be regarding the image licenses...

valentinitnelav commented 2 years ago

Ok, then we do that. I think I feel more comfortable with also uploading a table that gives for each image the license info and its URL as well. So, we go with ZENODO then as it is more direct for downloading and also offers data permanency.

valentinitnelav commented 1 year ago

An extra thought - we could also release it on Kaggle (on top of having it on ZENODO). There are not many datasets regarding insects on Kaggle, and just one specific for pollinators:

valentinitnelav commented 1 year ago

Final decision: we will not redistribute the images due to legal concerns. We will provide the URLs with the metadata (license type, taxa info, publisher, author if available, etc.). We will also not share the weights for the moment.
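
For anyone reusing the URL table, a small sketch of how the license information could be checked before downloading. It assumes the metadata is shipped as a CSV with a column named `license`; that column name and the function name `license_summary` are assumptions for illustration, not the actual layout of the released file:

```python
import csv
from collections import Counter


def license_summary(metadata_csv):
    """Count how many images fall under each license type.

    Assumes a (hypothetical) `license` column in the metadata CSV.
    """
    with open(metadata_csv, newline="", encoding="utf-8") as handle:
        return Counter(row["license"] for row in csv.DictReader(handle))
```

The returned counter maps each license string to the number of images carrying it, which gives a quick overview of which licenses apply before any image is fetched.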