
ukri-excalibur / excalibur-tests

Performance benchmarks and regression tests for the ExCALIBUR project

https://ukri-excalibur.github.io/excalibur-tests/

Apache License 2.0


large input data #113

Closed TomMelt closed 1 year ago

TomMelt commented 1 year ago

Does anyone already have a plan for centralized input data? I had a look at the open and closed issues and couldn't see anything.

The Ramses code requires ~4 GB of inputs.

As I see it we have a couple of options:

Git LFS
Central server which we can wget files from

For now I am downloading the input data manually, but that is not ideal going forward. Is Ramses the only code with this issue?

TomMelt commented 1 year ago

I guess I don't have permission to add labels to issues. If someone could add a UoL / Leicester label, I'd appreciate that.

giordano commented 1 year ago

For large input datasets, so far we've been relying on being able to download them from the internet. See, for example, https://github.com/ukri-excalibur/excalibur-tests/blob/d9dd093296aa7c4a6fd96ec152cb2edba7ffa264/apps/openmm/openmm_rfm.py#L20-L34 in the recent PR #115, or https://github.com/ukri-excalibur/excalibur-tests/blob/1d45e360e15e46b09f24a02e011835fa00cda8a5/apps/wrf/wrf.py#L100-L113 in the WRF benchmark.
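To illustrate the general idea of fetching inputs at run time, here is a minimal sketch using only the Python standard library. It is not the code from the linked files: the helper names `fetch_input` and `sha256sum`, the `.part` temporary suffix, and the use of a SHA-256 checksum are all assumptions made for this example. It skips the download when a verified copy already exists and rejects files whose checksum does not match.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_input(url, dest, expected_sha256):
    """Download url to dest unless a verified copy is already there.

    Hypothetical helper for illustration; not part of excalibur-tests.
    """
    dest = Path(dest)
    if dest.exists() and sha256sum(dest) == expected_sha256:
        return dest  # cached copy is intact, so skip the download

    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")  # avoid leaving a partial file at dest
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)  # stream to disk, no full read into memory

    actual = sha256sum(tmp)
    if actual != expected_sha256:
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {url}: got {actual}")
    tmp.replace(dest)
    return dest
```

Streaming to a temporary file matters for multi-gigabyte inputs: an interrupted download never leaves a truncated file where the benchmark expects its data, and the checksum guards against a silently changed upstream file.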

TomMelt commented 1 year ago

The issue is that the current location (University of Edinburgh's DataSync service) requires signing in with a password. I was wondering whether ExCALIBUR would host a central server, exposed to the internet, where we could put data like this. I don't know of a sensible place for the input data that is publicly available to everyone.

I will have a chat with the team here in Leicester and see if there's anywhere we can put it for now that doesn't require password access.

giordano commented 1 year ago

I think Zenodo could be an option that doesn't require us to host any infrastructure, at least for datasets up to 50 GiB.
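As a rough sketch of how a benchmark could refer to a file hosted there, the snippet below builds a direct download link. The URL layout is an assumption based on how Zenodo record pages currently link their files, so check the download link on the real record page before relying on it. The helper name `zenodo_file_url`, the record ID, and the file name are all hypothetical.

```python
from urllib.parse import quote


def zenodo_file_url(record_id, filename):
    """Build a direct download URL for one file in a Zenodo record.

    Assumed URL layout; verify it against the record's own download link.
    """
    return f"https://zenodo.org/records/{record_id}/files/{quote(filename)}?download=1"


# Hypothetical record ID and file name, for illustration only.
print(zenodo_file_url(1234567, "ramses_inputs.tar.gz"))
```

A link of this form needs no login, so it can be passed to `wget` or any other download step in a test.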

ilectra commented 1 year ago

Input and output data storage is not within the scope of the current project. It's definitely something ExCALIBUR should deal with at some point, but not yet. As Mose said, we assume the benchmark code providers would also provide the data in some downloadable location. The solutions you propose make sense. Why would Ramses input data be behind a password anyway?

TomMelt commented 1 year ago

It doesn't need to be, but for some reason (before I joined) it was placed on University of Edinburgh's DataSync service. I want to move it off that system because the data doesn't need password protection, but I have nowhere else to put it.

I will look into using Zenodo. That will do for now. If we get an ExCALIBUR server in the future, we can move it there.

Thanks both