ioos/soundcoop

This repository contains Jupyter notebooks developed by the passive acoustic community for the SoundCoop project.

mybinder data storage limitation #9

Open · cparcerisas opened 2 months ago

cparcerisas commented 2 months ago

@carriecwall @carueda @danellecline @KarinaKh

So I have been playing around with options for dealing with the large amount of data we need for the pypam and environment notebooks. I have several proposals, but I got a bit stuck on some of them. I list them here:

  1. Use less data for pypam in the mybinder version - one month per station? - and let people download the full dataset when running it themselves. Pros: easy and already implemented. Cons: we don't get the full-year analysis, which is a bit of a pity for the pypam plots, but we'll survive. To make it even faster, the environment notebook and pypam's should use the same data.
  2. Use the postBuild script, which makes mybinder download the data BEFORE the image is created. That produces a HUGE image, but once loaded it should work. Pros: it's an elegant solution. Cons: I am not sure a 20 GB image will load for everyone. See branch pypam/binder_docs: https://github.com/ioos/soundcoop/tree/pypam/binder_docs. To make it less heavy, the environment notebook and pypam's should again use the same data. (A minimal postBuild sketch follows this list.)
  3. I tried downloading the files and removing them afterwards, both with a modified version of open_mfdataset and by removing all files once a certain number have been downloaded (and stored in memory). This should be a feasible solution, but for some reason my kernel keeps dying when the files are removed. Help here? (A sketch of this pattern is at the end of this comment.)
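For option 2, a minimal sketch of what the postBuild downloader could look like. repo2docker runs any executable file named postBuild at image-build time, so a Python shebang works as well as bash; the URL list and data directory here are placeholders, not the actual branch contents:

```python
#!/usr/bin/env python3
# postBuild: executed once by repo2docker while the Binder image is built,
# so the downloaded files end up baked into the image.
import pathlib
import urllib.request

DATA_DIR = pathlib.Path("data")  # placeholder destination directory
DATA_DIR.mkdir(exist_ok=True)

URLS = [
    # placeholder: one URL per HMD netCDF file to bake into the image
    # "https://example.org/station1/2021-01.nc",
]

for url in URLS:
    target = DATA_DIR / url.rsplit("/", 1)[-1]
    if not target.exists():  # idempotent across rebuilds
        urllib.request.urlretrieve(url, str(target))
```

The file has to be executable (chmod +x postBuild) and live at the repo root (or in binder/) next to the environment files.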

Let me know which solution you prefer, or if anyone has improvements/suggestions for any of the options.
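Roughly, the pattern I tried for option 3 looks like this (a sketch with illustrative names only - the URL list, the "psd" variable, and stream_stations are placeholders, not the notebook's actual code):

```python
import os
import tempfile

import requests
import xarray as xr

def stream_stations(urls):
    """Download one file at a time, keeping only the loaded data in memory."""
    pieces = []
    for url in urls:  # placeholder list of HMD netCDF URLs
        # download to a temporary file on disk
        with tempfile.NamedTemporaryFile(suffix=".nc", delete=False) as tmp:
            tmp.write(requests.get(url, timeout=120).content)
            path = tmp.name
        ds = xr.open_dataset(path)
        try:
            # .load() pulls the values into memory, so the result no
            # longer references the file on disk
            pieces.append(ds["psd"].load())  # "psd" is an assumed variable name
        finally:
            ds.close()       # release the handle BEFORE deleting the file
            os.remove(path)  # deleting while a dataset still points here can crash
    return xr.concat(pieces, dim="time")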

danellecline commented 2 months ago

@cparcerisas, my vote would be 1), provided we state the caveat that the reduced dataset is due to limitations in the binder environment.

Strategy 3) is ultimately the more robust one (no particular environment needed). The kernel may be dying because of memory limitations, but I need to understand the processing code in more detail to say for sure. Closing any open datasets will force the memory to be freed.
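Something like this is what I mean - a minimal sketch assuming xarray, where the mean-over-time reduction and the psutil check are just illustrative:

```python
import os

import psutil      # optional: check whether memory really is the culprit
import xarray as xr

def reduce_one_file(path):
    # The context manager guarantees ds.close() runs even if processing
    # raises, releasing the file handle and the netCDF-level caches.
    with xr.open_dataset(path) as ds:
        # .load() detaches the reduced result from the file on disk
        result = ds.mean(dim="time").load()
    rss_gb = psutil.Process().memory_info().rss / 1e9
    print(f"RSS after {os.path.basename(path)}: {rss_gb:.2f} GB")
    return result
```

If RSS keeps climbing across files even with the datasets closed, the leak is somewhere else in the processing code.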

cparcerisas commented 2 months ago

@danellecline yes, I played around with it; you can see it here: https://github.com/ioos/soundcoop/blob/pypam/binder_docs/2_analysis_of_HMD_pypam/data_analysis_with_pypam.ipynb (functions load_data_station_slow and load_data_station).