oceanflux-ghg / FluxEngine

Open source atmosphere-ocean gas flux data processing tools. Example uses: i) calculating global/regional gas fluxes and net integrated values using satellite Earth observation, model or in situ data (or any combination), ii) uncertainty analyses (e.g. ensemble runs, input data uncertainty, model uncertainty), iii) evaluating novel gas transfer parameterisations.

Question about dependencies #60

Closed rabernat closed 3 years ago

rabernat commented 3 years ago

Thanks for providing this very useful looking package. I applaud your contributions to open source software! 👏 👏

I'm interested in using this package, but before I install it, I want to understand its dependencies.

Currently the following dependencies are specified in setup.py

https://github.com/oceanflux-ghg/FluxEngine/blob/de872fbba69da575ee2c00bbba25814bc68f5b38/setup.py#L24-L31

I understand why numpy, scipy, pandas are here, since they are likely needed for the core computations. But why the following:

Thanks for considering my questions.

oceanflux-ghg commented 3 years ago

Hi Ryan,

It's only needed for the IPython tutorials, so it isn't needed unless you want to run the interactive tutorials.

Again, only needed for the tutorials.

We discovered early on that if you want to encourage different disciplines to use a common tool, it needs to handle the data formats they are familiar with. Many of the inputs and outputs are netCDF because the tool is designed to be used by modellers, in situ scientists and satellite Earth observation scientists (and any combination of these types of spatially and temporally varying data). NetCDF is a standard geospatial data format designed for cross-disciplinary data analysis and exchange, and it has the Climate and Forecast (CF) metadata conventions, which we have tried to follow (although we can't be completely compliant: many of our variable names are non-standard, because very few appropriate 'official' CF names exist for gas flux variables).

For in situ users, ASCII files can be used for the inputs (see the case studies in the Holding et al. paper), but the outputs are first stored as netCDF (and can be converted to ASCII if desired). So we use netCDF as the baseline data format. It also has the advantage of carrying metadata, which lets you easily trace the data back to its source: the metadata contains the complete configuration file and lists all inputs, so the results are completely traceable. Traceable outputs were another important requirement (e.g. for climate studies and analyses).

Hope this helps explain our rationale.

Best wishes, Jamie

--
Dr Jamie Shutler
Associate Professor of Earth observation
Centre for Geography and Environmental Science (CGES)
College of Life and Environmental Sciences
University of Exeter
Director of Postgraduate Research for CGES

rabernat commented 3 years ago

Thanks for the explanation. I opened #61 to remove the tutorial packages from the dependencies.
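
For readers unfamiliar with the pattern: one common way to keep tutorial-only packages optional is the `extras_require` argument of setuptools. The following is a minimal sketch of that idea, not the actual change in #61. The package name `example-package` and the entries under `tutorials` are placeholders, and listing `netCDF4` as a core requirement is an assumption based on the discussion above.

```python
# Hypothetical requirement split for a setup.py; not FluxEngine's real lists.

# Packages assumed to be needed for the core computations and netCDF output.
INSTALL_REQUIRES = ["numpy", "scipy", "pandas", "netCDF4"]

# Packages needed only to run interactive tutorials (placeholder names).
EXTRAS_REQUIRE = {"tutorials": ["jupyter", "matplotlib"]}

# In a real setup.py these would be passed on as
# setuptools.setup(**SETUP_KWARGS).
SETUP_KWARGS = {
    "name": "example-package",  # placeholder name
    "version": "0.0.0",
    "install_requires": INSTALL_REQUIRES,
    "extras_require": EXTRAS_REQUIRE,
}
```

With a layout like this, `pip install example-package` would pull in only the core list, while `pip install "example-package[tutorials]"` would add the tutorial extras.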

As for netCDF, I think I understand where you're coming from. I agree 💯 that netCDF and CF conventions are an amazing resource for the community. The best practice we have tried to advance in the Pangeo project is for I/O operations to live outside of domain-specific packages like this. Data can come in so many different flavors, and this is a rapidly changing area. In our guidelines we encourage packages to consume and produce in-memory data structures (numpy arrays, pandas dataframes, xarray datasets) rather than communicating via reading and writing files. This separation of concerns can lead to better interoperability between tools.

As an example of why this is useful, consider the 1 PB of CMIP6 data available in Google Cloud. This data is in Zarr format but can be opened directly with Xarray (without explicit download). How would I go about using FluxEngine with this data? Would I have to first convert it to a local NetCDF file? Or alternatively, would you implement Zarr and Google Cloud Storage support within FluxEngine? Neither is appealing. But if the package communicates via in-memory data structures rather than files, none of this matters. It's the user's job to read and write data.
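
To make the separation of concerns concrete, here is a minimal sketch of a computation that takes in-memory values and returns in-memory values. It is not FluxEngine's actual API: the function name `bulk_gas_flux`, its parameters and the numbers are hypothetical, and the simplified bulk formula F = k * K0 * (pCO2_water - pCO2_air) only stands in for the real calculation.

```python
def bulk_gas_flux(transfer_velocity, solubility, pco2_water, pco2_air):
    """Simplified bulk formula: F = k * K0 * (pCO2_water - pCO2_air).

    Only arithmetic operators are used, so the inputs can be floats,
    numpy arrays or xarray DataArrays. No file is read or written here.
    """
    return transfer_velocity * solubility * (pco2_water - pco2_air)


# The caller decides where the data comes from (netCDF, Zarr, CSV, ...)
# and passes in-memory values. Plain floats with arbitrary illustrative
# values are used here to keep the example dependency-free.
flux = bulk_gas_flux(
    transfer_velocity=2.0,
    solubility=0.5,
    pco2_water=380.0,
    pco2_air=400.0,
)
print(flux)
```

Because the function never touches a file, the same call would work on arrays opened lazily from cloud storage, and writing the result to netCDF (or anything else) stays the user's choice.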

Anyway, sorry for going off on a tangent, and thanks for listening to my rant. I don't expect you to make such fundamental changes to the package right now.