"Big Gridded Data": Distributed Cloud Storage for Physical Oceanography Data

mwengren commented 3 years ago

Project Description:

Storing highly-voluminous and highly-dimensional data has always presented challenges, and while hardware advancements have eased some of the burden, software remains the critical component in data management systems. This project will explore burgeoning solutions in the big-data realm to store massive volumes of highly-dimensional numeric data across distributed cloud platforms. Participants will examine tradeoffs between technologies and develop deeper understanding of how new data storage and access solutions may be implemented in the oceanography industry.

Expected Outcomes:

A software cost-benefit analysis of data storage and access scenarios.

Skills required:

Familiarity with Linux/UNIX operating systems and a working knowledge of Python, C/C++. Understanding basic database architecture is a plus.

Difficulty:

Moderately difficult

Mentor(s):

@daltonkell Dalton Kell (Software Engineer), @benjwadams Ben Adams (Software Engineer)

vkrm1612 commented 3 years ago

Can I work on this project?

Arnold2381 commented 3 years ago

Hello, sirs. I am a third-year Computer Science Engineering student. My skill set includes Python, Java, C, C++, Firebase, HTML, CSS, JS, Flask, and TensorFlow, and I have a basic understanding of databases (SQL). I am very well versed in Python; I am not very familiar with Linux yet, but I will start learning it. Please allow me to work on this; I would like to keep contributing to this project in the future too. Thanks in advance! @mwengren

harshshaw commented 3 years ago

Hi sir, I'm Harsh Shaw, a second-year undergraduate at SRM Chennai. For this project, I think GCP BigQuery would be beneficial, as it is capable of handling data at this scale, and integrating it with ML models on GCP shouldn't be much of an issue. Please let me know if I'm thinking in the right direction. Thank you! @mwengren

harshshaw commented 3 years ago

About me: https://www.linkedin.com/in/harsh-shaw-070105174

jarvis-001 commented 3 years ago

Hi @mwengren, I am looking to contribute to the project "Big Gridded Data": Distributed Cloud Storage for Physical Oceanography Data. I believe GCP BigQuery could be highly beneficial for large data loads, and it would make integration with ML models much easier. About me: I am a sophomore at the Indian Institute of Technology, Roorkee. I have a working familiarity with Linux (mainly the command line and the filesystem architecture) and a good knowledge of Python and C++. I also have a good understanding of databases, including SQL and MongoDB.
My other skills include machine learning frameworks such as PyTorch and TensorFlow, which might help with the models, plus a working knowledge of HTML, CSS, JS, React, Node, ExpressJS, and MongoDB (basically MERN for web development) and basic Flask, which could help in designing an interface if one is needed. Could you please guide me on the implementation of the larger model so that I can get started, and let me know whether there are any tasks I need to complete to join the team? I am eager and excited to work with the team on this wonderful project.

benjwadams commented 3 years ago

Hi all, please read up on some of the standards commonly in use by the earth science and oceanographic communities listed below:

NetCDF - A multidimensional file format that allows metadata to be attached to variables. NetCDF is one of the most common file formats and users generally will expect to be able to get NetCDF data back from a query https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_introduction.html

OPeNDAP - A protocol over HTTP which allows access of data. The protocol is implemented by a number of data servers including ERDDAP, THREDDS, and PyDAP and often ends up serving NetCDF files in some way. Here is the DAP2 standard https://www.opendap.org/pdf/ESE-RFC-004v1.1.pdf DAP4 is current: https://docs.opendap.org/index.php/DAP4:_Specification_Volume_1
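To make the DAP2 access pattern concrete, here is a minimal sketch of how a subset request is expressed as a URL: a response suffix (`.dods` is the binary data response) plus a constraint expression with one inclusive `[start:stride:stop]` range per dimension. The server address, dataset path, and variable name `sst` are hypothetical and used only for illustration; the helper `dap2_subset_url` is not part of any library.

```python
from urllib.parse import quote


def dap2_subset_url(base_url, variable, slices, response="dods"):
    """Build a DAP2 URL requesting a hyperslab of one variable.

    slices holds one (start, stride, stop) tuple per dimension.
    DAP2 index ranges are inclusive on both ends.
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
    )
    # Keep the bracket and colon characters readable in the constraint.
    constraint = quote(f"{variable}{hyperslab}", safe="[]:")
    return f"{base_url}.{response}?{constraint}"


# Hypothetical endpoint and variable, for illustration only.
url = dap2_subset_url(
    "https://example.org/thredds/dodsC/ocean/sst.nc",
    "sst",
    [(0, 1, 0), (100, 1, 199), (200, 1, 299)],
)
print(url)
```

The server does the subsetting, so the client only downloads the requested hyperslab rather than the whole file.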

I can think of at least a couple of directions that might be of interest to the broader scientific community. There has been considerable interest in reasonably performant access to data in cloud object stores through the aforementioned DAP protocols. Other efforts have looked at libraries that support distributed processing of numerical data, such as zarr. This article written by @rsignell-usgs details some efforts made on this front to read NetCDF4 and HDF5 data using zarr: https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314
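One reason chunked layouts suit object stores is that a subset request maps onto a small set of independently addressable chunks. The sketch below computes which chunk keys a selection overlaps, assuming the Zarr v2 convention of naming chunks by their grid indices joined with "." (for example `0.1.2`). The function name and the array and chunk sizes are made up for illustration.

```python
from itertools import product


def chunk_keys_for_selection(selection, chunks):
    """Return the Zarr v2-style chunk keys that a selection overlaps.

    selection: one half-open (start, stop) index range per dimension.
    chunks: chunk length along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunks):
        first = start // size       # chunk holding the first index
        last = (stop - 1) // size   # chunk holding the last index
        per_dim.append(range(first, last + 1))
    return [".".join(str(i) for i in idx) for idx in product(*per_dim)]


# One time step and a 100 x 100 spatial window, with chunks of (1, 128, 128).
keys = chunk_keys_for_selection(
    [(0, 1), (100, 200), (200, 300)], chunks=(1, 128, 128)
)
print(keys)
```

Only those keys need to be fetched from the store, however large the full array is.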

Please don't hesitate to ask any questions for clarification of the underlying technologies and data.

harshshaw commented 3 years ago

Hi @benjwadams, is there a Slack or Discord channel where I can discuss this further with the mentors? I want to ask: when we already have ERDDAP, which is a common data server and handles formats as well, why are we trying to store the data in the cloud, since we can make a direct call to an ERDDAP server via OPeNDAP?

jarvis-001 commented 3 years ago

Hi @harshshaw, my understanding is that ERDDAP can't be used to directly access Zarr data, and that the NetCDF and HDF5 formats do not readily allow multiprocessing/parallel processing (it can be done with MPI (Message Passing Interface), but that is extremely hard). If we could use Zarr directly, we could do heavy computation in the cloud and thus minimize operations on our local device. Parallel I/O on the local device would also let processes complete faster. By the way, @mwengren @benjwadams, is there a Slack or Discord channel where we can discuss further with the mentors?
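The parallelism argument can be sketched with the standard library alone: when every chunk is a separate object, workers can read and reduce chunks independently and the partial results are combined at the end. The in-memory `chunks` dict below is only a stand-in for an object store, and the keys and values are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for chunks held in an object store: key -> list of values.
chunks = {f"0.{i}": [float(i * 4 + j) for j in range(4)] for i in range(8)}


def chunk_sum_and_count(key):
    """Reduce a single chunk; in practice this would be one GET per chunk."""
    values = chunks[key]
    return sum(values), len(values)


# Each chunk is processed independently, so the work spreads across workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(chunk_sum_and_count, sorted(chunks)))

# Combine the per-chunk partial results into a global mean.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
print(mean)
```

Threads fit here because fetching chunks is I/O-bound; CPU-heavy reductions would more likely use processes or a distributed scheduler.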

jarvis-001 commented 3 years ago

@mwengren @benjwadams @daltonkell how should I proceed further?

benjwadams commented 3 years ago

Hi, I'm checking on the direction of prior work on Zarr and other technologies to determine an appropriate direction forward for this project.

jarvis-001 commented 3 years ago

OK Ben. I have now read all the resources you gave in this issue thoroughly and have a good idea of the theoretical concepts. Are there any more resources or concepts (theoretical or code) that we should read or understand to make contributing easier?

jarvis-001 commented 3 years ago

And @MathewBiddle, what technologies will be used in this project? I would like to start learning them to get a basic working idea of how they work and how they are used.

benjwadams commented 3 years ago

Possibly related and of interest: https://github.com/zarr-developers/community/issues/15

jarvis-001 commented 3 years ago

OK sure Ben, I'll check this out.

MathewBiddle commented 3 years ago

@jarvis-001 It looks like @benjwadams hit the nail on the head wrt technologies with this comment and this comment. Take a look at those and see if you have any questions.

jarvis-001 commented 3 years ago

@MathewBiddle, pardon, but I didn't get which comments you are referring to. The links just point to the initial project description written by @mwengren.

MathewBiddle commented 3 years ago

Sorry https://github.com/ioos/gsoc/issues/5#issuecomment-806907067 and https://github.com/ioos/gsoc/issues/5#issuecomment-813643856.

jarvis-001 commented 3 years ago

Hi @MathewBiddle, I went through the first source and am going through the second. I am really sorry I couldn't reply earlier; my mid-semester examinations are going on and will be completed by 9. Until then I will get through these too and understand them thoroughly.

jarvis-001 commented 3 years ago

Hi @MathewBiddle, I went through the second source too and got a good idea about Zarr, N5, and related formats, but I am confused about what we are trying to do: are we trying to replace HDF5 with Zarr or some similar technology for storing our data? @MathewBiddle @benjwadams @mwengren @daltonkell, can we have a real-time meeting so that I can ask all my questions about the project? That would help greatly in writing the proposal and in contributing further.

jarvis-001 commented 3 years ago

A meeting of that sort would speed up the work considerably and give good clarity on how to proceed. I also have some questions about a few points in the GSoC proposal that could be cleared up in the meeting. I have no time constraints (except 9 AM to 12 AM IST (Indian Standard Time), when I will be having my examination).

benjwadams commented 3 years ago

@jarvis-001, we are looking for modern technologies which can integrate well with distributed storage, such as cloud object stores (e.g. Amazon S3), for storing NetCDF-like data. If we can also get such a backend integrated into data servers such as THREDDS or ERDDAP, that would be a plus, but it is not necessary in the original scope of work.
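As a rough picture of what "NetCDF-like data in an object store" can look like, the sketch below lays out one variable as JSON metadata objects plus one binary object per chunk, loosely following the Zarr v2 naming (`.zarray`, `.zattrs`, chunk keys such as `0.0`). A plain dict stands in for a bucket, `write_array` is a hypothetical helper, and the variable name and attributes are invented; this is not a conforming Zarr writer.

```python
import json
import struct


def write_array(store, name, rows, chunk_rows, attrs):
    """Write a 2-D list of floats as metadata plus row-block chunk objects."""
    n_rows, n_cols = len(rows), len(rows[0])
    if n_rows % chunk_rows:
        # Assumption of this sketch: no partially filled edge chunks.
        raise ValueError("row count must be a multiple of chunk_rows")
    meta = {
        "zarr_format": 2,
        "shape": [n_rows, n_cols],
        "chunks": [chunk_rows, n_cols],
        "dtype": "<f8",
        "compressor": None,
        "fill_value": None,
        "filters": None,
        "order": "C",
    }
    store[f"{name}/.zarray"] = json.dumps(meta).encode()
    # NetCDF-style variable attributes travel as a separate JSON object.
    store[f"{name}/.zattrs"] = json.dumps(attrs).encode()
    for i in range(0, n_rows, chunk_rows):
        flat = [v for row in rows[i:i + chunk_rows] for v in row]
        key = f"{name}/{i // chunk_rows}.0"
        store[key] = struct.pack(f"<{len(flat)}d", *flat)


store = {}  # stand-in for a bucket: object key -> bytes
write_array(
    store,
    "sst",
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
    chunk_rows=2,
    attrs={"units": "degC", "long_name": "sea surface temperature"},
)
print(sorted(store))
```

Because every key is an ordinary object, the same layout works on S3-style stores without a filesystem or a single monolithic file.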

jarvis-001 commented 3 years ago

Oh, thanks @benjwadams, now I understand the whole problem statement and what we are actually trying to implement. But isn't s3-netCDF-python doing the same thing? And what exactly did you mean by modern technologies? Could you please elaborate if possible?

benjwadams commented 3 years ago

@jarvis-001, there exist bindings in netCDF-C. To my knowledge there currently aren't comparable bindings for Zarr formatted data for JVM based applications, although there is the start of such bindings here, for example: https://github.com/bcdev/jzarr . The community often uses THREDDS and ERDDAP to distribute data to end users, both of which are JVM based applications, which could benefit downstream from development of such bindings, or if said bindings are mature enough, integration of them into either JVM application.

jarvis-001 commented 3 years ago

Oh OK @benjwadams, I need a little more information about these points and will try to understand them completely today (it's 2 AM IST, so within the next 22 hours). I also found this while reading about the JVM Zarr implementation: https://jzarr.readthedocs.io/en/latest/. One more thing: since the data is currently stored in NetCDF format, are we trying to shift it to Zarr for benefits like multiprocessing?

rsignell-usgs commented 3 years ago

Just reading this thread for the first time. Is the GSoC task evolving into enhancing NetCDF-Java to read Zarr?

If so, it would be good to engage Unidata, as I think they've been working on this also (and they just released the NetCDF C library 4.8.0 with Zarr support!).

benjwadams commented 3 years ago

Just reading this thread for the first time. Is the GSoC task evolving into enhancing NetCDF-Java to read Zarr?

It certainly seems that this would be a desirable feature and would check a lot of the boxes for distributed data and cloud storage of data on services that are commonly used by the MetOcean community.

@rsignell-usgs, Do you know who we should direct further inquiries to at Unidata?

jarvis-001 commented 3 years ago

OK @benjwadams, I went through all the resources today. My understanding is this: Java and other languages that compile to Java bytecode and run on the JVM are primarily used for server-side code, but Zarr is Python-based, so we are trying to create Zarr bindings for JVM-based applications so that THREDDS and ERDDAP, which are JVM-based, can be used to store and distribute Zarr data. Is that exactly what we are trying to do, @benjwadams?

benjwadams commented 3 years ago

Your understanding seems good. Zarr is primarily Python-based these days, although the specification is available, so there is no reason why it couldn't be implemented in JVM based languages.
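To illustrate why the format is portable, here is a minimal sketch of reading one uncompressed chunk using only JSON parsing and byte decoding, following the Zarr v2 array metadata fields (`chunks`, `dtype`, `order`, `compressor`, `filters`). It is written in Python for brevity; the same steps would carry over to a JVM language. `decode_chunk` is a hypothetical helper, handles only a few dtypes, and skips compression entirely.

```python
import json
import struct

# Subset of Zarr v2 dtype codes mapped to struct format characters.
_STRUCT_CODES = {"f4": "f", "f8": "d", "i2": "h", "i4": "i", "i8": "q"}


def decode_chunk(zarray_json, chunk_bytes):
    """Decode one raw chunk into a flat, row-major list of values."""
    meta = json.loads(zarray_json)
    if meta.get("compressor") is not None or meta.get("filters"):
        raise NotImplementedError("this sketch only handles raw chunks")
    if meta["order"] != "C":
        raise NotImplementedError("this sketch only handles C order")
    byteorder, kind = meta["dtype"][0], meta["dtype"][1:]
    if byteorder not in ("<", ">") or kind not in _STRUCT_CODES:
        raise NotImplementedError(f"unsupported dtype {meta['dtype']}")
    n_items = 1
    for length in meta["chunks"]:
        n_items *= length  # every chunk holds a full chunk's worth of items
    fmt = f"{byteorder}{n_items}{_STRUCT_CODES[kind]}"
    return list(struct.unpack(fmt, chunk_bytes))


# Made-up metadata and chunk payload for a 4 x 2 int32 array.
zarray = json.dumps({
    "zarr_format": 2, "shape": [4, 2], "chunks": [2, 2], "dtype": "<i4",
    "compressor": None, "fill_value": 0, "filters": None, "order": "C",
})
values = decode_chunk(zarray, struct.pack("<4i", 1, 2, 3, 4))
print(values)
```

A real binding would add the compressors and filters named in the metadata, which is where most of the implementation effort would likely go.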

This would be a viable topic that fits into the description @mwengren provided above, correct.

jarvis-001 commented 3 years ago

Thanks @benjwadams. I was thinking the same and will finish researching my remaining questions today.

jarvis-001 commented 3 years ago

OK @benjwadams @MathewBiddle, now I understand what exactly we are trying to do. The question now is how we will do it technically. To be precise, on the coding side, how will we implement all this?

benjwadams commented 3 years ago

Please create a draft proposal on the GSoC site based upon the previous resources provided. We would be looking at making bindings to Zarr within Java.

jarvis-001 commented 3 years ago

@benjwadams @MathewBiddle @mwengren I am working on my proposal but am unsure how to write the abstract and the timeline, and I need your guidance on both. We don't have many distinct subtasks, just one full task, so how should I lay that out on a timeline?

jarvis-001 commented 3 years ago

Hi @benjwadams @MathewBiddle, could you please break the project into some smaller tasks so that I can make an accurate timeline? For example, would we first go through the Zarr codebase to understand it and then implement Java bindings for it? It would be really great if anyone could elaborate on how they see the subtasks and work plan for the project.

jarvis-001 commented 3 years ago

@benjwadams @MathewBiddle @mwengren I have written my proposal. Could you please share your email addresses so that I can send it to you for review and suggestions?

jarvis-001 commented 3 years ago

@benjwadams @MathewBiddle @mwengren I am very sorry, I didn't know this initially, but I have now shared my draft. Please review it and let me know wherever you feel I need to make changes. And thanks, everyone, for the guidance so far.

benjwadams commented 3 years ago

@jarvis-001, please put it on https://summerofcode.withgoogle.com/ -- I got your link, but for maximum visibility you will want to place the draft on the GSoC page. You will have to put your final submission on that page.

jarvis-001 commented 3 years ago

@benjwadams I did share it. Please check again: is it fully visible now?

jarvis-001 commented 3 years ago

Hi Ben, is there anything more I need to work on in the proposal? Or may I submit it now?

benjwadams commented 3 years ago

As today is the deadline for the proposals, please submit the final version ASAP so we can enter into consideration for GSoC.

jarvis-001 commented 3 years ago

Hi Ben, Mathew, and Micah. Thank you so much for your constant support and help during the entire time. It made things a lot easier to understand ...

mwengren commented 1 year ago

This project didn't end up being selected during GSoC 2021. Closing in order to clean out old issues in prep for GSoC 2023.
