Can I work on this project?
Hello sirs, I am a third-year Computer Science Engineering student. My skill set includes Python, Java, C, C++, Firebase, HTML, CSS, JS, Flask, and TensorFlow, and I have a basic understanding of databases (SQL). I am not very familiar with Linux yet, but I will certainly start learning it, and I am very well versed in Python. Please allow me to work on this; I would like to keep contributing to this project in the future as well. Thanks in advance! @mwengren
Hi sir, I'm Harsh Shaw, a second-year undergraduate at SRM Chennai. For this project I think using GCP BigQuery would be beneficial, as it is capable of handling such large datasets, and integrating it with ML models on GCP won't be much of an issue. Please let me know if I'm thinking in the right direction. Thank you! @mwengren
About me: https://www.linkedin.com/in/harsh-shaw-070105174
Hi @mwengren
I am looking to contribute to the project ""Big Gridded Data": Distributed Cloud Storage for Physical Oceanography Data". I believe using GCP BigQuery could be highly beneficial for large volumes of data, and it would be much easier to integrate it with ML models.
About me: I am a sophomore at the Indian Institute of Technology, Roorkee.
I have a working familiarity with Linux (mainly its command line and a good idea of the filesystem architecture) and a good knowledge of Python and C++. I also have a good understanding of databases, including SQL and MongoDB.
My other skills include a good knowledge of machine learning tools and frameworks such as PyTorch and TensorFlow, which could help with the models, and a working knowledge of HTML, CSS, JS, React, Node, Express, and MongoDB (essentially the MERN stack for web development), plus basic Flask, which could help in designing an interface if needed (an additional thing we could add to this).
Could you please guide me on the implementation of the larger model so that I can get started, and are there any tasks I need to complete to get onto the team?
I am highly eager and excited to work with the team on this wonderful project.
Hi all, please read up on some of the standards commonly in use by the earth science and oceanographic communities listed below:
NetCDF - A multidimensional file format that allows metadata to be attached to variables. NetCDF is one of the most common file formats, and users will generally expect to be able to get NetCDF data back from a query: https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_introduction.html
OPeNDAP - A protocol over HTTP which allows access to data. The protocol is implemented by a number of data servers including ERDDAP, THREDDS, and PyDAP, and often ends up serving NetCDF files in some way. Here is the DAP2 standard: https://www.opendap.org/pdf/ESE-RFC-004v1.1.pdf DAP4 is current: https://docs.opendap.org/index.php/DAP4:_Specification_Volume_1
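To make the two standards above a bit more concrete, here is a minimal sketch (not from the thread; the file name and DAP URL are placeholders) of how a Python user typically opens NetCDF data both locally and over OPeNDAP, assuming xarray and netCDF4 are installed:

```python
import xarray as xr

# Local NetCDF file (hypothetical file name)
ds_local = xr.open_dataset("sea_surface_temperature.nc")

# OPeNDAP: the same API works over DAP; the URL below is a placeholder,
# not a real ERDDAP/THREDDS endpoint.
ds_remote = xr.open_dataset("https://example.org/thredds/dodsC/ocean/sst")

print(ds_local)             # variables, dimensions, and attached metadata
print(ds_remote.data_vars)  # only metadata is transferred until data is sliced
```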
I can think of at least a couple of approaches that might be of interest to the broader scientific community. There has been considerable interest in having reasonably performant access to data through the aforementioned DAP protocols via cloud object stores. Other avenues have looked at distributed processing of numerical data through libraries such as zarr. This article written by @rsignell-usgs details some efforts made on this front to represent NetCDF4 and HDF5 using zarr: https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314
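As an illustration of the cloud-performant pattern described in that article, here is a minimal sketch, assuming a hypothetical, anonymously readable Zarr store on S3 and the xarray/zarr/s3fs/fsspec stack; only the consolidated metadata is read up front and individual chunks are fetched on demand:

```python
import fsspec
import xarray as xr

# Map object-store keys to the dict-like interface zarr expects.
# The bucket and store name are placeholders.
mapper = fsspec.get_mapper("s3://example-bucket/ocean-model.zarr", anon=True)

# A single metadata read; no chunk data is downloaded yet.
ds = xr.open_zarr(mapper, consolidated=True)
print(ds)

# Only the chunks needed for this computation are fetched from S3.
sst_mean = ds["sst"].mean().compute()  # hypothetical variable name
print(sst_mean)
```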
Please don't hesitate to ask any questions for clarification of the underlying technologies and data.
Hi @benjwadams, is there a Slack or Discord channel where I can discuss this further with the mentors? I also want to ask: when we already have ERDDAP, which is a common data server with its own formats, why are we trying to store the data in the cloud, since we can make a direct call to an ERDDAP server via OPeNDAP?
Hi @harshshaw. What I felt was that ERDDAP can't be used to directly access Zarr-formatted data, and the NetCDF and HDF5 formats don't lend themselves to multiprocessing/parallel processing (parallel I/O is possible with MPI (Message Passing Interface), but it is quite hard). If we could use Zarr directly, we could do heavy computation entirely in the cloud and minimize operations on our local devices, and parallel processing would let I/O-bound work finish faster. By the way, @mwengren @benjwadams, is there a Slack or Discord channel where we can discuss this further with the mentors?
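For a rough idea of the parallelism being described here, this is a minimal sketch (the array path and layout are hypothetical, and it assumes zarr and dask are installed) of how Zarr chunks map onto dask tasks so a reduction can run in parallel without loading the whole array into memory:

```python
import dask.array as da
import zarr

# Hypothetical chunked Zarr array (local path or cloud-backed store).
z = zarr.open("ocean-model.zarr/temperature", mode="r")

# Each Zarr chunk becomes one dask task; the mean is computed
# chunk-by-chunk in parallel and then combined.
x = da.from_zarr(z)
print(x.mean().compute())
```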
@mwengren @benjwadams @daltonkell how do we proceed further?
Hi, I'm checking on the direction of prior work on Zarr and other technologies to determine an appropriate direction forward for this project.
OK Ben, I have now read through all the resources you gave in the issues section and have a good grasp of the theoretical concepts. Are there any more resources or concepts (theoretical or code) that we should read or understand to make further contributions easier?
And @MathewBiddle, what technologies will primarily be used in this project? I would like to start learning them so I can get a basic working idea of how they work and how they are used.
Possibly related and of interest: https://github.com/zarr-developers/community/issues/15
OK, sure Ben, I'll check this out.
@MathewBiddle pardon, but I didn't get which comment you are referring to. It just seems to point to the initial project description written by @mwengren.
Hi @MathewBiddle, I went through the first source and am going through the second. I'm really sorry I couldn't reply earlier because my mid-semester examinations are going on; they will be over by the 9th. By then I will have gone through these too and understood them thoroughly.
Hi @MathewBiddle, I went through the second source too and got a good idea about Zarr, N5, and related technologies, but I'm confused about what exactly we are trying to do: are we trying to replace HDF5 with Zarr or a similar technology for storing our data? So @MathewBiddle @benjwadams @mwengren @daltonkell, could we have a real-time meeting so that I can ask all my doubts related to the project? That would really help in writing the proposal and contributing further.
Any such meeting would speed up the work to a very good extent and give good clarity on how to proceed further. I also have some doubts regarding a few points in the GSoC proposal which could be cleared up in the meeting. I have no time constraints (except 9 AM to 12 AM IST (Indian Standard Time), as I will be having my examination then).
@jarvis-001, we are looking for modern technologies which can integrate well with distributed data formats such as cloud object stores (e.g. Amazon S3) for storing NetCDF-like data. If we can also get such a backend integrated into data servers such as THREDDS or ERDDAP, that would be a plus, but it is not necessary within the original scope of work.
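To illustrate what "storing NetCDF-like data in a cloud object store" can look like in practice, here is a minimal sketch, assuming the Python xarray/dask/zarr/s3fs stack and a hypothetical S3 bucket (not part of the original discussion):

```python
import numpy as np
import s3fs
import xarray as xr

# A small NetCDF-like dataset: a gridded variable plus coordinate metadata.
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), np.random.rand(4, 90, 180))},
    coords={
        "time": np.arange(4),
        "lat": np.linspace(-89, 89, 90),
        "lon": np.linspace(-179, 179, 180),
    },
    attrs={"title": "Example gridded sea surface temperature"},
)

# Chunk the data so each variable becomes many small objects in the bucket.
ds = ds.chunk({"time": 1, "lat": 45, "lon": 90})

# Map an S3 prefix to a dict-like store and write Zarr into it.
fs = s3fs.S3FileSystem()  # picks up AWS credentials from the environment
store = s3fs.S3Map("example-bucket/demo.zarr", s3=fs)  # hypothetical bucket
ds.to_zarr(store, mode="w", consolidated=True)
```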
Oh, thanks @benjwadams, now I do understand the whole problem statement and what we are actually trying to implement. But isn't s3-netCDF-python doing the same thing, and what exactly did you mean by modern technologies? Could you please elaborate a bit more if possible?
@jarvis-001, there exist bindings in netCDF-C. To my knowledge there currently aren't comparable bindings for Zarr-formatted data for JVM-based applications, although there is the start of such bindings here, for example: https://github.com/bcdev/jzarr . The community often uses THREDDS and ERDDAP to distribute data to end users, both of which are JVM-based applications; they could benefit downstream from development of such bindings, or, if said bindings are mature enough, from integrating them into either JVM application.
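For anyone looking at what a JVM binding such as JZarr actually has to implement, here is a minimal Python sketch (local directory store, made-up shape) showing that a Zarr store is just keys mapping to bytes: small JSON metadata documents plus compressed chunk objects, so any language that can parse JSON and decompress the chunks can implement the spec:

```python
import json
import os
import zarr

# Create a tiny array in a local directory store.
z = zarr.open("example.zarr", mode="w", shape=(100, 100),
              chunks=(10, 10), dtype="f4")
z[:] = 1.0

# The JSON metadata any implementation (Python, Java, ...) must understand:
with open("example.zarr/.zarray") as f:
    print(json.dumps(json.load(f), indent=2))
# -> shape, chunks, dtype, compressor, fill_value, order, zarr_format

# Chunks are separate objects keyed by grid index ("0.0", "0.1", ...),
# which maps directly onto file systems or object-store keys.
print(sorted(os.listdir("example.zarr"))[:5])
```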
Oh OK @benjwadams, I need to get a little more information about these points and will try to understand them completely by today (it's 2 AM IST, so within the next 22 hours). I also found this while reading about the JVM Zarr implementation: https://jzarr.readthedocs.io/en/latest/. And one more thing: since the data is currently stored in NetCDF format, are we trying to shift it to the Zarr format for benefits like parallel processing, etc.?
Just reading this thread for the first time. Is the GSoC task evolving into enhancing NetCDF-Java to read Zarr?
If so, it would be good to engage Unidata, as I think they've been working on this also (and they just released the NetCDF C library 4.8.0 with Zarr support!)
It certainly seems that this would be a desirable feature and would check a lot of the boxes for distributed data and cloud storage of data on services that are commonly used by the MetOcean community.
@rsignell-usgs, Do you know who we should direct further inquiries to at Unidata?
OK, so @benjwadams, I went through all the resources today and understood the following: since Java and other languages that compile to JVM bytecode are primarily used for server-side code, but the Zarr format is Python-based, we are trying to create bindings for Zarr-formatted data for JVM-based applications, so that THREDDS and ERDDAP, which are JVM-based, can be used to store and distribute Zarr data. Is that exactly what we are trying to do, @benjwadams?
Your understanding seems good. Zarr is primarily Python-based these days, although the specification is available, so there is no reason why it couldn't be implemented in JVM based languages.
This would be a viable topic that fits into the description @mwengren provided above, correct.
Thanks @benjwadams. I was thinking the same, and I will finish researching all my other doubts by today.
OK @benjwadams @MathewBiddle, now I understand exactly what we are trying to do. The remaining question is how we will do it technically; to be precise, how will we implement all of this in code?
Please create a draft proposal on the GSoC site based upon the previous resources provided. We would be looking at making bindings to Zarr within Java.
@benjwadams @MathewBiddle @mwengren I am working on my proposal but am confused about writing the abstract and the project timeline, and need your guidance for that. What should I write in the abstract and timeline? Since we don't have many distinct subtasks but rather one overall task, how should I lay that out on a timeline?
Hi @benjwadams @MathewBiddle, could you please break the project into some smaller tasks so that I can build an accurate timeline for the project? For example, will we first go through the Zarr codebase and then try to understand and implement Java bindings for it? It would be really great if anyone could elaborate on how they are thinking about subtasks and a work plan for the project.
@benjwadams @MathewBiddle @mwengren I have written my proposal. Would you please share your email IDs so that I can send it to you, and you can review it and give some suggestions?
@benjwadams @MathewBiddle @mwengren I'm very sorry, I didn't know initially, but I have now shared my draft. Please review it and let me know wherever you feel I need to make changes. And thank you, everyone, for guiding me so far.
@jarvis-001, please put it on https://summerofcode.withgoogle.com/ -- I got your link, but for maximum visibility you will want to place the draft on the GSoC page. You will also have to put your final submission on that page.
@benjwadams I did share it. Please check again; is it fully visible now?
Hi Ben, is there anything more I need to work on in the proposal, or may I submit it now?
As today is the deadline for proposals, please submit the final version ASAP so it can be entered into consideration for GSoC.
Hi Ben, Mathew, and Micah, thank you so much for your constant support and help during this entire time. It made things a lot easier to understand.
This project didn't end up being selected during GSoC 2021. Closing in order to clean out old issues in prep for GSoC 2023.
Project Description:
Storing highly-voluminous and highly-dimensional data has always presented challenges, and while hardware advancements have eased some of the burden, software remains the critical component in data management systems. This project will explore burgeoning solutions in the big-data realm to store massive volumes of highly-dimensional numeric data across distributed cloud platforms. Participants will examine tradeoffs between technologies and develop a deeper understanding of how new data storage and access solutions may be implemented in the oceanography industry.
Expected Outcomes:
A software cost-benefit analysis of data storage and access scenarios.
Skills required:
Familiarity with Linux/UNIX operating systems and a working knowledge of Python, C/C++. Understanding basic database architecture is a plus.
Difficulty:
Moderately difficult
Mentor(s):
@daltonkell Dalton Kell (Software Engineer), @benjwadams Ben Adams (Software Engineer)