ioos / erddapy

Python interface for ERDDAP
https://ioos.github.io/erddapy/
BSD 3-Clause "New" or "Revised" License

GSoC 2022 ideas #228

Closed ocefpaf closed 1 year ago

ocefpaf commented 2 years ago

Some users are looking for tools to help them assemble ERDDAP urls for use in their own workflows, while others would prefer to work at a higher, more opinionated level. I believe we can more cleanly separate functionality to help support the spectrum of erddapy users better.

Originally erddapy was meant to be a URL builder only. We added the main class later and stopped somewhere in between, trying to support many different usage patterns via the single primary class ERDDAP.

Issues to address

Proposed Solution

I propose that we separate erddapy into more functional layers, roughly following the SQLAlchemy core/ORM model.

Core Layer

The core layer would contain two primary components of functionality: url generation & data transformation. This layer makes no choices or assumptions about IO allowing it to be reused easily.
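As a rough illustration of what such an I/O-free core could look like, here is a minimal sketch. The names `build_tabledap_url` and `parse_csv_response` are hypothetical and not part of erddapy, the server and dataset are placeholders, and the constraint encoding is deliberately simplified (no quoting or percent-encoding). Real ERDDAP `.csv` responses also carry a units row, which this sketch does not handle.

```python
import csv
import io
from typing import Dict, List, Union


def build_tabledap_url(
    server: str,
    dataset_id: str,
    variables: List[str],
    constraints: Dict[str, Union[str, float]],
    response: str = "csv",
) -> str:
    """Return a tabledap request URL. Pure string work, no network access."""
    query = ",".join(variables)
    for key, value in constraints.items():
        # Keys carry the operator, e.g. {"depth<=": 10}.
        query += f"&{key}{value}"
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.{response}?{query}"


def parse_csv_response(payload: bytes) -> List[Dict[str, str]]:
    """Turn a CSV payload (however it was fetched) into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(payload.decode("utf-8")))
    return list(reader)


# Placeholder server and dataset; nothing here touches the network.
url = build_tabledap_url(
    "https://example.com/erddap",
    "some_dataset",
    ["time", "temperature"],
    {"depth<=": 10},
)
rows = parse_csv_response(b"time,temperature\n2022-01-01T00:00:00Z,4.5\n")
```

Because neither function fetches anything, the caller decides how the bytes get from the URL to the parser.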

Object (or opinionated) Layer

The object (or opinionated) layer would present higher-level objects for searching servers and accessing datasets, with a pandas- or xarray-like API that returns new objects or can be chained, in contrast to the transformational API of the current ERDDAP class. This layer uses much of the core functionality and presents it in easy-to-use ways, with an opinion as to the access method.

Additionally if possible these objects should be serializable, so they can be pickled and passed to other processes/machines (Dask/Dagster/Prefect).

For all of the remaining classes, either an ERDDAPConnection or a bare ERDDAP server url that will be transformed into an ERDDAPConnection can be passed in.
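The two points above (picklable objects, and accepting either a connection or a bare URL) could be sketched as follows. This is only an assumption about how it might be done: the fields of `ERDDAPConnection` and the helper `as_connection` are hypothetical, chosen to show the pattern rather than the eventual API.

```python
import pickle
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class ERDDAPConnection:
    """Hypothetical connection object holding only plain, picklable data."""

    server: str
    auth: Optional[Tuple[str, str]] = None


def as_connection(server: Union[str, ERDDAPConnection]) -> ERDDAPConnection:
    """Accept either a bare server URL or an existing connection."""
    if isinstance(server, ERDDAPConnection):
        return server
    return ERDDAPConnection(server=server)


conn = as_connection("https://example.com/erddap")
# Round-trip through pickle, as Dask or Prefect would when shipping work.
restored = pickle.loads(pickle.dumps(conn))
```

Keeping open sessions or file handles out of the object is what makes the round trip work; any live I/O resource would have to be created lazily on the receiving side.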

In Practice

So how do these work in practice? Let's look at a few different scenarios.

Interactive Search

Let's say that a user wants to find and query all datasets on a server that contain sea_water_temperature data.

First they initialize their server object. This can be done by passing in the server URL, the short name of the server, or an ERDDAPConnection object if authentication or IO methods need to be overridden.

[1] from erddapy import ERDDAPServer

[2] server = ERDDAPServer("neracoos")

Then they can use the native ERDDAP full text search to find datasets.

[3] water_temp_datasets = server.search("sea_water_temperature")
    water_temp_datasets

[3] {"nefsc_emolt_erddap": <TableDataset ...>, "UCONN_ARTG_WQ_BTM": <TableDataset...>, ...}

From there the user can access datasets a variety of ways depending on their needs.

[4] for dataset_id, dataset in water_temp_datasets.items():
        df = dataset.to_dataframe()
        # Whatever esoteric things fisheries people do with their dataframes

RATED-R-SUNDRAM commented 2 years ago

Hi, I am Shivam Sundram, a 3rd-year undergraduate at IIT Mandi. I have 3+ years of experience with Python and with contributing to open source. I worked on an ERDDAP-related project idea in the previous year's GSoC. Although I was not selected, I got a fair exposure to how ERDDAP works and how it is developed. Here's a link to my previous work proposal.

This year I would again like to contribute to IOOS on this particular project and expand my knowledge of ERDDAP. Please guide me on where to get started and which references to read so I can begin contributing to the project.

ocefpaf commented 2 years ago

Please guide me on where to get started and which references to read so I can begin contributing to the project.

You can start by taking a look at the code base and familiarizing yourself with the refactor proposed in this idea. If that is a task you want to tackle, the next step would be writing the proposal. Feel free to share a draft with us if you want some guidance there.

RATED-R-SUNDRAM commented 2 years ago

Hi @ocefpaf, I have gone through the code base, the documentation, and your proposed idea, and I would like some clarification before I start drafting a proposal. Under "Issues to address" you said that "Adding constraints to an ERDDAP object are stateful in place changes," and "Switching out IO is currently non-trivial due to URL generation and data transformations being tightly coupled to IO." Is the idea to use functions that create objects at several points, so as to get rid of the statefulness of the IO, or is there something else I missed?

abkfenris commented 2 years ago

I helped @ocefpaf write those ideas up, so hopefully I can help answer your questions.

There are several different things wrapped up together in that write-up, and I largely presented an end result rather than the steps to get there. Both the statefulness of the ERDDAP objects and the I/O coupling are issues.

Refactoring

I think the first step needed is to start refactoring as much as possible out of the current ERDDAP class and into standalone, minimal functions. So building URLs should be done by functions separate from those that transform a server response into the desired output type.

https://github.com/ioos/erddapy/blob/9effa07d35b44a3f386ba7099a86501630aa618e/erddapy/erddapy.py#L618-L634

The current output methods show the steps of data acquisition (create URL, get data, transform data into desired output type), but while some steps are refactored to methods (self.get_download_url()), and some to functions (urlopen()), the transformation of the response into iris is bound up into the method on the ERDDAP class and thus cannot be reused.

While the ability to reuse functionality is less of an issue when exploring a single dataset on a single server, it can be a bigger one when you are querying many datasets across many servers. Then, instead of relying on a synchronous method to fetch the data, it may be better to use various async methods.

While erddapy could have an async def async_to_iris(), if we provide clear building blocks for building URLs and transforming data that can be reused in def to_iris(), then users can swap out for their preferred data fetching library and reuse the parts of erddapy that aren't constrained by I/O.
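To make the idea of swappable fetching concrete, here is a small sketch in which the same URL builder and parser serve both a blocking and an async code path. All function names are hypothetical, and the stub fetchers stand in for whatever HTTP library a user prefers so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List


def data_url(server: str, dataset_id: str) -> str:
    """Build a request URL (hypothetical helper, no I/O)."""
    return f"{server}/tabledap/{dataset_id}.csv"


def parse(payload: bytes) -> List[str]:
    """Transform a raw response into lines (stand-in for a real parser)."""
    return payload.decode("utf-8").splitlines()


def load_sync(url: str, fetch: Callable[[str], bytes]) -> List[str]:
    """Blocking variant: the caller supplies any blocking fetch function."""
    return parse(fetch(url))


async def load_async(
    url: str, fetch: Callable[[str], Awaitable[bytes]]
) -> List[str]:
    """Async variant: same builder and parser, different fetch."""
    return parse(await fetch(url))


# Stub fetchers replace a real HTTP client so the sketch needs no network.
def fake_fetch(url: str) -> bytes:
    return b"time,temperature\n2022-01-01T00:00:00Z,4.5"


async def fake_fetch_async(url: str) -> bytes:
    return fake_fetch(url)


url = data_url("https://example.com/erddap", "some_dataset")
sync_rows = load_sync(url, fake_fetch)
async_rows = asyncio.run(load_async(url, fake_fetch_async))
```

Only the two thin `load_*` wrappers know whether the fetch is blocking or awaited; everything else is shared.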

It would also help the testability of erddapy by having smaller, better-defined pieces of functionality.

I'd suggest taking a look at Sans I/O for some thinking along these lines.

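# erddapy/core/tabledap.py
from __future__ import annotations  # keep the sketched annotations unevaluated

from typing import TYPE_CHECKING, Dict, List

if TYPE_CHECKING:
    import iris.cube  # optional dependency, only needed for type checking


def tabledap_url(
    server: str,
    dataset: str,
    protocol: str,
    variables: List[str],
    constraints: Dict[str, ...],
) -> str:
    """Form a tabledap request URL. No I/O happens here."""
    ...


def iris_url(
    server: str,
    dataset: str,
    variables: List[str],
    constraints: Dict[str, ...],
) -> str:
    """Form a URL that will cause ERDDAP to return a response that `iris_data()` can turn into an iris.cube.CubeList."""
    return tabledap_url(...)


def iris_data(response: str | bytes, **kw) -> iris.cube.CubeList:
    """Take the response of an ERDDAP server to a query built by `iris_url()` and return an iris.cube.CubeList."""
    ...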

Then those can be used in the existing ERDDAP.to_iris() method to provide backward compatibility, but also be reused elsewhere.

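# erddapy/erddapy.py
from erddapy.core import tabledap  # the proposed core module sketched above


class ERDDAP:

    def to_iris(self, **kw):
        # Build the URL, fetch the data, then transform it: three separate steps.
        url = tabledap.iris_url(self.server, self.dataset, ...)
        data = urlopen(url, self.auth, **self.response_kwargs)
        return tabledap.iris_data(data, **kw)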

Functional API

Once the core functionality is refactored out, we can build on top of it a new API that is more functional (like most of pandas or xarray). By building a functional API, we can support cleaner iterative usage and reduce the chances that users will end up with unintended variables or constraints due to mutating the wrong ERDDAP object.
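One way to picture such a functional API is an immutable query object whose methods return new objects instead of mutating in place. The class `TableQuery` and its `select`/`where` methods are hypothetical, a sketch of the pattern rather than a proposed interface.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class TableQuery:
    """Hypothetical immutable query; every change returns a new object."""

    dataset_id: str
    variables: Tuple[str, ...] = ()
    constraints: Dict[str, object] = field(default_factory=dict)

    def select(self, *names: str) -> "TableQuery":
        """Return a copy of the query with extra variables requested."""
        return replace(self, variables=self.variables + names)

    def where(self, **new: object) -> "TableQuery":
        """Return a copy of the query with extra constraints applied."""
        # Build a fresh dict so the original query's constraints stay untouched.
        return replace(self, constraints={**self.constraints, **new})


base = TableQuery("some_dataset")
shallow = base.select("time", "temperature").where(depth_max=10)
deep = base.select("time", "salinity").where(depth_min=100)
```

Both derived queries start from `base`, and neither sees the other's variables or constraints, which is the failure mode that in-place mutation makes easy to hit.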

vinisalazar commented 2 years ago

Hi @ocefpaf and @abkfenris,

I'm Vini, a graduate student at the University of Melbourne. I have around 5 years of experience with Python and have contributed to a few OS libraries such as BioPython, Snakemake, Spyder IDE (to the docs), and have also developed my own libraries (an example here).

I'm interested in applying to work on this project, as I am currently using ERDDAP and erddapy for my own research. I started a branch on my erddapy fork to try to implement some of what has been discussed here, e.g. I created a core subpackage with the iris_data function proposed by @abkfenris and a corresponding test. This is merely a draft, and naturally, as I started to develop the branch, many questions came up, especially regarding the structure of the package after the refactoring.

I will work on a proposal within the next few days and will report back here when it's done. Please let me know if you have any recommendations other than what's already documented in this issue.

Thank you for providing erddapy and I look forward to applying to GSoC!

Cheers, Vini

ocefpaf commented 2 years ago

@RATED-R-SUNDRAM and @vinisalazar sorry for my delayed response. I had some family health issues in the past few weeks.

Thanks @abkfenris for the guidance above. That is pretty much what we are looking for in your proposals. Please be sure to send them via the GSoC system before the 19th! If you have a draft and want to share it with us to get some feedback, please do so ASAP.

RATED-R-SUNDRAM commented 2 years ago

Hi @ocefpaf, I hope you are doing well now. I have a doubt that I would like to clear up before presenting my draft proposal to you. As you mentioned, we are supposed to first split the code into two parts: one is the core, which will hold the set of functions that are independent of the I/O constraints. Later we intend to refactor the main ERDDAP class by splitting the functions that allow it into standalone functions that are not constrained by I/O.

Please provide some input on this, as it would help me settle the final idea for the draft. Also, I am willing to do this project in the 175-hour category; I hope that's fine with you.

ocefpaf commented 2 years ago

@RATED-R-SUNDRAM 175 hours is fine as long as you make a feasible schedule for it. We won't judge a proposal based on quantity, only on quality.

If you are aiming for 175 hours, I would recommend that you focus only on the refactor part. That means moving all the URL builder functionality to a new module, as standalone functions, and making use of them in the ERDDAP class to keep compatibility.

RATED-R-SUNDRAM commented 2 years ago

@ocefpaf I have sent you a draft proposal by email. Please provide some input on it before I submit the final proposal on the GSoC website.

vinisalazar commented 2 years ago

Hello @ocefpaf and @abkfenris,

I hope you are well. I have just submitted a proposal through the GSoC dashboard. Please let me know if there are any problems or if you didn't receive it. I will still make some adjustments and submit an additional version by the program's deadline. If you can provide your feedback on the version I submitted, that would be incredibly helpful. Obviously, as we are close to the deadline and this weekend is a holiday in many parts of the world, I understand if that's not possible.

Thank you, Vini

shauryabaijal commented 2 years ago

Hello @ocefpaf and @abkfenris

I hope you are well. I am Shaurya Baijal. I have been working with machine learning and data science for two years now, and my passion for it has motivated me to take part in GSoC. After going through numerous organizations, I found IOOS very interesting; I am especially interested in applying my data science and data visualization skills. Therefore, I would love to spend my summer working with and contributing to the IOOS organization. I have around 2 years of experience with Python and Jupyter Notebook and have contributed to a few OS libraries such as BioPython and Spyder IDE.

I'm interested in applying to work on this project, as I am currently using ERDDAP. Developments in this project will go hand in hand with that work. I am fluent in Python and other object-oriented programming languages, and I have a preliminary understanding of cloud computing too (I worked on several Google labs on cloud computing earlier). My LinkedIn is: https://www.linkedin.com/in/shaurya-b-275630203

Thank you for providing erddapy and I look forward to applying to GSoC!

Thanks Shaurya

I have a question: for the erddapy project, will my GSoC 2022 mentors be only @ocefpaf and @abkfenris, or will other mentors also be there for guidance?

ocefpaf commented 2 years ago

@shauryabaijal due to the fast-approaching deadline, I recommend that you share your proposal with us ASAP, preferably via Google Docs. We can iterate there and make suggestions. However, it is possible that you'll have to submit before you get any feedback. At the moment we are swamped reviewing proposals that were already submitted.

shauryabaijal commented 2 years ago

@ocefpaf Sir, I have shared my proposal with you at your Gmail ID. Please kindly look at it and suggest what changes I need to make. I am extremely sorry, Sir, for the delay. Thanking you in anticipation.

shauryabaijal commented 2 years ago

@ocefpaf and @abkfenris Sir, is the size of the project (erddapy) medium or large? Please clarify.

ocefpaf commented 2 years ago

@ocefpaf and @abkfenris Sir, is the size of the project (erddapy) medium or large? Please clarify.

That depends on your proposal. We are accepting both, as long as your proposed schedule matches the amount of effort you plan to put into the project.

shauryabaijal commented 2 years ago

@ocefpaf and @abkfenris So, Sir, what should I enter as the project size for erddapy in my GSoC 2022 dashboard? On the GSoC dashboard it is clearly written

That Project size must match the size that the target organization expects. If you are proposing an idea from an organization list, it should match what is listed there.

Do not change the size (larger or smaller) without getting approval from the target organization first. It may result in your proposal being rejected.

Please kindly review my proposal and clarify this, Sir.


ocefpaf commented 2 years ago

That Project size must match the size that the target organization expects.

We will review the proposals first and then mark the expected size based on the best one. As we mentioned before, the size won't be part of the evaluation, only the quality of the proposal.

ocefpaf commented 1 year ago

GSoC22 is done. Let's close this issue.