GSoC 2022 ideas - Githubissues

ocefpaf commented 2 years ago

Some users are looking for tools to help them assemble ERDDAP urls for use in their own workflows, while others would prefer to work at a higher, more opinionated level. I believe we can more cleanly separate functionality to help support the spectrum of erddapy users better.

Originally erddapy was meant to be a URL builder only. We added the main class later and stopped in between, trying to support many different usage patterns via the single primary class ERDDAP.

Issues to address

Users in interactive workflows are required to transform ERDDAP objects as they wish to connect to new datasets, for example when moving from searching a server to visiting the datasets.
Adding constraints to an ERDDAP object is a stateful, in-place change, whereas most interactive users are used to NumPy/pandas/xarray-style workflows where you can return or chain together changes.
Switching out IO is currently non-trivial due to URL generation and data transformations being tightly coupled to IO.

Proposed Solution

I propose that we separate erddapy into more functional layers, roughly following the SQLAlchemy core/ORM model.

Core Layer

The core layer would contain two primary components of functionality: URL generation and data transformation. This layer makes no choices or assumptions about IO, which allows it to be reused easily.

URL generation - Functions to generate valid URLs from bare components, such as a dataset name, format, and dictionary of constraints to tabledap/M01_met_all.csv?time%2Cair_temperature%26air_temperature_qc%3D0%26time>%3D"2020-12-09T15%3A25%3A00.000Z"
Data transformation - Functions to convert a raw response (.csv, .nc, ...) into Pandas DataFrames, xarray Datasets...
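As a rough illustration of the URL generation half, here is a minimal sketch using only the standard library. The function name `tabledap_url_segment` and the convention of putting the operator in the constraint key (for example `"time>="`) are assumptions made for this example, not the erddapy API. The characters `>` and `"` are left unescaped only to mirror the example URL in the bullet above.

```python
from urllib.parse import quote


def tabledap_url_segment(dataset_id, file_type, variables, constraints):
    """Build the part of a tabledap URL that follows the server base URL.

    `constraints` maps a variable name plus operator (e.g. "time>=") to a value.
    """
    query = ",".join(variables)
    for key, value in constraints.items():
        # String values are wrapped in double quotes; numbers are used as-is.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        query += f"&{key}{rendered}"
    # Percent-encode the query, leaving > and " bare as in the example URL.
    encoded = quote(query, safe='>"')
    return f"tabledap/{dataset_id}.{file_type}?{encoded}"


segment = tabledap_url_segment(
    "M01_met_all",
    "csv",
    ["time", "air_temperature"],
    {"air_temperature_qc=": 0, "time>=": "2020-12-09T15:25:00.000Z"},
)
print(segment)
```

Because the function only builds a string, it can be tested without a network connection and reused with any HTTP client.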

Object (or opinionated) Layer

The object (or opinionated) layer would present higher level objects for searching servers and accessing datasets with a Pandas or xarray like returning or chainable API compared to the transformational API of the current ERDDAP class. This layer uses much of the core functionality and presents it in easy to use ways with an opinion as to the access method.

Additionally if possible these objects should be serializable, so they can be pickled and passed to other processes/machines (Dask/Dagster/Prefect).

class ERDDAPConnection
- While most ERDDAP servers allow connections via a bare url, some servers may require authentication to access data.
- .get(url_part: str) -> bytes or str
  - Method that actually requests the data.
  - Uses requests by default similar to most of the current erddapy data fetching functionality.
  - Can be overridden to use httpx, and potentially aiohttp or other async functionality, which could hopefully make anything else async compatible. (investigate await_me_maybe)
- .open(url_part: str) -> fp
  - Yields a file-like object for access (probably use fsspec.open under the hood) for file types/tools that don't enjoy getting passed a string.
- @property(server) -> ERDDAPConnection
  - Return a new ERDDAPConnection if trying to set a new server, or change other attributes rather than changing it in place.
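One possible shape for such a connection object is sketched below. It is a hypothetical illustration, not an existing erddapy class: the `fetch` field and the `with_server` method are names invented here, and the default fetcher uses `urllib` from the standard library as a stand-in for requests or httpx.

```python
from dataclasses import dataclass, replace
from typing import Callable
from urllib.request import urlopen


def _default_fetch(url: str) -> bytes:
    """Fetch a URL with the standard library (a stand-in for requests/httpx)."""
    with urlopen(url) as response:
        return response.read()


@dataclass(frozen=True)
class ERDDAPConnection:
    """Hypothetical sketch: holds the server URL and knows how to fetch from it."""

    server: str
    fetch: Callable[[str], bytes] = _default_fetch

    def get(self, url_part: str) -> bytes:
        # Join the server base and the dataset-specific part of the URL.
        return self.fetch(f"{self.server.rstrip('/')}/{url_part.lstrip('/')}")

    def with_server(self, server: str) -> "ERDDAPConnection":
        # Return a new connection instead of mutating this one.
        return replace(self, server=server)
```

Passing a different `fetch` callable is how authentication or another HTTP library could be plugged in under this sketch; the frozen dataclass makes in-place changes raise an error.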

For all of the remaining classes, either an ERDDAPConnection or a bare ERDDAP server url that will be transformed into an ERDDAPConnection can be passed in.

class ERDDAPServer
- .__init__(connection: str | ERDDAPConnection)
- .full_text_search(query: str) -> dict[str, ERDDAPDataset]
  - Use the native ERDDAP full text search capabilities
  - Returns a dictionary of search results with dataset ids as keys and ERDDAPDataset values.
- .search(query: str) -> dict[str, ERDDAPDataset]
  - Points to .full_text_search
- .advanced_search(**kwargs) -> dict[str, ERDDAPDataset]
  - Uses ERDDAP's advanced search capabilities (may return pre-filtered datasets)
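The sketch below shows how the search step could be split into a URL builder and a response parser. It assumes ERDDAP's `search/index.json` endpoint and its table-style JSON response (`columnNames` plus `rows`, with a `Dataset ID` column); the function names are hypothetical, and the parser returns the access protocol as a plain string where the proposal would return ERDDAPDataset objects.

```python
import json
from urllib.parse import urlencode


def search_url(server: str, query: str, items_per_page: int = 1000) -> str:
    """Build a full-text search URL that asks for a JSON response."""
    params = urlencode({"page": 1, "itemsPerPage": items_per_page, "searchFor": query})
    return f"{server.rstrip('/')}/search/index.json?{params}"


def parse_search(payload: str) -> dict:
    """Map each dataset ID in a JSON search response to "tabledap" or "griddap"."""
    table = json.loads(payload)["table"]
    names = table["columnNames"]
    results = {}
    for row in table["rows"]:
        record = dict(zip(names, row))
        # Assumption: a non-empty "griddap" cell marks a grid dataset.
        results[record["Dataset ID"]] = "griddap" if record.get("griddap") else "tabledap"
    return results
```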
class ERDDAPDataset

Base class for more focused table or grid datasets.
- @property(connection)
  - Underlying ERDDAPConnection
- .get(file_type: str) -> bytes or str
  - Requests the data using the .connection.get() method.
- .open(file_type: str) -> fp
  - Yields a file-like object for access.
- .get_meta()
  - Pulls the dataset info and caches it on the _meta attribute.
- ._meta
  - Set by .get_meta()
  - Passed when a setter returns a subclass.
  - .attrs -> pd.DataFrame - DataFrame of dataset attributes.
  - .variables -> dict - Dictionary of variables as keys, and maximum extent of constraints as values.
- @property(meta)
  - Returns the ._meta values, and will call .get_meta() if they are not already cached.
- @property(variables)
  - Lists the variables currently requested from the dataset.
  - Setting variables returns a new ERDDAPDataset subclass.
  - If _meta is cached and an invalid variable is set, throw a KeyError instead of returning.
- @property(constraints)
  - Returns the current constraints on the dataset.
  - Setting constraints returns a new ERDDAPDataset subclass.
  - If _meta is cached and an invalid constraint is set, throw a KeyError instead of returning.
- .url_segment(file_type: str) -> str
  - Everything but the base section of the url (http://neracoos.org/erddap/), so tabledap/A01_met.csv....
- .url(file_type: str) -> str
  - Returns a URL constructed using the underlying ERDDAPConnection base class server info, the dataset ID, access method (tabledap/griddap), file type, variables, and constraints.
  - This allows ERDDAPDataset subclasses to be used as more opinionated URL constructors while still not tying the users to a specific IO method.
  - Not guaranteed to capture all the specifics of formatting a request, such as if a server requires specific auth or headers.
- .to_dataset() - Open the dataset as an xarray dataset by downloading a subset NetCDF.
- .opendap_dataset() - Open the full dataset in xarray via OpenDAP.
class TableDataset(ERDDAPDataset)
- .to_dataframe() - Open the dataset as a Pandas DataFrame.
class GridDataset(ERDDAPDataset)
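To make the "setting returns a new object" behaviour concrete, here is a minimal sketch of an immutable table dataset. Everything in it is an assumption for illustration: the `with_variables` and `with_constraint` method names, the `known_variables` field standing in for the cached `_meta`, and the full percent-encoding of the query.

```python
from dataclasses import dataclass, replace
from typing import Optional
from urllib.parse import quote


@dataclass(frozen=True)
class TableDataset:
    """Hypothetical immutable dataset description; not the erddapy API."""

    dataset_id: str
    variables: tuple = ()
    constraints: tuple = ()  # tuple of (key, value) pairs
    known_variables: Optional[frozenset] = None  # stands in for cached _meta

    def with_variables(self, *names):
        # Validate only when metadata is cached, as the outline above suggests.
        if self.known_variables is not None:
            unknown = [n for n in names if n not in self.known_variables]
            if unknown:
                raise KeyError(f"unknown variables: {unknown}")
        return replace(self, variables=tuple(names))

    def with_constraint(self, key, value):
        # Return a copy with one more constraint; self is unchanged.
        return replace(self, constraints=self.constraints + ((key, value),))

    def url_segment(self, file_type):
        query = ",".join(self.variables)
        for key, value in self.constraints:
            rendered = f'"{value}"' if isinstance(value, str) else str(value)
            query += f"&{key}{rendered}"
        return f"tabledap/{self.dataset_id}.{file_type}?" + quote(query, safe="")
```

Each call returns a new object, so an earlier dataset can be kept and reused after a narrower one has been derived from it.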

In Practice

So how do these work in practice? Let's look at a few different scenarios.

Interactive Search

Let's say that a user wants to find and query all datasets on a server that contain sea_water_temperature data.

First they initialize their server object. This can be done by passing in the server URL, the short name of the server, or an ERDDAPConnection object if authentication or IO methods need to be overridden.

[1] from erddapy import ERDDAPServer

[2] server = ERDDAPServer("neracoos")

Then they can use the native ERDDAP full text search to find datasets.

[3] water_temp_datasets = server.search("sea_water_temperature")
    water_temp_datasets

[3] {"nefsc_emolt_erddap": <TableDataset ...>, "UCONN_ARTG_WQ_BTM": <TableDataset...>, ...}

From there the user can access datasets in a variety of ways, depending on their needs.

[4] for dataset_id, dataset in water_temp_datasets.items():
        df = dataset.to_dataframe()
        # Whatever esoteric things fisheries people do with their dataframes

RATED-R-SUNDRAM commented 2 years ago

Hi, I am Shivam Sundram, a 3rd-year undergraduate at IIT Mandi. I have 3+ years of experience with Python and contributing to open source. I worked on an ERDDAP-related project idea for the previous year's GSoC. Although I was not selected, I got fair exposure to how ERDDAP works and is developed. Here's a link to my previous work proposal.

This year I would again like to contribute to IOOS on this particular project and expand my knowledge of ERDDAP. Please guide me on where to get started and what all references to visit so I can get started with my contribution to the project.

ocefpaf commented 2 years ago

Please guide me on where to get started and what all references to visit so I can get started with my contribution to the project.

You can start by taking a look at the code base and familiarizing yourself with the refactor proposed in this idea. If that is a task you want to tackle, the next step would be writing the proposal. Feel free to share a draft with us if you want some guidance there.

RATED-R-SUNDRAM commented 2 years ago

Hi @ocefpaf, I have gone through the code base, the documentation, and your proposed idea, and I would like some clarification before I start drafting a proposal. In the Issues to address section you said that "Adding constraints to an ERDDAP object are stateful in place changes," and "Switching out IO is currently non-trivial due to URL generation and data transformations being tightly coupled to IO." Are we meant to use functions that create new objects at each step so as to get rid of the statefulness of the IO, or is there something else I missed?

abkfenris commented 2 years ago

I helped @ocefpaf write those ideas up, so hopefully I can help answer your questions.

There are several different things wrapped up together in that write-up, and I largely presented an end result rather than the steps to get there. Both the statefulness of the ERDDAP objects and the I/O coupling are issues.

Refactoring

I think the first step needed is to start refactoring as much as possible out of the current ERDDAP class and into standalone, minimal functions. So building URLs should be handled by separate functions from those that transform a server response into the desired output type.

https://github.com/ioos/erddapy/blob/9effa07d35b44a3f386ba7099a86501630aa618e/erddapy/erddapy.py#L618-L634

The current output methods show the steps of data acquisition (create URL, get data, transform data into desired output type), but while some steps are refactored to methods (self.get_download_url()), and some to functions (urlopen()), the transformation of the response into iris is bound up into the method on the ERDDAP class and thus cannot be reused.

While the ability to re-use functionality is less of an issue when exploring a single dataset on a single server, it can be a bigger one when you are querying many datasets across many servers. Then, instead of relying on a synchronous method to fetch the data, it may be better to use various async methods.

While erddapy could have an async def async_to_iris(), if we provide clear building blocks for building URLs and transforming data that can be reused in def to_iris(), then users can swap out for their preferred data fetching library and reuse the parts of erddapy that aren't constrained by I/O.

It also would help the test-ability of erddapy by having smaller more defined functionality.

I'd suggest taking a look at Sans I/O for some thinking along these lines.

# erddapy/core/tabledap.py

from typing import Any, Dict, List

import iris

def tabledap_url(
    server: str,
    dataset: str,
    protocol: str,
    variables: List[str],
    constraints: Dict[str, Any],
) -> str:
    """ Form a URL from the bare components; no I/O happens here """
    ...

def iris_url(
    server: str,
    dataset: str,
    variables: List[str],
    constraints: Dict[str, Any],
) -> str:
    """ Form a URL that will cause ERDDAP to return a response that `iris_data()` can turn into an iris.CubeList """
    return tabledap_url(...)

def iris_data(response: str | bytes, **kw) -> iris.CubeList:
    """ Take a response from an ERDDAP server from a query to `iris_url()` and return an iris.CubeList """
    ...

Then those can be used in the existing ERDDAP.to_iris() method to provide backwards compatibility, but also be reused.

# erddapy/erddapy.py

from erddapy.core import tabledap  # the proposed core module sketched above

class ERDDAP:

    def to_iris(self, **kw):
        url = tabledap.iris_url(self.server, self.dataset, ...)
        data = urlopen(url, self.auth, **self.response_kwargs)
        return tabledap.iris_data(data)
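The same idea can be shown end to end with a simplified stand-in. In the sketch below, `csv_url` and `csv_rows` are hypothetical pure functions, and the synchronous and asynchronous wrappers differ only in how the bytes are fetched. The `fetch` argument is an assumption of this example: any callable (or coroutine function) that takes a URL and returns bytes. Real ERDDAP CSV responses carry more structure, such as a units row, which this sketch ignores.

```python
import asyncio
import csv
import io


def csv_url(server, dataset_id):
    """Pure function: build a URL, no network access."""
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.csv"


def csv_rows(response: bytes):
    """Pure function: turn a CSV response body into a list of dicts."""
    return list(csv.DictReader(io.StringIO(response.decode("utf-8"))))


def to_rows(server, dataset_id, fetch):
    """Synchronous wrapper: `fetch` takes a URL and returns bytes."""
    return csv_rows(fetch(csv_url(server, dataset_id)))


async def async_to_rows(server, dataset_id, fetch):
    """Asynchronous wrapper reusing the same two pure functions."""
    return csv_rows(await fetch(csv_url(server, dataset_id)))
```

With this split, only the thin wrappers need to know about the HTTP library, and the pure functions can be unit tested with canned responses.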

Functional API

Once the core functionality is refactored out, we can build a new API on top of it that is more functional (like most of pandas or xarray). By building a functional API, we can support cleaner iterative usage and reduce the chances that users will end up with unintended variables or constraints due to mutating the wrong ERDDAP object.
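A small sketch of what "functional" could mean for constraints: instead of editing a shared dictionary, a helper returns a new read-only mapping. The helper name `with_constraint` and the constraint keys are assumptions for illustration only.

```python
from types import MappingProxyType


def with_constraint(constraints, key, value):
    """Return a new read-only mapping; the input mapping is left untouched."""
    updated = dict(constraints)
    updated[key] = value
    return MappingProxyType(updated)


# The base query stays usable after a narrower one is derived from it.
base = MappingProxyType({"air_temperature_qc=": 0})
recent = with_constraint(base, "time>=", "2020-12-09T15:25:00Z")
```

Because `base` is never modified, deriving `recent` cannot leak a time constraint into other code that still holds `base`.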

vinisalazar commented 2 years ago

Hi @ocefpaf and @abkfenris,

I'm Vini, a graduate student at the University of Melbourne. I have around 5 years of experience with Python and have contributed to a few OS libraries such as BioPython, Snakemake, Spyder IDE (to the docs), and have also developed my own libraries (an example here).

I'm interested in applying to work on this project, as I am currently using ERDDAP and erddapy for my own research. I started a branch on my erddapy fork to try to implement some of what has been discussed here, e.g. I created a core subpackage with the iris_data function proposed by @abkfenris and a corresponding test. This is merely a draft, and naturally, as I started to develop the branch, many questions came up, especially regarding the structure of the package after the refactoring.

I will work on a proposal within the next few days and will report back here when it's done. Please let me know if you have any recommendations other than what's already documented in this issue.

Thank you for providing erddapy and I look forward to applying to GSoC!

Cheers, Vini

ocefpaf commented 2 years ago

@RATED-R-SUNDRAM and @vinisalazar sorry for my delayed response. I had some family health issues in the past few weeks.

Thanks @abkfenris for the guidance above. That is pretty much what we are looking for in your proposals. Please be sure to send them via GSoC system before the 19th! If you have a draft and want to share with us, to get some feedback, please do so ASAP.

RATED-R-SUNDRAM commented 2 years ago

Hi @ocefpaf, I hope you are doing well now. I have some doubts that I would like to clear up before presenting my draft proposal to you. As you mentioned, we are supposed to first split the code into two parts: one being the core, which will hold the set of functions that are independent of the I/O constraints. Later, we intend to refactor the main ERDDAP class by splitting out whichever methods can become standalone functions that are not constrained by I/O.

Please provide some input on this, which would help me settle on the final idea for the draft. Also, I would like to do this project in the 175-hour category; I hope that's fine with you.

ocefpaf commented 2 years ago

@RATED-R-SUNDRAM 175 hours is fine as long as you make a feasible schedule for it. We won't judge a proposal based on quantity, only on quality.

If you are aiming for 175 hours, I would recommend that you focus only on the refactor part. That means moving all the URL builder functionality to a new module, as standalone functions, and making use of them in the ERDDAP class to keep compatibility.

RATED-R-SUNDRAM commented 2 years ago

@ocefpaf I have sent you a draft proposal by email. Please provide some input on it before I submit the final proposal on the GSoC website.

vinisalazar commented 2 years ago

Hello @ocefpaf and @abkfenris,

I hope you are well. I have just submitted a proposal through the GSoC dashboard. Please let me know if there are any problems or if you didn't receive it. I will still make some adjustments and submit an additional version by the program's deadline. If you can provide your feedback on the version I submitted, that would be incredibly helpful. Obviously, as we are close to the deadline and this weekend is a holiday in many parts of the world, I understand if that's not possible.

Thank you, Vini

shauryabaijal commented 2 years ago

Hello @ocefpaf and @abkfenris

I hope you are well. I am Shaurya Baijal. I have been working with machine learning and data science for two years now. My passion for it has motivated me to take part in GSoC. After going through numerous organizations, I found IOOS very interesting, and I am especially interested in using my data science and data visualization skills. Therefore, I would love to spend my summer working with and contributing to the IOOS organization. I have around 2 years of experience with Python and Jupyter Notebook and have contributed to a few OS libraries such as BioPython and Spyder IDE.

I'm interested in applying to work on this project, as I am currently using ERDDAP. Developments in this project will go hand-in-hand. I am fluent in Python and other object-oriented programming languages and have a preliminary understanding of cloud computing too (I worked on different Google labs regarding cloud computing earlier). My LinkedIn is: https://www.linkedin.com/in/shaurya-b-275630203

Thank you for providing erddapy and I look forward to applying to GSoC!

Thanks Shaurya

I have a question: for the erddapy project in GSoC 2022, will the mentors be only @ocefpaf and @abkfenris, or will other mentors also be there for guidance?

ocefpaf commented 2 years ago

@shauryabaijal due to the fast-approaching deadline, I recommend that you share your proposal with us ASAP, preferably via Google Docs. We can iterate there and make suggestions. However, it is possible that you'll have to submit before you get any feedback; at the moment we are swamped reviewing proposals that were already submitted.

shauryabaijal commented 2 years ago

@ocefpaf Sir, I have shared my proposal with you at your Gmail address. Please kindly look at it and suggest what changes I need to make. I am extremely sorry for the delay, Sir. Thanking you in anticipation.

shauryabaijal commented 2 years ago

@ocefpaf and @abkfenris Sir, is the size of the project (erddapy) medium or large? Please clarify.

ocefpaf commented 2 years ago

@ocefpaf and @abkfenris Sir, is the size of the project (erddapy) medium or large? Please clarify.

That depends on your proposal. We are accepting both as long as your proposed schedule makes sense for the amount of effort you plan to put into the project.

shauryabaijal commented 2 years ago

@ocefpaf and @abkfenris So Sir, what do I need to enter as the project size for erddapy in my GSoC 2022 dashboard? On the GSoC dashboard it is clearly written

That Project size must match the size that the target organization expects. If you are proposing an idea from an organization list, it should match what is listed there.

Do not change the size (larger or smaller) without getting approval from the target organization first. It may result in your proposal being rejected.

Please kindly review my proposal and clarify this, Sir.


ocefpaf commented 2 years ago

That Project size must match the size that the target organization expects.

We will review the proposals first and then mark the expected size based on the best one. As we mentioned before, the size won't be part of the evaluation, only the quality of the proposal.

ocefpaf commented 1 year ago

GSoC22 is done. Let's close this issue.

ioos / erddapy

GSoC 2022 ideas #228
