ioos / colocate

Co-locate oceanographic data by establishing constraints
MIT License

Parallelize ERDDAP server queries #14

Open mwengren opened 3 years ago

mwengren commented 3 years ago

Project Description:

Note: this GSoC project should be done in combination with #25 to improve the way the colocate library interacts with ERDDAP servers generally. This issue aims to improve how efficiently colocate searches ERDDAP servers for relevant data and extracts a subset of points to plot on the map view. #25 focuses on generating ERDDAP API URLs to extract individual datasets from ERDDAP servers. Interested students should submit applications that consider issues #14 and #25 together as an overall ERDDAP enhancement GSoC project.

Existing code in the erddap_query.py module: the query(url, **kw) search function and the get_coordinates(df, **kw) coordinate-extraction function.

Both of these tasks could be parallelized using Python libraries, for example:

from joblib import Parallel, delayed
import multiprocessing

Or maybe another technique? Implementing parallelization will improve response time in the notebook and/or support the eventual development of a dashboard application (#24) from the project.
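For illustration only, here is a minimal sketch of that idea, assuming erddapy is used to build each server's Advanced Search URL; the server URLs, the search_one_server() helper, and the example time filter are placeholders rather than colocate's actual code:

import multiprocessing

import pandas as pd
from erddapy import ERDDAP
from joblib import Parallel, delayed

# Placeholder server list; in colocate this would come from the awesome-erddap registry.
servers = [
    "https://gliders.ioos.us/erddap",
    "https://erddap.sensors.ioos.us/erddap",
]

def search_one_server(url, **kw):
    # Build the Advanced Search URL for one server and read the matching datasets.
    e = ERDDAP(server=url, protocol="tabledap")
    try:
        return pd.read_csv(e.get_search_url(response="csv", **kw))
    except Exception:
        return None  # unreachable server or no matching datasets

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores)(
    delayed(search_one_server)(url, min_time="2017-08-11T00:00:00Z") for url in servers
)
hits = [df for df in results if df is not None]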

Extra Credit/Part 2:

The parallelization of the ERDDAP search step in query(url, **kw) should be a fairly minimal effort and might not constitute a full project in terms of time commitment. More difficult will be parallelizing the second step, get_coordinates(df, **kw).

This has to do with the variability of the potential results from the search step: the app queries all known public ERDDAP services, each of which can have an unknown number of datasets with unknown data density. It is very easy to potentially overwhelm a single ERDDAP service with multiple parallel large data queries if a user searches too broadly or the results happen to include many datasets from a single ERDDAP.

An extension of this project could be to figure out how to change the UX of the app so that either:

a) the user interactively selects one or multiple datasets to display on the map together (rather than just displaying the first 10 of X unknown search results in random order, as it does now), or

b) the display of X results is parallelized, but data is requested from any single ERDDAP server only serially (or some other means of preventing excessive concurrent data requests to the same ERDDAP server is implemented), while still displaying an unknown number of dataset results from the search step in a DataShader map for visualization/preview. A rough sketch of this idea follows below.
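A minimal sketch of option (b), assuming each search hit carries the dataset ID and the ERDDAP server it came from; the hit list, server URLs, dataset IDs, and the fetch_serially() helper are illustrative placeholders, not colocate's API:

from collections import defaultdict

from joblib import Parallel, delayed

# Placeholder search hits: (server, dataset_id) pairs from the search step.
hits = [
    {"server": "https://gliders.ioos.us/erddap", "dataset_id": "example_glider_1"},
    {"server": "https://gliders.ioos.us/erddap", "dataset_id": "example_glider_2"},
    {"server": "https://erddap.sensors.ioos.us/erddap", "dataset_id": "example_station_1"},
]

# Group hits so that each server is handled by exactly one worker.
by_server = defaultdict(list)
for hit in hits:
    by_server[hit["server"]].append(hit["dataset_id"])

def fetch_serially(server, dataset_ids):
    # Placeholder: download each dataset from this server one at a time
    # (e.g. with erddapy's to_pandas()), so no server sees concurrent requests.
    return [(server, dataset_id) for dataset_id in dataset_ids]

# Parallelism only across servers, never within one server.
results = Parallel(n_jobs=len(by_server))(
    delayed(fetch_serially)(server, ids) for server, ids in by_server.items()
)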

Expected Outcomes: Faster results returned from users' filter queries against the ERDDAP server list in the 'awesome-erddap' repo, and the ability to plot with HoloViz/DataShader either the entire result set or a user-selected subset of results without overwhelming ERDDAP servers.

Skills required: Python programming, multi-threading

Difficulty: Low/Medium

Student Test:

Mentor(s): Micah Wengren @mwengren, Mathew Biddle @MathewBiddle

MathewBiddle commented 3 years ago

Yeah, I just saw this from @ocefpaf:

from joblib import Parallel, delayed
import multiprocessing

from erddapy import ERDDAP

# `e` is an erddapy client; this server URL is an assumption added for context.
e = ERDDAP(server="https://gliders.ioos.us/erddap")

def request_whoi(dataset_id):
    e.constraints = None
    e.protocol = "tabledap"
    e.variables = ["longitude", "latitude", "temperature", "salinity"]
    e.dataset_id = dataset_id
    # Drop units in the first line and NaNs.
    df = e.to_pandas(response="csv", skiprows=(1,)).dropna()
    return (dataset_id, df)

# whoi_gliders is assumed to be a list of glider dataset IDs defined elsewhere.
num_cores = multiprocessing.cpu_count()
downloads = Parallel(n_jobs=num_cores)(
    delayed(request_whoi)(dataset_id) for dataset_id in whoi_gliders
)

dfs = {glider: df for (glider, df) in downloads}

ocefpaf commented 3 years ago

BTW, doing that on a single server will get some sysadmins mad at you! But doing this across different servers is fine.

mwengren commented 3 years ago

For the first function, it should be fine, since we're just querying the Awesome ERDDAP server list, one query per server.

For the second, we randomize the dataset hits from the first query and iterate through them. There's a chance that they all could come from the same ERDDAP server.

We could skip the randomize step and figure out a safe way to parallelize this result set instead?

Basically, these few lines of code are where that happens.

Rohan-cod commented 3 years ago

Venerated Sir, I hope you are safe and in good health in the wake of the prevailing COVID-19 pandemic. My name is Rohan Gupta and I am a 3rd-year Computer Science undergraduate student at Shri Mata Vaishno Devi University. I have been working with Python and deep learning for a couple of years now and have in-depth knowledge of them. I look forward to contributing to this idea as part of this year's GSoC. It would be a great assistance if you could suggest how to get started. My LinkedIn profile: https://www.linkedin.com/in/rohang4837b4124/

mwengren commented 3 years ago

@Rohan-cod I added a 'Student Test' to the description with some ideas for how to get started. Please take a look there.

Rohan-cod commented 3 years ago

Sure Sir. I will try my best to solve the test. It will help me get a deep understanding of the project and the codebase.

RATED-R-SUNDRAM commented 3 years ago

I have gone through the student tests for this issue and issue #25, and based on my understanding of the project I have created a student proposal: https://docs.google.com/document/d/13Zx1hBk42hHPZuacIIlSohCtS7q-LrGcfvgVkd89lmI/edit?usp=sharing

mwengren commented 2 years ago

An implementation of parallelized query has been added to erddapy, see ioos/erddapy#199.

In the erddapy implementation, the 'standard' ERDDAP search is used per format_search_string(). This allows search by keyword, essentially.

For colocate, however, we use the ERDDAP 'Advanced Search' to filter datasets (provided by the erddapy get_search_url() function). This is implemented in colocate's query() function by passing the **kw parameter on to erddapy to get finer-grained results. Here's an example of the filter parameters we're currently passing via **kw:

{'min_lon': -72.706369,
 'max_lon': -64.17569700000001,
 'min_lat': 35.69571500000001,
 'max_lat': 41.28732200000002,
 'min_time': '2017-08-11T00:00:00Z',
 'max_time': '2019-12-31T00:00:00Z',
 'standard_name': 'sea_water_practical_salinity'}

Let's use the erddapy implementation of parallelized search in multiple_server_search.py as a starting point and implement something similar for colocate. Then perhaps it can be pushed upstream to erddapy if appropriate.

I also have the beginning of an implementation in this branch.
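As a hedged illustration of the Advanced Search path described above (not code from that branch), the kw dict could be handed to erddapy's get_search_url() for one server and the resulting dataset listing read back as a DataFrame; the server URL here is a placeholder:

import pandas as pd
from erddapy import ERDDAP

kw = {
    "min_lon": -72.706369,
    "max_lon": -64.17569700000001,
    "min_lat": 35.69571500000001,
    "max_lat": 41.28732200000002,
    "min_time": "2017-08-11T00:00:00Z",
    "max_time": "2019-12-31T00:00:00Z",
    "standard_name": "sea_water_practical_salinity",
}

e = ERDDAP(server="https://gliders.ioos.us/erddap", protocol="tabledap")
search_url = e.get_search_url(response="csv", **kw)
datasets = pd.read_csv(search_url)  # one row per dataset on this server matching the filters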

RATED-R-SUNDRAM commented 2 years ago

Hi Micah, thanks a lot for letting me know about the execution of the task. Finally, one of the key tasks in colocate has been done. Although I would have loved to do it for GSoC instead, its execution gives me immense satisfaction. Just a gentle reminder: if we can collaborate on any projects in the future, I will always be up for it. With regards, Shivam Sundram

mwengren commented 2 years ago

PR submitted to (mostly) resolve this: #26.

There are still issues for someone to solve, however:

  1. How to implement a timeout in joblib to prevent waiting for the 'slowest' of the ERDDAP server collection to respond (or timeout).
  2. Whether/how to parallelize the get_coordinates(df, **kw) function without sending too many requests to any one ERDDAP server.

ocefpaf commented 2 years ago

> 1. How to implement a timeout in joblib to prevent waiting for the 'slowest' of the ERDDAP server collection to respond (or timeout).

The function that is being parallelized should probably have the timeout. I suggest the timeout_decorator package, but there may be newer techniques out there.
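For instance, a minimal sketch of that suggestion using the timeout_decorator package, assuming do_query(server) is the per-server search worker handed to joblib (its body and keyword arguments are placeholders here):

import timeout_decorator

@timeout_decorator.timeout(30, use_signals=False)  # give up on this server after 30 seconds
def do_query(server, **kw):
    ...  # the existing single-server ERDDAP search would run here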

> 2. Whether/how to parallelize the get_coordinates(df, **kw) function without sending too many requests to any one ERDDAP server.

My guess is that the latest ERDDAP servers will allow two concurrent connections, right? So there isn't much to be done in terms of making them parallel :-/

mwengren commented 2 years ago

@ocefpaf I hoped you might have some ideas. This would be a good project for someone to tackle for OHW if anyone is interested.

My initial approach to use the timeout parameter in joblib.Parallel - in my development branch here - fails because if the timeout is triggered, the assignment to results on line 24 fails, even for any Parallel jobs that have successfully completed before the timeout is triggered.

If that library supported timeouts natively in such a way, that would be great, because we could just set the timeout to 30 seconds and ignore any servers that don't respond by then (similar to how erddap.com must do it). But perhaps it must be done by the implementing function, like you mentioned.

It got too complicated for me, so I quit at that point.

Just adding this link for reference; this is where I discovered the expected behavior of the timeout parameter: https://coderzcolumn.com/tutorials/python/joblib-parallel-processing-in-python#5

ocefpaf commented 2 years ago

> My initial approach to use the timeout parameter in joblib.Parallel - in my development branch here - fails because if the timeout is triggered, the assignment to results on line 24 fails, even for any Parallel jobs that have successfully completed before the timeout is triggered.

Yeah. That is why I suggest adding the timeout decorator in do_query(server) instead. It would probably require some "gracefully" failed result that we can use later in the list.
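One possible shape for that graceful failure, sketched under the same assumptions as above (do_query's body, the 30-second limit, and the placeholder server list are illustrative): catch the timeout inside the worker so joblib still returns a value for every server and completed searches are not lost.

import pandas as pd
import timeout_decorator
from joblib import Parallel, delayed

@timeout_decorator.timeout(30, use_signals=False)
def do_query(server, **kw):
    # Placeholder body: the existing single-server ERDDAP search would go here.
    return pd.DataFrame()

def do_query_safe(server, **kw):
    try:
        return do_query(server, **kw)
    except timeout_decorator.TimeoutError:
        return pd.DataFrame()  # an empty frame marks a server that timed out

servers = ["https://gliders.ioos.us/erddap"]  # placeholder for the awesome-erddap list

results = Parallel(n_jobs=4)(delayed(do_query_safe)(server) for server in servers)
non_empty = [r for r in results if not r.empty]
all_hits = pd.concat(non_empty, ignore_index=True) if non_empty else pd.DataFrame()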