hapi-server / client-python

Python client for HAPI
BSD 3-Clause "New" or "Revised" License

Improved API suggestion #3

Open ehsteve opened 5 years ago

ehsteve commented 5 years ago

Following up on our discussion during the Heliopython meeting, I thought I would provide my suggestions here. The purpose of these suggestions is also to make it easier to document the input arguments.

First, the client object should be instantiable regardless of whether the server is available; therefore

h = hapi(url)

This should return an instance ready to connect to the url. It would then be worthwhile to provide a test of server availability such as

h.connect()

This API would match the one used by the Python socket module. It would return False if it does not get a response from the server, so that something like the following would be possible

if h.connect():
   do something

Once the connection is established, information about the datasets available from the server could be requested with

list_of_datasets = h.info()

Next, a user would likely want to connect to a particular dataset and then ask for data. This could be done as follows

h.set_dataset(list_of_datasets[0])
data = h.get_data(start_time, end_time)

If a user wanted to get the data from all datasets then they could easily run

for this_dataset in list_of_datasets:
  h.set_dataset(this_dataset)
  data = h.get_data(start_time, end_time)

It may also be worthwhile to provide an optional argument at instantiation to enable the following

with hapi(url, connect=True) as h:
   do something with h
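
To make this concrete, here is a rough skeleton of the class this would imply (all names here are placeholders for discussion, not an existing implementation; only the /capabilities, /catalog, and /data endpoints and their query parameters follow the HAPI specification):

import requests

class hapi:
    def __init__(self, url, connect=False):
        self.url = url
        self.dataset = None
        if connect:
            self.connect()

    def connect(self):
        # Return True if the server responds, False otherwise.
        try:
            return requests.get(self.url + '/capabilities', timeout=10).ok
        except requests.RequestException:
            return False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def info(self):
        # Return the list of dataset ids from the server catalog.
        r = requests.get(self.url + '/catalog')
        return [entry['id'] for entry in r.json()['catalog']]

    def set_dataset(self, dataset):
        self.dataset = dataset

    def get_data(self, start_time, end_time):
        # Fetch CSV data for the currently selected dataset.
        query = {'id': self.dataset, 'time.min': start_time, 'time.max': end_time}
        return requests.get(self.url + '/data', params=query).text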

rweigel commented 5 years ago

I think the main problem is that the hapi function call signature is too overloaded. A function-based approach could be

from hapiclient import hapi

status = hapi.status(server) # True or False

array = hapi.catalog(server)

array = hapi.capabilities(server) 

dict = hapi.info(server, dataset[, parameters])

data = hapi.data(server, dataset, parameters, start, stop)

I think hapi.connect is a bit misleading because it suggests that the connection persists, as if it were, say, a SQL database connection. This type of function is not present in the urlopen or urlread libraries. I suggest

if hapi.status(server):
    # Do something

Regarding using a class, it could be

from hapiclient import HAPI

H = HAPI(server)

status = H.status()

array = H.catalog()

array = H.capabilities() 

H.set_dataset(dataset)
dict = H.info()

H.set_parameters(parameters)
dict = H.info()

data = H.data(start, stop)

The number of lines is about the same, and I think the function case is more familiar to the typical user. I actually think typical scientist code that uses HAPI will look like about 3-5 of these blocks (based on data used in a paper):

server     = 'https://cdaweb.gsfc.nasa.gov/hapi'
dataset    = 'AC_H0_MFI'
start      = '2001-01-01T05:00:00'
stop       = '2001-01-01T06:00:00'
parameters = 'Magnitude,BGSEc'
opts       = {'logging': True, 'use_cache': True}    

data1,meta1 = hapi(server, dataset, parameters, start, stop, **opts)

server     = 'http://hapi-server.org/servers/SSCWeb/hapi'
dataset    = 'ace'
start      = '2001-01-01T05:00:00'
stop       = '2001-01-01T06:00:00'
parameters = 'X_GSE,Y_GSE,Z_GSE'
opts       = {'logging': True, 'use_cache': True}

data2,meta2 = hapi(server, dataset, parameters, start, stop, **opts)

Or,

start = '2001-01-01T05:00:00'
stop  = '2001-01-01T06:00:00'

ds = []
ds.append(['https://cdaweb.gsfc.nasa.gov/hapi','AC_H0_MFI','Magnitude,BGSEc'])
ds.append(['https://cdaweb.gsfc.nasa.gov/hapi','WIND','Vx,Vy,Vz'])
ds.append(['http://hapi-server.org/servers/SSCWeb/hapi','ace','X_GSE,Y_GSE,Z_GSE'])
ds.append(['http://hapi-server.org/servers/SSCWeb/hapi','wind','X_GSE,Y_GSE,Z_GSE'])

d = []
for server, dataset, parameters in ds:
    data = hapi.data(server, dataset, parameters, start, stop)
    d.append(data)

# Save data and then read data in analysis and plot scripts. No need to re-run
# above again.

Most likely the above would be iterated over a set of 5-10 start/stop times.
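
For example, something like the following sketch (hapi.data is the hypothetical function form from above; the time windows and the file name are arbitrary):

import pickle

windows = [('2001-01-01T05:00:00', '2001-01-01T06:00:00'),
           ('2001-02-01T05:00:00', '2001-02-01T06:00:00')]

results = []
for start, stop in windows:
    block = []
    for server, dataset, parameters in ds:
        block.append(hapi.data(server, dataset, parameters, start, stop))
    results.append(block)

# Save once; analysis and plot scripts read the pickle instead of re-downloading.
with open('results.pkl', 'wb') as f:
    pickle.dump(results, f)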

I don't expect much use of hapi.status() to see if the server is running. For the most part I expect users to get the data and move on, similar to downloading a bunch of files, parsing them, and then saving in a native binary format. Once that is done, one never re-runs the download and parsing scripts again.

I also don't expect a user to be using the catalog and info functions. HAPI intentionally provides only a minimal amount of metadata. The expectation is that a user will use some search facility to find the dataset names and parameters they want and then copy the names into a script. In retrospect, I suppose that is why I accepted the overloading of hapi. I figured most users would only ever want to use the form

data = hapi.data(server, dataset, parameters, start, stop)

@jvandegriff and @jbfaden - any thoughts?

ehsteve commented 5 years ago

I don't see how this approach solves "the main problem is that the hapi function call signature is too overloaded", which I think we both agree on. This just moves the overload problem from hapi() to hapi.data(), right? It is also unclear to me what the hapi instance would do if all functionality is given by the hapi.data() function call.

The other idea would be to provide both approaches and let users decide how they want to interact with the server. Your use case suggests an advanced user who knows exactly what they want, gets it, and moves on. That approach is not very friendly to an exploratory user who is trying to find out what is available.

On the usefulness of h.status(), the main purpose I see is to not force users to deal with exceptions or errors. So

if h.status():
   data = hapi.data(server, dataset, parameters, start, stop)
else:
   data = None

instead of

try:
   data = hapi.data(server, dataset, parameters, start, stop)
except SomeError:
  data = None

jeandet commented 3 years ago

Hi @rweigel @jbfaden, I was about to write the same issue. After the first day of use, I think that hapi(*args, **kwargs) isn't expressive enough. There should be a function for each request (server inventory, dataset info, parameter info); this would be more intuitive. Users could benefit from code completion, and function signatures would be better with named arguments instead of args/kwargs. You could still maintain hapi(*args, **kwargs) for some time with a deprecation warning and just dispatch to the specific function depending on the arguments (see the sketch below). This would also make the function bodies less scary :p.
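
A rough sketch of what I mean (hapi_catalog, hapi_info and hapi_data are made-up names for the per-request functions):

import warnings

def hapi_catalog(server, **opts): ...                                  # hypothetical
def hapi_info(server, dataset, **opts): ...                            # hypothetical
def hapi_data(server, dataset, parameters, start, stop, **opts): ...   # hypothetical

def hapi(*args, **kwargs):
    # Keep the old overloaded entry point, warn, and dispatch on argument count.
    warnings.warn("hapi(*args) is deprecated; use the per-request functions",
                  DeprecationWarning, stacklevel=2)
    if len(args) == 1:
        return hapi_catalog(*args, **kwargs)
    if len(args) == 2:
        return hapi_info(*args, **kwargs)
    return hapi_data(*args, **kwargs)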

I also wonder if it makes sense to return a dict with fields like status, instead of an Optional[object] whose members contain the expected data; the current API looks like it mixes abstraction levels.
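
For example (just a sketch; Catalog and get_catalog are made-up names):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Catalog:
    datasets: list

def get_catalog(url) -> Optional[Catalog]:
    ...  # return a Catalog on success and None on failure, instead of a dict mixing status and payload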

Last point: I think that hapi-client should just provide data access, and any plot/display features should be moved to another package. hapi-client should not leak a matplotlib dependency; the user (or a downstream package) can provide their own plotting if needed.

rweigel commented 3 years ago

There is going to be a major refactoring soon and we will revisit this. The issue of documentation and code completion has bothered me too. I wrote this when I was just learning Python, after finishing the MATLAB version. In MATLAB, variable input/output is the de facto convention and it works well. I was also trying to mimic the REST interface. Somehow this does not translate well to Python.

At the very least, there should be four different functions such as hapiclient.servers(), hapiclient.datasets(server), hapiclient.parameters(server, dataset), and then hapiclient.data(...).
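
That is, signatures roughly like the following (a sketch for discussion, not committed code):

def servers():
    """Return the list of known HAPI server URLs."""

def datasets(server):
    """Return the dataset ids available from server (the /catalog response)."""

def parameters(server, dataset):
    """Return the parameters of a dataset (the /info response)."""

def data(server, dataset, parameters, start, stop, **opts):
    """Return data and metadata for the given parameters and time range."""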

In the second paragraph, I think that you are suggesting that the interface be more like the urllib* Python libraries. I agree that would make more sense. Doing this would take a lot of thought given that it would require a major re-write, and I'm not sure how it would impact caching or the fact that the client splits long requests into parts and then requests the parts separately and in parallel.
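
For reference, the splitting and parallel requests are roughly of this form (a simplified sketch built on the data function sketched above, not the actual client code):

from concurrent.futures import ThreadPoolExecutor

def data_parallel(server, dataset, parameters, intervals, **opts):
    # intervals is a list of (start, stop) pairs covering the full request;
    # fetch each part in its own thread and return the parts in order.
    def fetch(interval):
        return data(server, dataset, parameters, interval[0], interval[1], **opts)
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fetch, intervals))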

I agree that it is time to move the plotting functionality into another package. It was there for testing the client during development but has grown too much.

jbfaden commented 3 years ago

I'd suggest using existing names for functions, like hapiclient.getCatalog(server), hapiclient.getInfo(server, id), hapiclient.getData(...).

jeandet commented 3 years ago

I was about to propose some kind of POC so we can discuss around some code and see how hapi-client could evolve. The idea is not necessarily to rewrite the whole client, but to have some minimal code that works, see if it makes sense to pursue this direction, and then figure out how to turn it into a PR for hapi-client. I'll do my best to get something before Monday.

jeandet commented 3 years ago

POC started here

rweigel commented 3 years ago

I think the POC is a good starting point for discussion. I probably won't have time for it until mid-December, but will discuss it when I meet with one of the developers who is working on the parallel requests branch and see what his thoughts are.

jeandet commented 3 years ago

Here is a basic notebook showing how to use the POC API https://nbviewer.jupyter.org/github/jeandet/hapi_client_poc/blob/main/examples/demo.ipynb

Few random thoughts after writing this code:

  1. I didn't write a binary or JSON data parser; this is not needed for this POC.
  2. I added a cache for requests, except for data, since that needs more work and it's already done in SPWC and the current hapi-client implementation. See notes here.
  3. Parallel request handling is quite easy to implement here.
  4. Both caching and splitting are just request decorators; this ensures separation of concerns and makes the code more modular (see the sketch after this list).
  5. I feel like using an object instead of a dictionary is not that bad, since it offers code completion and allows some nice features like here (__description could be dropped since get_info is cached).
  6. The API can be used either as get_{endpoint}(url, ...) or with context managers, as here.
  7. I still don't know what the correct return type from get_data should be. Pandas DataFrames don't play well with 2D+ data, and simple NumPy arrays miss all the nice indexing features of Pandas. For SPWC, I ended up writing my own basic class which provides handy indexing features (slicing and merging are used a lot for caching and request splitting) without Pandas limitations such as the dimension count or the missing metadata attribute. Maybe something like this could be used?
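
Sketch of the decorator idea from point 4 (cached and get_info are placeholder names; a splitting decorator would wrap a data request the same way, breaking the interval into parts and concatenating the results):

import functools

def cached(func):
    # Memoize an endpoint call on its arguments, so caching stays
    # separate from the request code itself.
    store = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in store:
            store[args] = func(*args)
        return store[args]
    return wrapper

@cached
def get_info(url, dataset):
    ...  # perform the actual /info request here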

That's all for today! :)

rweigel commented 3 years ago

A few quick points:

Regarding 2., the client already has an option for caching data and metadata. There has been some discussion on a specification for the data cache. We are planning to modify the client so it uses this specification.

We almost have the parallel request code completed. It turns out to be a bit more complicated than I expected.

Regarding 7., there is another group working on a data model for Python and heliophysics data. My plan was to keep the plain NumPy array and then allow users to use the library that they produce to map to a different data model; in other words, keep the client data model as simple as possible.

jeandet commented 3 years ago

About the cache, I'm not sure I see the benefit of a specification other than sharing data across client implementations. Since the cache should provide good performance and be robust, this might imply totally different implementations depending on the client.

After a quick look at the wiki page:

Writing such code might get tricky since it exposes many corner cases, especially with multiprocessing (concurrent access to files is always a mess). I would really advise using diskcache: it's insanely fast and thread/process safe, plus it has a really friendly UX.
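
For example (a sketch; the cache directory and the wrapped function are arbitrary):

from diskcache import Cache

cache = Cache('./hapi_cache')   # thread- and process-safe on-disk cache

@cache.memoize()
def get_info(url, dataset):
    ...  # perform the /info request; repeated calls are served from disk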