ontodev / sprocket

BSD 3-Clause "New" or "Revised" License

Support Swagger endpoint #20

Closed beckyjackson closed 2 years ago

beckyjackson commented 2 years ago

Resolves #11 - this time no database is created, we just hit the endpoint for each request.


Alternatively, you can provide the URL of a PostgREST (Swagger) endpoint, such as https://www.cmi-pb.org/api/v2. Each request is sent to the endpoint, and the JSON results are displayed in the same HTML table format as when a database is provided.

sprocket https://www.cmi-pb.org/api/v2
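The per-request flow described above can be sketched roughly as follows. This is illustrative, not Sprocket's actual code: `build_table_url` is a hypothetical helper, and the query parameter names follow PostgREST conventions.

```python
# Hypothetical sketch: build a PostgREST-style query URL for one table
# request. Each page load would fetch this URL and render the JSON rows
# as an HTML table, instead of querying a local database.
from urllib.parse import urlencode

def build_table_url(endpoint, table, select=None, limit=100, offset=0):
    """Build a URL like endpoint/table?limit=...&offset=...&select=..."""
    params = {"limit": limit, "offset": offset}
    if select:
        # PostgREST takes a comma-separated column list in `select`
        params["select"] = ",".join(select)
    return f"{endpoint.rstrip('/')}/{table}?{urlencode(params)}"
```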
jamesaoverton commented 2 years ago

This is working well for the CMI-PB endpoint. Two things:

  1. Are you seeing an InsecureRequestWarning: Unverified HTTPS request ...?
  2. The caching strategy only works for a single table. If I look at 'cell_type' and then 'specimen', I will see checkboxes for the 'cell_type' columns. For now I would prefer caching to a file: a dictionary from table names to a list of column names.

I also tried this with the IEDB API: sprocket https://query-api.iedb.org. It's slow, but the API itself is slow. Can you please try it and make sure Sprocket is doing what you expect?
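The file-backed cache suggested in point 2 could look roughly like this. A sketch only: the cache path and function names are made up for illustration, and `fetch_columns` stands in for whatever call retrieves a table's columns from the endpoint.

```python
# Sketch of a file-backed cache: one JSON dictionary mapping table names
# to column lists, so viewing 'specimen' doesn't clobber the cached
# columns for 'cell_type'. Path and names are hypothetical.
import json
import os

CACHE_PATH = ".sprocket_columns.json"  # hypothetical cache file

def load_cache(path=CACHE_PATH):
    """Load the table-name -> column-list dictionary, if it exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def get_columns(table, fetch_columns, path=CACHE_PATH):
    """Return cached columns for `table`, fetching and caching on a miss."""
    cache = load_cache(path)
    if table not in cache:
        cache[table] = fetch_columns(table)  # e.g. one request to the API
        with open(path, "w") as f:
            json.dump(cache, f)
    return cache[table]
```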

beckyjackson commented 2 years ago

Are you seeing an InsecureRequestWarning: Unverified HTTPS request ...?

Yes. I couldn't hit the HTTPS endpoint without including verify=False on my request. I get the following:

requests.exceptions.SSLError: HTTPSConnectionPool(host='www.cmi-pb.org', port=443): Max retries exceeded with url: /api/v2/specimen?select=specimen_id,subject_id,specimen_type,visit (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))

I did try a couple of different fixes for this issue, and none of them resolved it. This may only be an issue on my end, in which case we should remove that verify=False.
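One middle-ground option (not what the PR does, just a sketch) is to make verification opt-out instead of hard-coding verify=False, so users with working certificates keep verification and only machines with the local-issuer problem disable it. The `insecure` flag name is illustrative.

```python
# Sketch: keep TLS verification on by default, let the user opt out.
# Disabling verification also silences the InsecureRequestWarning we
# would knowingly trigger.
import requests
import urllib3

def request_opts(insecure=False):
    """Keyword arguments for requests.get(); verification on by default."""
    if insecure:
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return {"verify": not insecure}

# usage (hypothetical): requests.get(url, **request_opts(insecure=True))
```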

jamesaoverton commented 2 years ago

Hmm. I'm not sure this is respecting my specified 'limit' for Swagger.

beckyjackson commented 2 years ago

It's not exactly respecting the limit: it displays only the limited results, but I need to know how many total results exist. For small datasets that's OK, but for large ones it could become an issue (maybe that's why IEDB is slow, although I wasn't seeing super slow responses). In order for the "next" button to work correctly at the end of a set of data, we need to know the total number of results. One solution is to run one query when we first load the table (as with the columns) that counts the total results, and store the count in a file like we do with the columns:

SELECT COUNT(*) FROM tablename

Then we already have the total and don't need to fetch all results every time.

I'll try implementing this today and then we can see if it speeds up the IEDB results.

beckyjackson commented 2 years ago

Whoops, I just realized we can't run a query like that against the endpoint. I guess we could fetch all results once and then cache the total? But that defeats the purpose of not storing the results themselves. What if we had a .swagger/ directory that stores the JSON?

jamesaoverton commented 2 years ago

PostgREST puts range, count, and pagination information in the HTTP response headers: https://postgrest.org/en/v9.0/api.html?highlight=limit#limits-and-pagination
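Concretely, per the linked docs, sending a Prefer: count=exact request header makes PostgREST return a Content-Range response header like 0-24/3573 (items 0-24 of 3573 total). A small sketch of parsing it (the helper is illustrative, not Sprocket's code):

```python
# Sketch: parse PostgREST's Content-Range response header to get the
# total row count for pagination, e.g. "0-24/3573" -> items 0-24 of 3573.
# An empty result set comes back as "*/0".
def parse_content_range(value):
    """Return (first, last, total) from a Content-Range like '0-24/3573'."""
    range_part, total = value.split("/")
    if range_part == "*":  # no rows in range
        return None, None, int(total)
    first, last = range_part.split("-")
    return int(first), int(last), int(total)
```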

beckyjackson commented 2 years ago

This is exactly what I needed, thanks!

beckyjackson commented 2 years ago

This is weird: it looks like in order to use the offset query parameter with the IEDB API, you must also include an order. This is not the case for the CMI-PB API. Without order, you get an error (only from IEDB):

Query string appears to include an offset parameter without an order parameter. Please resubmit the query with an order parameter to ensure consistent paging. The query was not sent to the API.

It's a little annoying that this isn't consistent between databases. Should we apply a default ordering on the first column for all databases, or do you have any other ideas for a workaround?

jamesaoverton commented 2 years ago

PostgreSQL does not order tables by default, so if you provide an offset without an order, you can get unpredictable results. Some IEDB API users complained about this, so the IEDB API checks for this case and returns this error message before PostgREST is called.

I'm not sure what we should do about that. Probably nothing, for now: just display the error. We can explicitly specify an order.
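The default-order workaround discussed above could be sketched like this. Assumptions: the helper name is hypothetical, and the order value uses PostgREST's column.asc syntax.

```python
# Sketch: when paginating with an offset, always send an explicit order
# on the first column so APIs like IEDB's (which reject offset without
# order) are satisfied, and paging stays consistent everywhere.
def paging_params(columns, limit, offset, order=None):
    """Query parameters for a paged request, defaulting order to column 1."""
    if order is None and offset:
        order = f"{columns[0]}.asc"  # PostgREST order syntax
    params = {"limit": limit, "offset": offset}
    if order:
        params["order"] = order
    return params
```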

beckyjackson commented 2 years ago

I think this is ready for review again @jamesaoverton

The first time you load a table will take a bit longer, but since the totals and columns are cached after that, subsequent requests should be much faster.

The table is very messy for tables with lots of columns (e.g., antigen_search). I'm not sure what to do about that.