punkish / zenodeo

`nodejs` interface to Zenodo/BLR community
https://zenodeo.org
Creative Commons Zero v1.0 Universal
6 stars 1 forks source link

Zenodeo 3.0 rfc #50

Open punkish opened 4 years ago

punkish commented 4 years ago

Zenodeo version 3.0

These quieter days of July-August, I have been working on Zenodeo version 3.0. The next release of the API will have significant changes that will make it more powerful and flexible for developers. The general scheme will remain the same

┌────────┐┌─┐┌────────┐                   
│resource││?││query   │                   
└────────┘└─┘└────────┘                  
     │            │                     
     ▼            │                     
                  │                     
treatments        │                     
                  │                     
                  │                     
                  │                     
                  ▼                     
             key=value

// for example
https://zenodeo.org/treatements?querystring                       

A big enhancement in the new API will be a simple but effective query language that will be embeddable in the querystring. Here is an example (all subject to change between now and release)

https://zenodeo.org/treatments?treatmentTitle_$start=Carvalhoma&journalTitle_$end=Taxonomy&publicationDate_$between=y2016m1d12-y2018m6d10&articleAuthor_$contains=Slater&$cols=treatmentId,zenodoDep,journalYear&$page=1&$size=30&$sortby=journalYear_asc,zenodoDep_desc

Here is the querystring broken down in key-value pairs for clarity

treatmentTitle_$start=Carvalhoma
journalTitle_$end=Taxonomy
publicationDate_$between=y2016m1d12-y2018m6d10
articleAuthor_$contains=Slater
$cols=treatmentId,zenodoDep,journalYear
$page=1
$size=30
$sortby=+journalYear,-zenodoDep

The sequence of actions on the server will happen like this

  1. Find all treatments where treatmentTitle begins with Carvalhoma and journalTItle ends with Taxonomy and publicationDate is between Jan 12, 2016 and Jun 10, 2018 and articleAuthor contains Slater
  2. Select treatmentId, zenodoDep and journalYear for all the found rows
  3. Sort the rows by journalYear in ascending and then zenodoDep in descending order
  4. Return the first 30 rows (30 from 1)

Notes:

  1. Keys or parts of keys starting with $ are reserved terms used by the Zenodeo API (SQL mavens will be very familiar with most of these terms)
  2. All $start, $end, and $contains queries will be case-insensitive
  3. Multiple $sortby column_direction pairs will be possible in a comma separated list

I will be publishing other details of the API as time goes by, but I wanted to get this one out right away so I can benefit from your collective insights, feelings, or concerns. I have been writing the url parsing and query building module, so your input at this time will be very valuable. Looking forward to your feedback.

@mguidoti @myrmoteras @tcatapano @teodorgeorgiev @gsautter @slint

punkish commented 4 years ago

The new API will have two new resources authorities and articleAuthors. The first one, authorities will be queryable by authorityName and authorityYear, and will deprecate treatmentAuthors. I am still toying with the idea of normalizing authorityName and articleAuthor but will probably not go there as that is a rat's nest.

punkish commented 4 years ago

ZQL (Zenodeo Query Language)

Note 1: all _$... commands are optional. If they are omitted then an exact equality will be performed. Note 2: more than one terms can be provided as a comma separated list for equality searches which will trigger an IN search. For example (searches will be case-insensitive even though I am showing them with preserved case below; the database itself is COLLATE NOCASE so searches ignore case)

authorityName=Agosti → authorityName = 'Agosti'
authorityName=Agosti,Penev → authorityName IN ('Agosti', 'Penev')

Commands that modify search

These commands are passed as _$... suffixes to query parameters (column names in the data store). If these commands are not present then an equality = search is performed. If they are present, then the search is modified accordingly. If they are wrong or invalid commands, an error is thrown.

param_$starts=<string>: @param LIKE '<string>%'
param_$ends=<string>: @param LIKE '%<string>'
param_$contains=<string>: @param LIKE '%<string>%'
q=<string>: @q MATCH <string> [^1]
param_$eq=<string>: = @param = '<string>'
param_$since=<string>: dateconvert(@param >= dateconvert('<string>')
param_$until=<string>: dateconvert(@param <= dateconvert('<string>')
param_$between=<string>: dateconvert(@param) >= dateconvert('<string1>') && dateconvert(@param) <= dateconvert('<string2>') [^2]
param_$gte=<string>: @param >= Cast('<string>' AS NUMERIC)
param_$lte=<string>: @param <= Cast('<string>' AS NUMERIC)
param_$gt=<string>: @param > Cast('<string>' AS NUMERIC)
param_$lt=<string>: @param < Cast('<string>' AS NUMERIC)

[1]: q is a special param that does a fulltext search against a special table [2]: If param is a date field, then date comparion is used. If param is a number, then numeric comparison is used as follows param_$between=<string>: @param >= Cast('<string>' AS NUMERIC) && @param <= Cast('<string>' AS NUMERIC)

Commands that modify the result

These commands are passed as standalone $... keys and they modify the found results. If these commands are not present then appropriate defaults are used. If they are present and are wrong or invalid commands, an error is thrown.

$sortby=<colname>.<sortdir asc|desc>: ORDER BY <colname> <sortdir = 'asc' ? ASC : DESC> default resourceId ASC
$page=<integer>: OFFSET (@page - 1) default 1
$size=<integer>: LIMIT @size default 30
$facets=<boolean>: default false
$stats=<boolean>: default false
$xml=<boolean>: default false
$refreshCache=<boolean>: default false
punkish commented 4 years ago

URL construction rules

  1. All routes start with resource name plural – /treatments or /materialsCitations
  2. routes are case sensitive, param values are case-insensitive – /materialsCitations is the correct resource, /materialscitations will result in a 404
  3. queryParams are case-sensitive, their values are not – ?articleAuthor=Agosti and ?articleAuthor=agosti will return the same records but ?articleauthor=Agosti will result in an error
  4. All queryParams are optional, but some have default values (for example, if $page is not provided, it will default to 1 and if $size is not provided, it will default to 30). If no queryParams are provided then a default result with counts of the resource will be returned
  5. $facets and $stats are expensive, and only certain clients need them. They will be performed only when explicitly requested
  6. Every resource has a whose name is the singular of the resource suffixed with 'Id' – treatmentstreatmentId or materialsCitationsmaterialsCitationId
  7. Different routes can lead to the same resource, but never to different resources. Synonymous routes internally will point to the same result set by ID (I still have to figure out how to implement this)
  8. Only standards-compliant characters will be allowed in the URL
punkish commented 4 years ago

ZQL syntax

I wrote a program that successfully parses API query string and builds safe SQL queries. But then, I had a rethink about the syntax. The outcome is below, before and after. I think the second version is cleaner as it doesn't have that '$' nonsense, and the keys are straightforward. I would love to know what you all think. Can you think of any problems/gotchas with the second version?

Initial suggestion

param_$starts_with=<string>: @param LIKE '<string>%'
param_$ends_with=<string>: @param LIKE '%<string>'
param_$contains=<string>: @param LIKE '%<string>%'
q=<string>: @q MATCH <string> [^1]
param_$eq=<string>: @param = '<string>'
param_$since=<date-string>: dateconvert(@param >= dateconvert('<string>')
param_$until=<date-string>: dateconvert(@param <= dateconvert('<string>')
param_$between=<date-string>: dateconvert(@param) >= dateconvert('<string1>') && dateconvert(@param) <= dateconvert('<string2>') [^2]
param_$gte=<number>: @param >= Cast('<string>' AS NUMERIC)
param_$lte=<number>: @param <= Cast('<string>' AS NUMERIC)
param_$gt=<number>: @param > Cast('<string>' AS NUMERIC)
param_$lt=<number>: @param < Cast('<string>' AS NUMERIC)
loc_$within=<number1 kms>,<number2 lat>,<number3 lng>: 

$col=default|all|<string1 col>,<string2 col>…
$size=<integer>: LIMIT @size default 30
$page=<integer>: OFFSET (@page - 1) default 1
$sortby=<colname>.<sortdir asc|desc>: ORDER BY <colname> <sortdir = 'asc' ? ASC : DESC> default resourceId ASC
$facets=<boolean>: default false
$stats=<boolean>: default false
$xml=<boolean>: default false
$refreshCache=<boolean>: default false

Improved version

param=starts_with(<string>): @param LIKE '<string>%'
param=ends_with(<string>): @param LIKE '%<string>'
param=contains(<string>): @param LIKE '%<string>%'
q=<string>: @q MATCH <string> [^1]
param=<string>: @param = '<string>'
param=since(<date-string>): dateconvert(@param >= dateconvert('<string>')
param=until(<date-string>): dateconvert(@param <= dateconvert('<string>')
param=between(<date-string1>-<date-string2>: dateconvert(@param) >= dateconvert('<string1>') && dateconvert(@param) <= dateconvert('<string2>') [^2]
param=gte(<number>): @param >= Cast('<string>' AS NUMERIC)
param=lte(<number>): @param <= Cast('<string>' AS NUMERIC)
param=gt(<number>): @param > Cast('<string>' AS NUMERIC)
param=lt(<number>): @param < Cast('<string>' AS NUMERIC)
loc=within(<number1 kms>,<number2 lat>,<number3 lng>)

col=default|all|<string1 col>,<string2 col>…
size=<integer>: LIMIT @size default 30
page=<integer>: OFFSET (@page - 1) default 1
sortby=<colname>.<sortdir asc|desc>: ORDER BY <colname> <sortdir = 'asc' ? ASC : DESC> default resourceId ASC
facets=<boolean>: default false
stats=<boolean>: default false
xml=<boolean>: default false
refreshCache=<boolean>: default false

Note: All zql operators are optional. If not provided, the default operator for that param will be used as described in the data-dictionary.

punkish commented 4 years ago

ZQL syntax

Development on ZQL is moving along. Based on my experience so far, I have made the following modifications to the syntax

Even more improved version

param=eq(<string>): @param = '<string>%' → synonym of 'param=<string>'
param=starts_with(<string>): @param LIKE '<string>%'
param=ends_with(<string>): @param LIKE '%<string>'
param=contains(<string>): @param LIKE '%<string>%'
q=<string>: @q MATCH <string> [^1]
param=<string>: @param = '<string>'
param=since({"y":<yyyy>, "m": <mm?>, "d": <dd?>}): date(@param >= date('<string>')
param=until({"y":<yyyy>, "m": <mm?>, "d": <dd?>}): date(@param <= date('<string>')
param=between({"from":{"y":<yyyy>, "m": <mm?>, "d": <dd?>},"to":{"y":<yyyy>, "m": <mm?>, "d": <dd?>})): date(@param) >= date('<string1>') && date(@param) <= date('<string2>') [^2]
param=gte(<number>): @param >= Cast('<string>' AS NUMERIC)
param=lte(<number>): @param <= Cast('<string>' AS NUMERIC)
param=gt(<number>): @param > Cast('<string>' AS NUMERIC)
param=lt(<number>): @param < Cast('<string>' AS NUMERIC)
loc=within({"r":<number>, units: "<string>", "lat":<real>, "lng":<real>})

col=default|all|<string1 col>,<string2 col>…
size=<integer>: LIMIT @size default 30
page=<integer>: OFFSET (@page - 1) default 1
sortby=<colname>.<sortdir asc|desc>: ORDER BY <colname> <sortdir = 'asc' ? ASC : DESC> default resourceId ASC
facets=<boolean>: default false
stats=<boolean>: default false
xml=<boolean>: default false
refreshCache=<boolean>: default false

As you can see above, the main difference is that in composite operators such as the date operators (between(), since(), and until()) and the location operator (within()), a JSON-style notation is used for the values. One potential disadvantage is that the querystring has to be URL encoded, but that is good practice anyway. The major advantage is that there is no ambiguity anymore and the sub-params can be provided in any order.

Note: All zql operators are optional. If not provided, the default operator for that param will be used as described in the data-dictionary.