plone / plone.restapi

RESTful API for Plone.
http://plonerestapi.readthedocs.org/
86 stars 78 forks source link

Querystring Search Performance #1252

Closed tisto closed 1 year ago

tisto commented 3 years ago

Possible Solutions

Use GET instead of POST

GET /plone/folder/@querystring-search/1669e4ae9c5258f973ea00a6c31578a9 HTTP/1.1
Accept: application/json
Authorization: Basic YWRtaW46c2VjcmV0
Content-Type: application/json

{
    "body": {
        "b_size": "2",
        "b_start": "0",
        "fullobjects": "False",
        "limit": "10",
        "query": [
            {
                "i": "Title",
                "o": "plone.app.querystring.operation.string.is",
                "v": "Welcome to Plone"
            },
            {
                "i": "path",
                "o": "plone.app.querystring.operation.string.path",
                "v": "/news"
            }
        ],
        "sort_on": "sortable_title",
        "sort_order": "reverse",
    },
    "id": "doc1",
    "method": "GET",
}

Problems to solve

tisto commented 3 years ago

jMeter Performance Test Results

querystring-search performance tests

Jenkins (3)

respondingTimeGraphPerTestCaseMode

-> you can see that the results are actually pretty close to each other, at least when we hammer the instance a bit. -> with less load, you can see that "fullobjects=1" takes double the time than the request without, when doing lots of request that does not matter that much any longer

spread of the results for one kind of request

Jenkins (2)

-> you can see that even for the exact same request (in this case, content object with fullobjects=1) the response times are pretty widespread, this looks like a load problem

cc @jensens @cekk @sneridagh

jensens commented 3 years ago

See linked PR at ZCatalog above here.

tisto commented 3 years ago

@jensens on our way back home from Sorrento, @jackahl and I came up with an idea of how to solve the querystring-search caching issue. We could, as discussed, use a GET request instead of POST and use subtraversal for a unique URL. Though, instead of either encoding the complex query in the URL as GET params (which has lots of possible side effects and the character limit), we could just send the query in the body and generate a unique hash in the URL, to make it unique.

I outlined the idea above:

https://github.com/plone/plone.restapi/issues/1252#issue-1037098395

And I also started to play around with plone.rest, to see if that works:

https://github.com/plone/plone.rest/pull/125

Though, the main question would be if we get away with bending the HTTP standard:

https://stackoverflow.com/questions/978061/http-get-with-request-body?answertab=votes#tab-top

and if Varnish/Cloudflare or any other system (Python requests lib, curl, etc. seem to work) just works or if we run into issues.

The other option would be to send the encoded querystring in the URL as subtraversal, with the problem that it is pretty opaque for devs and that it can run into the 2000 characters limit:

https://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers

@jackahl gave me the idea for the solution we came up with when he proposed to send the "clean" query within the HTTP body, just for reference and dev convenience.

I'd love to hear what you think! :)

cc @sneridagh @ericof

@lukasgraf @buchi did you ever had the need to cache POST requests in plone.restapi?

jensens commented 3 years ago

My overall reaction: Great idea, using a hash and sub-traversal .... but is this within the spec?

Specification wise (HTTP 1.1) in RFC 7231 https://www.rfc-editor.org/rfc/rfc7231#page-25

A payload within a GET request message has no defined semantics; sending a payload body on a GET request might cause some existing implementations to reject the request.

When we implement it this way we have no guarantee if webserver, caches, web application firewalls, and such operating between the client and the api-server are working correctly.

My conclusion: No, better not. We open pandoras box here.

I would propose to check the length of the encoded payload, and if the length exceeds the limit: go with a POST. According to the last URL in the previous post here, this is for a JS-generated request something around 2k is save, which is a lot for a query. Browser detection and increasing the limit based on this could be an option.

Inspired by your idea to use the body, we could use the Headers to send the payload (combined with a hash). But this feels somehow wrong and hackish as well, even if we would kind of stay within the specs.

Overall this is not that satisfying

tisto commented 3 years ago

@jensens I share your doubts and I agree that this could cause problems. Though, Elastic Search is doing exactly this (GET with a query in the payload):

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-search-api-example

Therefore it might be worth a shot. If we would run into issues, this would be a deal-breaker of course.

Making the frontend decide to use either GET or POST could be a valid way to go. Though we still have to fix the nested querystring issue and this would mean that long queries can not be cached at all.

I thought about using headers as well. Though, as you said, this feels hacky and there is also a size limit on HTTP headers. The body is the only size-safe way.

tisto commented 3 years ago

Just a follow up: Solr also uses GET for search requests and it seems that GET is even the default here:

https://solr.apache.org/guide/8_10/json-request-api.html https://dzone.com/articles/solr-select-query-get-vs-post

tisto commented 3 years ago

@sneridagh @tiberiuichim @plone/volto-team my main challenge right now it how to properly convert a nested dict with an array to a proper querystring. In Volto we are using the "query-string" lib which intentionally does not support nesting:

https://github.com/sindresorhus/query-string#nesting

After digging into this rabbit whole I can fully understand this decision. Though, this would mean that on Volto (or any other restapi request) needs to "stringify" an object within the request.

In JS this would look like this:

const queryString = require('query-string');

queryString.stringify({
    foo: 'bar',
    nested: JSON.stringify({
        unicorn: 'cake'
    })
});
//=> 'foo=bar&nested=%7B%22unicorn%22%3A%22cake%22%7D'

In Python like this:

        payload = {
            "query": json.dumps(
                [
                    {
                        "i": "Title",
                        "o": "plone.app.querystring.operation.string.is",
                        "v": "Welcome to Plone",
                    },
                    {
                        "i": "path",
                        "o": "plone.app.querystring.operation.string.path",
                        "v": "/news",
                    },
                ]
            ),
            "sort_on": "sortable_title",
            "sort_order": "reverse",
            "limit": "10",
            "fullobjects": "False",
            "b_start": "0",
            "b_size": "2",
        }

        response = requests.get(
            self.document.absolute_url(),
            headers={"Accept": "application/json"},
            params=payload,
            auth=(SITE_OWNER_NAME, SITE_OWNER_PASSWORD),
        )

(link to my playground in plone.rest: https://github.com/plone/plone.rest/pull/125/files)

Though, there are other libraries (like "qs") which seem to support nested querystrings:

> var qs = require('qs');
undefined
> query = {
...     "query": [
...         {
.....             "i": "Title",
.....             "o": "plone.app.querystring.operation.string.is",
.....             "v": "Welcome to Plone",
.....         },
...         {
.....             "i": "path",
.....             "o": "plone.app.querystring.operation.string.path",
.....             "v": "/news",
.....         },
...     ],
...     "sort_on": "sortable_title",
...     "sort_order": "reverse",
...     "limit": "10",
...     "fullobjects": "False",
...     "b_start": "0",
...     "b_size": "2",
... }
{
  query: [
    {
      i: 'Title',
      o: 'plone.app.querystring.operation.string.is',
      v: 'Welcome to Plone'
    },
    {
      i: 'path',
      o: 'plone.app.querystring.operation.string.path',
      v: '/news'
    }
  ],
  sort_on: 'sortable_title',
  sort_order: 'reverse',
  limit: '10',
  fullobjects: 'False',
  b_start: '0',
  b_size: '2'
}
> var str = qs.stringify(query)
undefined
> console.log(str)
query%5B0%5D%5Bi%5D=Title&query%5B0%5D%5Bo%5D=plone.app.querystring.operation.string.is&query%5B0%5D%5Bv%5D=Welcome%20to%20Plone&query%5B1%5D%5Bi%5D=path&query%5B1%5D%5Bo%5D=plone.app.querystring.operation.string.path&query%5B1%5D%5Bv%5D=%2Fnews&sort_on=sortable_title&sort_order=reverse&limit=10&fullobjects=False&b_start=0&b_size=2
var out = qs.parse(str)
undefined
> console.log(out)
{
  query: [
    {
      i: 'Title',
      o: 'plone.app.querystring.operation.string.is',
      v: 'Welcome to Plone'
    },
    {
      i: 'path',
      o: 'plone.app.querystring.operation.string.path',
      v: '/news'
    }
  ],
  sort_on: 'sortable_title',
  sort_order: 'reverse',
  limit: '10',
  fullobjects: 'False',
  b_start: '0',
  b_size: '2'
}

See https://github.com/ljharb/qs#parsing-objects for details.

Let me know what you think.

sneridagh commented 3 years ago

At first sight, I guess it's better to use something that supports it, so no conversion whatsoever needed (in either side).

Fact: qs: 6k stars,query-string: 5k. Also, Sindre Sorhus is known to be quite opinionated.

If we choose to change to qs, how about the Python parser? Does it detects the nesting well?

jensens commented 3 years ago

Back to work after a day off and thinking further.

sneridagh commented 3 years ago

@jensens +1 to all.

tiberiuichim commented 3 years ago

Side info, but maybe good to know: we use 'paqo' in the serialized p.a.querystring query for the URL, in search block: https://github.com/plone/volto/blob/master/src/components/manage/Blocks/Search/hocs/withSearch.jsx#L177

robgietema commented 1 year ago

The HTTP spec doesn't say you can't send data with a GET request but it all depends on the frameworks used if it is supported or not. Another issue with this approach is that the requests can't be cached. If you use query string parameters you are able to cache the requests. There are however things to take in to account when using the query string for this;

Complex data formats

JSON is not supported by default as a query string. There is also no "global" standard on how to do this. There are multiple methods of encoding a JSON object to a string, let's take the following query:

  query: [
    {
      i: 'Title',
      o: 'plone.app.querystring.operation.string.is',
      v: 'Welcome to Plone'
    },
    {
      i: 'path',
      o: 'plone.app.querystring.operation.string.path',
      v: '/news'
    }
  ],
  sort_on: 'sortable_title',
  sort_order: 'reverse',
  limit: '10',
  fullobjects: 'False',
  b_start: '0',
  b_size: '2'
};

Which results in the following transforms:

Encode URI: %7B%22query%22%3A%5B%7B%22i%22%3A%22Title%22%2C%22o%22%3A%22plone.app.querystring.operation.string.is%22%2C%22v%22%3A%22Welcome%20to%20Plone%22%7D%2C%7B%22i%22%3A%22path%22%2C%22o%22%3A%22plone.app.querystring.operation.string.path%22%2C%22v%22%3A%22%2Fnews%22%7D%5D%2C%22sort_on%22%3A%22sortable_title%22%2C%22sort_order%22%3A%22reverse%22%2C%22limit%22%3A%2210%22%2C%22fullobjects%22%3A%22False%22%2C%22b_start%22%3A%220%22%2C%22b_size%22%3A%222%22%7D (453)
Rison:      (b_size:'2',b_start:'0',fullobjects:False,limit:'10',query:!((i:Title,o:plone.app.querystring.operation.string.is,v:'Welcome to Plone'),(i:path,o:plone.app.querystring.operation.string.path,v:/news)),sort_on:sortable_title,sort_order:reverse) (242)
O-Rison:    b_size:'2',b_start:'0',fullobjects:False,limit:'10',query:!((i:Title,o:plone.app.querystring.operation.string.is,v:'Welcome to Plone'),(i:path,o:plone.app.querystring.operation.string.path,v:/news)),sort_on:sortable_title,sort_order:reverse (240)
JSURL:      ~(query~(~(i~'Title~o~'plone.app.querystring.operation.string.is~v~'Welcome*20to*20Plone)~(i~'path~o~'plone.app.querystring.operation.string.path~v~'*2fnews))~sort_on~'sortable_title~sort_order~'reverse~limit~'10~fullobjects~'False~b_start~'0~b_size~'2) (253)
QS:         query[0][i]=Title&query[0][o]=plone.app.querystring.operation.string.is&query[0][v]=Welcome to Plone&query[1][i]=path&query[1][o]=plone.app.querystring.operation.string.path&query[1][v]=/news&sort_on=sortable_title&sort_order=reverse&limit=10&fullobjects=False&b_start=0&b_size=2 (279)
URLON:      $query@$i=Title&o=plone.app.querystring.operation.string.is&v=Welcome%20to%20Plone;&$i=path&o=plone.app.querystring.operation.string.path&v=//news;;&sort_on=sortable_title&sort_order=reverse&limit=10&fullobjects=False&b_start=0&b_size=2 (236)

The encode uri method is in my opinion the safest and most straight forward implementation and shouldn't be hard to implement both on the frontend and the backend.

Length

Altough the HTTP spec doesn't limit the length of a querystring all browsers do. Currenty the following limits are in place:

Looking at the above examples even the largest (url encoding) is "only" 453 characters so even if you want to support Internet Explorer almost all requests should work.

sneridagh commented 1 year ago

LGTM, let's discuss this.

Also, FYI, I just remembered that this PR is in place: https://github.com/plone/volto/pull/4159

No clue how they are encoding the state in the URL. I guess we should then sync before merging.

tisto commented 1 year ago

@robgietema thanks for the write-up. We have to take into account the F5 Big IP component:

https://my.f5.com/manage/s/article/K60450445 (default is 2048 bytes)

jensens commented 1 year ago

We could think about shorting strings like plone.app.querystring.operation.string.is.

tiberiuichim commented 1 year ago

@jensens see https://github.com/plone/plone.restapi/issues/1252#issuecomment-957414979

ericof commented 1 year ago

@tisto Should we consider this done?