openstate / open-cultuur-data

The back- and front-end code that powers the Open Cultuur Data API
http://opencultuurdata.nl/

Too many media_urls returned in search #45

Closed: frankstrater closed this issue 5 years ago

frankstrater commented 10 years ago

Some search queries return a large number of duplicate media_urls, leading to an "Allowed memory size exhausted" error in PHP when parsing the JSON response. It might be a caching problem in the ocd_backend. Test scripts to reproduce the bug:

http://strateradvies.nl/ocdsearch/test.php
http://strateradvies.nl/ocdsearch/src_test.php

justinvw commented 10 years ago

With "duplicate media_urls", do you mean that there are items that contain identical media_urls within the same item? Can you provide the id's of some items where this problem occurs?

frankstrater commented 10 years ago

It seems I was mistaken. I didn't expect that a result object could contain multiple different images (not duplicates), each with its own set of resolutions. For example:

http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea

Is this correct? That is a lot of data for one object, and with the default size of 100 objects per search you hit the memory limit pretty quickly.

justinvw commented 10 years ago

You can look up the item in its original form (as returned by the source) by requesting http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/source to get a better idea of what's going on. It looks like the transformation of the item went fine, since the source also shows that there are lots of images associated with the item.

I suspect that there is (or was) something wrong at Nationaal Archief Leiden, since the item no longer exists there (http://hdl.handle.net/10648/37b5754f-5494-4337-a317-3ec5b5ef12cf returns a 404).

Regarding your memory limit problem: to me the response does not seem that big (± 2 MB). Maybe you should bump up PHP's memory limit a bit :wink:.
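
For a quick test, the limit can be raised per invocation on the command line (for web requests the same memory_limit setting can be raised in php.ini):

$ php -d memory_limit=256M test.php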

frankstrater commented 10 years ago

If you think that a JSON response of ± 3 MB for a single object is acceptable, then I wish all users of the RESTful API (not just PHP, think jQuery) happy coding and you can close this issue.

EDIT: I can't imagine any use of an object with 15900 media_urls. Maybe there should be a feature to limit the total number of returned media_urls?

breyten commented 10 years ago

frankstrater commented 10 years ago

This is not an issue about PHP...

justinvw commented 10 years ago

I think the problem here is that we currently can't provide any guarantees about the size of a single item. For example, there may be items with many URLs, huge descriptions, or other long lists of values. In situations where the user's application calls the REST API directly at query time (like @frankstrater's search interface), a potentially large response of multiple megabytes is not desirable.

My suggestion is to add an optional filter to the REST API which allows the user to specify the maximum size (in bytes) of objects that should be included in the result set. This saves us from having to filter out large items at index time, and gives the API user control over the maximum size of the returned response. Additionally, I would like to add an option to specify which fields should be returned.
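
For illustration, a request using both proposed options might look something like this (the max_object_size and fields parameter names are hypothetical, since neither exists yet):

$ curl 'http://api.opencultuurdata.nl/v0/search' -d '{
    "query": "notulen",
    "max_object_size": 65536,
    "fields": ["title", "date", "media_urls"]
}'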

Frank, would this help you?

ajslaghu commented 10 years ago
  1. I'd like to think that retrieving, say, 50 items should be feasible on either channel (mobile / web). But I also take possible compression into account. Is the 2 / 3.5 MB response compressed or uncompressed?
  2. Our memory limit is 128 MB. That seems like plenty to me, and enough for queries of up to 50 items.

justinvw commented 10 years ago

In this case we are talking about a single item that has a size of ± 2 MB. When you happen to issue a query where multiple of these items are returned in a single response, you're easily talking about a response size of multiple megabytes.

Currently all responses are served uncompressed. I will enable GZIP compression.

breyten commented 10 years ago

The problem with specifying a maximum size for objects is that you would still like to have at least one media URL in your response ...

frankstrater commented 10 years ago

@justinvw A way to specify which fields should be returned should be added (think of statistical dashboard apps). I noticed the JSON response is nicely padded, which makes it human-readable but adds a lot of overhead (if I'm not mistaken).

@ajslaghu http://search.opencultuurdata.nl/ uses a size of 42, but it should work if you change that to 18. With 128 MB you exceed the memory limit somewhere between a size of 20 and 30 if you search for 'notulen', for example. As of now, 50 is too much to handle.

justinvw commented 10 years ago

@breyten But on what basis would you select this single URL? Details such as size aren't always present. Also, this problem isn't specific to the media_urls field: other fields can theoretically also contain huge amounts of data.

@frankstrater by default the response content is pretty-printed, but if you send X-Requested-With: XMLHttpRequest along with the request headers, you will receive a compact (unpadded) response.
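
For example, the same item request as above, without pretty-printing:

$ curl -H 'X-Requested-With: XMLHttpRequest' http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea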

I just enabled GZIP compression, which results in some nice reductions of the response size:

$ curl -H "Accept-Encoding: gzip" http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea > test.json.gzip
$ curl http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea > test.json
$ ls -la test.json*
-rw-r--r--  1 justin  staff  3034699 Jun 17 14:43 test.json
-rw-r--r--  1 justin  staff   494143 Jun 17 14:42 test.json.gzip

frankstrater commented 10 years ago

I sort of expected something along the lines of:

"filters": {
    "media_content_type": {
        "terms": ["image/jpeg", "image/png"],
        "count": 10
    }
}

to get the first 10 image media_urls, but I'm not sure if this is feasible.

breyten commented 10 years ago

How about this for a compromise?

Other fields may be large as well, but I think that by limiting the media_urls we solve 90-95% of the issue.

justinvw commented 10 years ago

@breyten, did you check how many items there are in the index that have more than 10 media_urls?

breyten commented 10 years ago

Since scripting is disabled, no ;)

breyten@ks206687:~$ curl -s -XPOST 'http://localhost:9201/ocd_combined_index/_search' -d '{
    "query": {"match_all": {}},
    "filter": {
        "script": {
            "script": "doc[\"media_urls\"].values.length > param1",
            "params": {
                "param1": 10
            }
        }
    },
    "size": 0
}'
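
A client-side workaround would be to page through the index with the scan/scroll API and count locally. A sketch, assuming jq is installed, the Elasticsearch instance is reachable, and the media_urls array is present in _source:

#!/bin/bash
# Count items with more than 10 media_urls client-side,
# since server-side scripting is disabled.
ES='http://localhost:9201'
# Start a scan/scroll over the whole index (ES 1.x style).
SCROLL_ID=$(curl -s "$ES/ocd_combined_index/_search?search_type=scan&scroll=1m&size=100" \
    -d '{"query": {"match_all": {}}}' | jq -r '._scroll_id')
TOTAL=0
while true; do
    PAGE=$(curl -s "$ES/_search/scroll?scroll=1m" -d "$SCROLL_ID")
    HITS=$(echo "$PAGE" | jq '.hits.hits | length')
    [ "$HITS" -eq 0 ] && break
    # Count the hits on this page whose media_urls list is longer than 10.
    COUNT=$(echo "$PAGE" | jq '[.hits.hits[]._source.media_urls // [] | select(length > 10)] | length')
    TOTAL=$((TOTAL + COUNT))
    SCROLL_ID=$(echo "$PAGE" | jq -r '._scroll_id')
done
echo "Items with more than 10 media_urls: $TOTAL"
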
melvyn-sopacua commented 10 years ago

Isn't it simpler to decouple potentially long lists into a separate endpoint? It is very natural for the specific case here, and perhaps for others as well (the classic author/books). Instead of specifying limits for entities, which the application then has to assert, one can use paging with a sane default page size and references: for search results, the media list of a result object would return a reference to its canonical endpoint. This makes the API forward compatible, as the application only has to be taught how to follow references and how to request previous or next pages.
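
As a sketch of this idea (the endpoint shape and field names below are hypothetical, and the example image URL is a placeholder), a search hit would carry a reference instead of the embedded list:

{
    "id": "1bc1ac4c800047243c27179c7620ba27cb9521ea",
    "media_urls": "http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/media_urls"
}

and the referenced endpoint would return one page at a time, with links to the neighbouring pages:

$ curl 'http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/media_urls?page=2&page_size=25'
{
    "total": 15900,
    "page": 2,
    "page_size": 25,
    "previous": "http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/media_urls?page=1&page_size=25",
    "next": "http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/media_urls?page=3&page_size=25",
    "results": [
        {"url": "http://example.com/image-26.jpg", "content_type": "image/jpeg"}
    ]
}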