With "duplicate media_urls", do you mean that there are items that contain identical media_urls
within the same item? Can you provide the id
's of some items where this problem occurs?
It seems I was mistaken. I didn't expect that a result object could contain multiple different images (not duplicates), each with its own set of resolutions. For example:
Is this correct? Because that's a lot of data for one object, and with a default size of 100 objects per search you hit the memory limit pretty quickly.
You can look up the item in its original form (as returned by the source) by requesting http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/source to get a better idea of what's going on. It looks like the transformation of the item went fine, since the source also shows that there are lots of images associated with the item.
I suspect that there is (or was) something wrong at Nationaal Archief Leiden, since the item no longer exists there (http://hdl.handle.net/10648/37b5754f-5494-4337-a317-3ec5b5ef12cf returns a 404).
Regarding your memory limit problem: to me the response doesn't seem that big (± 2 MB). Maybe you should bump up PHP's memory limit a bit :wink:.
If you think that a JSON response of ± 3 MB for a single object is acceptable, then I wish all users of the RESTful API (not just PHP, think jQuery) happy coding and you can close this issue.
EDIT: I can't imagine any use for an object with 15900 media_urls. Maybe there should be a feature to limit the total number of returned media_urls?
This is not an issue about PHP...
I think the problem here is that we currently can't provide any guarantees about the size of a single item. For example, there may be items with many URLs, huge descriptions and long lists of items. In situations where the user's application calls the REST API directly at query time (like @frankstrater's search interface), a potentially large response of multiple megabytes is not desirable.
My suggestion is to add an optional filter to the REST API that allows the user to specify the maximum size in bytes of the objects that should be included in the result set. This saves us from having to filter out large items at index time, and gives the API user control over the maximum size of the returned response. Additionally, I would like to add the option to specify which fields should be returned.
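As a rough sketch, such a request body could look something like this (the max_object_size and fields parameter names are purely illustrative; nothing like this exists yet):

{
  "query": "notulen",
  "max_object_size": 1048576,
  "fields": ["title", "media_urls"]
}

Here max_object_size would exclude any object larger than 1 MB from the result set, and fields would restrict the returned objects to the listed fields.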
Frank, would this help you?
In this case we are talking about a single item with a size of ± 2 MB. When you happen to issue a query where multiple of these items are returned in a single response, you quickly end up with a response of many megabytes.
Currently all responses are served uncompressed. I will enable GZIP compression.
The problem with specifying a maximum size for objects is that you would like to have at least one media url in your response ...
@justinvw A way to specify which fields should be returned should be added (think of statistical dashboard apps). I also noticed the JSON response is nicely padded, which makes it human-readable but adds a lot of overhead (if I'm not mistaken).
@ajslaghu http://search.opencultuurdata.nl/ uses 42 as the size, but it should work if you change that to 18. With 128 MB you exceed the memory limit somewhere between 20 and 30 results if you search for 'notulen', for example. As of now, 50 is too much to handle.
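For anyone reproducing this against the API directly, the request boils down to something like the following (a sketch; I'm assuming here that the /v0/search endpoint accepts the same query and size parameters in its request body):

curl -XPOST 'http://api.opencultuurdata.nl/v0/search' -d '{
  "query": "notulen",
  "size": 30
}'

Decoding a response of that size in PHP is where the 128 MB limit gets hit.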
@breyten But on what basis do you want to select this single URL? Details such as size aren't always present. Also, this problem isn't specific to the media_urls field: other fields can theoretically also contain huge amounts of data.
@frankstrater by default the response content is pretty-printed, but if you send X-Requested-With: XMLHttpRequest along with the headers, you will receive an unpadded response.
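With curl, for example:

curl -H "X-Requested-With: XMLHttpRequest" http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea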
I just enabled GZIP compression, which results in some nice reductions of the response size:
$ curl -H "Accept-Encoding: gzip" http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea > test.json.gzip
$ curl http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea > test.json
$ ls -la test.json*
-rw-r--r-- 1 justin staff 3034699 Jun 17 14:43 test.json
-rw-r--r-- 1 justin staff 494143 Jun 17 14:42 test.json.gzip
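That's roughly a 6× reduction (3,034,699 bytes uncompressed vs. 494,143 bytes gzipped).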
I sort of expected something along the lines of:
"filters": {
"media_content_type": {
"terms": ["image/jpeg", "image/png"],
"count": 10
}
}
to get the first 10 image media URLs, but I'm not sure if this is feasible.
How about this for a compromise?
For search requests, only the first 10 image URLs are returned (as retrieved from Elasticsearch). Other fields may be large as well, but I think that by limiting the media URLs we solve 90-95% of the issue.
@breyten, did you check how many items there are in the index that have more than 10 media_urls?
Since scripting is disabled, no ;)
breyten@ks206687:~$ curl -s -XPOST 'http://localhost:9201/ocd_combined_index/_search' -d '{
  "query": {"match_all": {}},
  "filter": {
    "script": {
      "script": "doc[\"media_urls\"].values.length > param1",
      "params": {
        "param1": 10
      }
    }
  },
  "size": 0
}'
Isn't it simpler to decouple possibly long lists into a separate endpoint? It is very natural for the specific case here, and perhaps for others as well (the classic author/books). Instead of specifying limits for entities that the application then has to assert, one can use paging with a sane default page size and references: for search results, the media list of a result object would be returned as a reference to its canonical endpoint. This also makes the API forwards compatible, as the application only has to be taught how to follow references and how to request previous or next pages.
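Purely as an illustration (the /media_urls sub-resource and the href, total, from and size names below are hypothetical, not part of the current API), a search result could then carry a reference instead of the full list:

{
  "title": "...",
  "media_urls": {
    "href": "http://api.opencultuurdata.nl/v0/nationaal_archief_beeldbank/1bc1ac4c800047243c27179c7620ba27cb9521ea/media_urls",
    "total": 15900
  }
}

The referenced endpoint would then return one page of URLs at a time (e.g. .../media_urls?from=0&size=25), together with links to the previous and next pages.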
Some search queries return a lot of duplicate media_urls, leading to an "Allowed memory size exhausted" error when parsing the JSON response. It might be a caching problem in the ocd_backend. Test scripts to reproduce the bug:
http://strateradvies.nl/ocdsearch/test.php http://strateradvies.nl/ocdsearch/src_test.php