klokantech / embedr

Embedr.eu - Image Embedding Service (IES) with support for IIIF, OEmbed, zoomable viewer in an iFrame
http://embedr.eu/
European Union Public License 1.1

All processes running? #75

Closed mzeinstra closed 9 years ago

mzeinstra commented 9 years ago

I've done a large ingestion on Thursday, as you can see in AWS CloudWatch:

(screenshot: AWS CloudWatch, 2015-11-08 18:01)

My records say that I've uploaded metadata for about 360k records. So far only 50k records have been ingested into CloudSearch (searchable documents), and the processing has been reasonably quiet since.

Have the ingestion processes stopped?

o1da commented 9 years ago

All processes have been running, but the ingestion wasn't doing anything - it was waiting for Redis. The Redis server couldn't handle the high load. After the Redis client buffers (for the ingest processes) overflowed, the Redis server stopped communicating with them. There were error messages in the log:

1:M 06 Nov 08:26:02.044 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
1:M 06 Nov 08:27:40.059 # Client id=17 addr=172.17.0.8:42091 fd=13 name= age=4151011 idle=1 flags=N db=0 sub=0 psub=1 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=1105 omem=29228352 events=rw cmd=psubscribe scheduled to be closed ASAP for overcoming of output buffer limits.

It also had problems with rewriting the persistence AOF file - it is about 1 GB now, and with that much data the rewrite can be slow, especially when the server is under high load. But the main problem was the client output buffer. I have doubled it for now with config set client-output-buffer-limit "normal 0 0 0 slave 0 0 0 pubsub 67108864 33554432 60" (it was "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60").
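
For reference, the same change can be applied from Python with the redis-py client - a minimal sketch, assuming a local Redis on the default port (the values mirror the CONFIG SET above; the change still has to go into redis.conf to survive a restart):

import redis  # redis-py client, assumed to be available

r = redis.StrictRedis(host='localhost', port=6379)
# Raise the pubsub client output buffer limits to 64 MB hard / 32 MB soft
# over 60 s - the same values as the CONFIG SET command quoted above.
r.config_set('client-output-buffer-limit',
             'normal 0 0 0 slave 0 0 0 pubsub 67108864 33554432 60')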

I have restarted ingestion now and it is working from the point where it stopped before.

Please don't send another ~40 batches while there are still pending batches. You have to take into account that batch processing, ingestion and the media web page all run on the same, relatively underpowered machine. Every subprocess needs Redis, which holds the already ingested items as well as all pending batches. Batches are only purged from Redis after they are completely done - so the amount of data in Redis can really grow if tens of new batches are started, and that would hurt performance a lot.

I had some suggestions related to these performance problems in #55, and they come from #38.

mzeinstra commented 9 years ago

Thanks for the insight. I mistakenly believed that the batches were just stored until they could be processed. I won't upload more batches in this manner.

However, it seems Redis crashed again after about 5 hours.

o1da commented 9 years ago

Yes, Redis was overwhelmed again; I had to restart it half an hour ago. I'm afraid we will have problems until most of the batches are done, because there are simply too many pending ingestion requests in Redis - Celery constantly checks them, and that generates more traffic than Redis can handle, especially over a longer period of time.

I'm going to keep checking it today. Maybe I could try a different configuration for Redis, but I'm not sure it would help us, because I spent a lot of time tweaking Redis during #38 - and I don't know of anything I could set better :(

mzeinstra commented 9 years ago

I'm sorry about this.

We could also remove the bulk of the still-in-progress batches. I still have them and can upload them separately.

klokan commented 9 years ago

@o1da please start a separate machine to do only the ingest tasks - this was confirmed by @mzeinstra. It will run permanently, until we stop it manually again at @mzeinstra's request.

To summarise the problem here:

We definitely should not run the ingest tasks on the same machine in the production system, and this was never intended for use after the production launch. The existing setup was optimised to save costs while only a few small ingests were done on Embedr during testing.

When running the large ingest tests in the past (under #38), we manually launched several separate ingest machines temporarily. Now we will start a single ingest machine permanently - to prevent overloading the web/db/api machine. Ingesting 300k+ files will take a while with only a single dedicated machine, but the issue reported here should not happen again. This ticket can then be closed.

For production launch (where multiple people can ingest) it is highly recommended to configure #55 Autoscaling for Ingest - same as we have autoscaling for serving the IIIF now...

o1da commented 9 years ago

I have just started an ingest slave on one c4.xlarge instance and have stopped ingestion on the media.embedr.eu server. The ingest slave now has almost 100% CPU utilization all the time, and media.embedr.eu does almost nothing.

It is OK now; I will check it again later and close this ticket.

o1da commented 9 years ago

We have approx. 215k ingested images now and it works smoothly. But I had to solve another issue with the c4.xlarge instance. This instance has a much more powerful CPU than the instances used in #38, so a problem with disk I/O speed appeared. The disk on the ingest instance is used as temporary storage for the downloaded original images and for the ones being processed.

(graph: CPU utilization and disk I/O queue length on the ingest instance)

There is clear evidence of the impact of slow disk I/O (the I/O queue is long) on CPU utilization. The CPU has to wait for the disk; even a simple ls command was extremely slow in this situation (tens of seconds). It surprised me, because c4 instances have optimised EBS disk performance (and I had it enabled). Maybe Provisioned IOPS SSD was needed instead of General Purpose SSD, but that can be expensive. So I switched the ingest slave to a c3.xlarge instance, which is slightly less powerful and cheaper but has a local SSD instead of EBS. The performance improvement was huge.

So it is solved now.

mzeinstra commented 9 years ago

It has been quiet in the past two hours. Is there a problem with the ingestion?

(screenshot, 2015-11-11 11:35)

Above: the number of searchable documents in CloudSearch. Below: CPU and DataIn of the ingestion server.

o1da commented 9 years ago

From the Redis log:

1:M 11 Nov 09:42:15.101 # Client id=123 addr=172.31.25.153:35701 fd=13 name= age=152881 idle=1 flags=N db=0 sub=0 psub=1 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=520 omem=13001122 events=rw cmd=psubscribe scheduled to be closed ASAP for overcoming of output buffer limits.

I restarted Redis with smaller client buffers yesterday - I thought the larger buffers wouldn't be needed once we moved ingestion to another server, but it seems they are. So I have increased them permanently.

mzeinstra commented 9 years ago

(screenshot, 2015-11-11 15:20)

I wanted to know how many records are remaining and how many records produced other errors, so I have a script that keeps track of the batch numbers that are not 100% done. It requests a batch, counts the results and continues to the next batch. This has become unresponsive. I think we crashed Redis again.

klokan commented 9 years ago

On other production projects we use Amazon SQS for the task management queue.

It would be relatively easy to move the processing to an Amazon SQS queue. This way we would get graphs of the remaining tasks directly in CloudWatch (showing the status of encoding, how many tasks are still remaining, etc.), without needing to run query scripts against the API (which would still be possible as well). Even more importantly, the autoscaling in #55 could easily run based on the status of the queue - and even stop all ingest machines when there are no tasks to do.

Redis is designed to run as a cluster - splitting the load across more nodes and syncing the database between them. But we have only one node now. For task management, SQS would be a very good (and unbreakable) alternative.
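
As a rough illustration, pushing ingest tasks to SQS and reading the queue depth could look like the sketch below with boto (the library already used for CloudSearch); the region, queue name and message format are assumptions, not part of the existing code:

import json

import boto.sqs
from boto.sqs.message import Message

# Connect and create (or look up) the task queue - name and region are hypothetical.
conn = boto.sqs.connect_to_region('eu-west-1')
queue = conn.create_queue('embedr-ingest-tasks')

# Enqueue one ingest task.
msg = Message()
msg.set_body(json.dumps({'item_id': 'example_item', 'url': 'http://example.org/image.jpg'}))
queue.write(msg)

# The number CloudWatch graphs and that autoscaling (#55) could react to.
attrs = queue.get_attributes('ApproximateNumberOfMessages')
print(attrs['ApproximateNumberOfMessages'])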

mzeinstra commented 9 years ago

Why did we choose Redis over your proposed scalable solution? Is SQS new?

klokan commented 9 years ago

The requirements had a strict preference for open-source projects over online services - to be independent of Amazon, and to be able to move to a different cloud infrastructure later on.

But it seems that Embedr will continue to run on Amazon - and there is already a dependency on the fulltext search and on the cluster auto-scaling capabilities anyway... A switch to SQS would make sense in the current setup, especially if Amazon auto-scaling of the ingest is in place too.

Moving away from Amazon in the future would of course still be possible, by replacing Amazon CloudSearch with a custom ElasticSearch instance (which, BTW, should run on multiple nodes in a production system as well) and going back to the Celery/Redis combination instead of Amazon SQS.

o1da commented 9 years ago

Redis was OK - everything was working. But I have found a bug in the batch state response - it was very slow under some conditions. I'm going to fix it today - please don't update batches, I will let you know.

o1da commented 9 years ago

It is fixed via an sqlite optimization - the SQL queries for the batch state were too slow.

Please keep in mind that the first query can take tens of seconds, or even a few minutes, if the batch has tens of thousands of items in it. Subsequent queries for the same batch ID are fast. This happens because during the first query the info about completed ingestion tasks is moved from Redis to sqlite.
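
A minimal sketch of the pattern described above (the function, key and table names are hypothetical, not the actual hawk/app code):

import sqlite3
import redis

def batch_state(batch_id, db_path='batches.sqlite'):
    r = redis.StrictRedis()
    db = sqlite3.connect(db_path)
    db.execute('CREATE TABLE IF NOT EXISTS done (batch_id TEXT, item_id TEXT)')
    # An index of this kind is what keeps the repeated state queries cheap.
    db.execute('CREATE INDEX IF NOT EXISTS done_batch_idx ON done (batch_id)')
    # First query for a batch: drain the completed tasks out of Redis into sqlite.
    # With tens of thousands of items this loop is the slow part.
    for key in r.scan_iter('done:%s:*' % batch_id):
        db.execute('INSERT INTO done VALUES (?, ?)', (batch_id, key.split(':')[-1]))
        r.delete(key)
    db.commit()
    # Later queries for the same batch ID skip most of the loop and are fast.
    return db.execute('SELECT COUNT(*) FROM done WHERE batch_id = ?', (batch_id,)).fetchone()[0]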

mzeinstra commented 9 years ago

Are you saying that ingestion has finished and that I should expect around 40k download errors, as I don't see any activity on the ingestion server:

(screenshot, 2015-11-12 16:17)

mzeinstra commented 9 years ago

Just ran the scripts; yes, the remaining 44k have errors:

involving batch IDs: [22, 32, 34, 35, 37, 38, 41, 43, 45, 46, 48, 49, 50, 52, 55, 57, 58, 59, 61, 62, 65, 66, 67, 68, 69, 70]

What does 'another error' mean, and can we do something about the 'Interaction with Cloud Search failed' error?

o1da commented 9 years ago

'Another error' can be any error other than 'download failed' and 'Interaction with Cloud Search failed'. The only way to explore it further is to read the logs on the server.

If I look at 'Interaction with Cloud Search failed' and search the logs, I usually get this info:

[2015-11-11 12:33:05,294: WARNING/Worker-80] Failed Cloud Search attempt numb.: 9
Item: europeana___2021641__publiek_detail_aspx_xmldescid_55229593
Error message:
###
Traceback (most recent call last):
  File "/usr/local/src/hawk/app/ingest.py", line 245, in finalizeItem
    cloudsearch = getCloudSearch(CLOUDSEARCH_ITEM_DOMAIN, 'document')
  File "/usr/local/src/hawk/app/helper.py", line 62, in getCloudSearch
    return boto.connect_cloudsearch2(region=CLOUDSEARCH_REGION, sign_request=True).lookup(domain).get_document_service()
  File "/usr/local/lib/python2.7/dist-packages/boto/cloudsearch2/layer2.py", line 92, in lookup
    domains = self.list_domains(domain_names=[domain_name])
  File "/usr/local/lib/python2.7/dist-packages/boto/cloudsearch2/layer2.py", line 64, in list_domains
    domain_data = self.layer1.describe_domains(domain_names)
  File "/usr/local/lib/python2.7/dist-packages/boto/cloudsearch2/layer1.py", line 451, in describe_domains
    path='/', params=params)
  File "/usr/local/lib/python2.7/dist-packages/boto/cloudsearch2/layer1.py", line 783, in _make_request
    body=json_body)
JSONResponseError: JSONResponseError: 400 Bad Request
{u'RequestId': u'595930ba-8870-11e5-baf1-5fad4d111d4d', u'Error': {u'Message': u'Rate exceeded', u'Code': u'Throttling', u'Type': u'Sender'}}
###

It means that we are sending requests to CloudSearch above its rate limits in some time period. This can occur during fast, intensive ingesting - when many items are processed in parallel. We have configured a maximum of 10 attempts to push to CloudSearch before the whole item is marked as 'error'. We can easily increase this number so it tries to send the info to CloudSearch more times. I have also implemented an increasing random delay between these attempts.
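
The retry logic described here amounts to something like the following sketch (constants and helper names are illustrative, not the actual ingest.py code):

import random
import time

MAX_CLOUDSEARCH_ATTEMPTS = 10  # the configured maximum mentioned above

def push_with_backoff(push, item):
    for attempt in range(1, MAX_CLOUDSEARCH_ATTEMPTS + 1):
        try:
            return push(item)
        except Exception:
            if attempt == MAX_CLOUDSEARCH_ATTEMPTS:
                raise  # caller marks the whole item as 'error'
            # The delay grows with the attempt number, plus random jitter so
            # that parallel workers don't retry against CloudSearch in lockstep.
            time.sleep(attempt + random.uniform(0, attempt))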

mzeinstra commented 9 years ago

So are these batches still pending ingestion or have they failed permanently?

Is now the time to add some new batches, and a batch that removes a collection we found not useful for the service, or should I wait for this to finish?

o1da commented 9 years ago

Yes, they are finished. You can send new batches, and probably create a batch or two with the failed items and try to ingest them again - especially the items with 'Interaction with Cloud Search failed', because they were processed correctly except for the final push to Cloud Search.

Should I increase the number of attempts before new ingestion?

mzeinstra commented 9 years ago

Let's keep it like this for now; let's see if this problem has been solved by the other fixes during the past week.

To make sure I do this right: if I send all the batches with these problems in again, will that re-ingest the images that were successful, or only replace their metadata?

o1da commented 9 years ago

If the item is already in the DB, the image is processed only if the URL has changed, so it should only replace the metadata.

But if you send everything again, I'm pretty sure we will receive Cloud Search errors again, because the metadata is rewritten there too, and without image ingestion the items are processed really fast, which exhausts the Cloud Search limits even faster. It would also have other nasty consequences - if the push to Cloud Search finally fails (after the configured number of attempts), the item is purged from S3 and the local DB too - we store items either in all of the services or nowhere.

So sending all the batches again can lead to more problems. I suggest you modify your script that checks the failed items in a batch so that it builds a new batch containing only these uningested items.

klokan commented 9 years ago

We have discussed this now - and we are going to improve this process. Ingesting items whose metadata and URL exactly match what is already in Redis should not trigger Cloud Search updates...
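
A minimal sketch of such a skip check, assuming a fingerprint of the stored metadata and URL is kept per item (names and storage layout are hypothetical, not the actual hawk/app code):

import hashlib
import json

import redis

def needs_reingest(item_id, metadata, url, r=None):
    r = r or redis.StrictRedis()
    # Hash the incoming metadata + URL and compare against the value stored
    # at the item's last successful ingest.
    fingerprint = hashlib.sha1(json.dumps(metadata, sort_keys=True) + url).hexdigest()
    stored = r.get('item_fingerprint:%s' % item_id)
    if stored == fingerprint:
        return False  # unchanged item: no image work, no Cloud Search update
    r.set('item_fingerprint:%s' % item_id, fingerprint)
    return True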

@mzeinstra - please wait until @o1da confirms this has been fixed - and then you should be able to resubmit the same batches again. Items already ingested will be skipped - and only the previously failed items will be processed.

o1da commented 9 years ago

It is done; @mzeinstra can send batches again.

mzeinstra commented 8 years ago

Thanks, I'll reupload these batches.

mzeinstra commented 8 years ago

@o1da I did not re-upload the batches, as that would be 250 MB of data to process, of which 210 MB would be data that did not need to be re-uploaded. Instead I created a single file (batch number 74) that contains the remaining items that had errors.

However, I do not see any ingestion process activity on the ingestion slave. Is it down?

o1da commented 8 years ago

I have just checked the ingestion process and your last batches are done. It seems that most of the items were added - I can't see any 'Interaction with Cloud Search failed' errors.

Can you run your script to get the count of each specific error?

If errors connected with download or upload show up, you should try sending batches 73 and 74 again.

With the latest changes you can send these batches in their original state - already ingested items without any modifications don't trigger any "expensive" actions.

mzeinstra commented 8 years ago

Looks good, still some errors. I am unsure where the errors reside - on our side or on the providers' side?

494 'another error'
3108 'download failed'
2 'Image processing failed'
2323 'Interaction with S3 failed'

Do you see something that we need to fix?

o1da commented 8 years ago

'Download failed' is usually triggered by a urllib2 timeout, so it seems that this problem is on the image provider's side or on the network. 'Interaction with S3 failed' seems to be network-related too.

Can you resend these batches? I would like to analyze the log output more deeply.

mzeinstra commented 8 years ago

Just sent it in (batch number 76); it seems to be processing nicely so far: 8101 ok, 33018 pending, 0 errors

but most of those pending are already in the system

o1da commented 8 years ago

Check item europeana_2021657161703; there is some problem on the image provider's server: http://cc.museon.nl/imageproxy.aspx?server=localhost&port=17512&filename=images/geschiedenis%5C/161703.jpg

Our log:

 [2015-11-19 18:03:46,752: WARNING/Worker-262] Failed attempt numb.: 1
Item: europeana___2021657__161703
Url: http://cc.museon.nl/imageproxy.aspx?server=localhost&port=17512&filename=images/geschiedenis%5C/161703.jpg
Error message:
###
Traceback (most recent call last):
  File "/usr/local/src/hawk/app/ingest.py", line 100, in ingestQueue
    r = urllib2.urlopen(task.url, timeout=URL_OPEN_TIMEOUT)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 407, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 520, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 445, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
###

Or check another error:

[2015-11-19 18:09:12,563: WARNING/Worker-266] Failed attempt numb.: 1
Item: europeana___9200290__BibliographicResource_3000073526441
Url: http://www.bildarchivaustria.at/Preview/16202516.jpg
Error message:
###
Traceback (most recent call last):
  File "/usr/local/src/hawk/app/ingest.py", line 100, in ingestQueue
    r = urllib2.urlopen(task.url, timeout=URL_OPEN_TIMEOUT)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 407, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 520, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 445, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
###

Another:

[2015-11-19 18:09:13,208: WARNING/Worker-265] Failed attempt numb.: 1
Item: europeana___9200290__BibliographicResource_3000073526439
Url: http://www.bildarchivaustria.at/Preview/16202397.jpg
Error message:
###
Traceback (most recent call last):
  File "/usr/local/src/hawk/app/ingest.py", line 100, in ingestQueue
    r = urllib2.urlopen(task.url, timeout=URL_OPEN_TIMEOUT)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 401, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 419, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
timeout: timed out
###

These errors aren't on the ingestion process side. They should be taken up with http://www.bildarchivaustria.at and http://cc.museon.nl - they may have e.g. traffic restrictions in place.

klokan commented 8 years ago

Idea for a future improvement: the complete logs could be shipped to CloudWatch Logs - so we would get fulltext search over the last 3 days and a perfect overview of what is happening in the cloud, without logging in over SSH. This means @mzeinstra could look himself via the AWS console - and alerts and metrics could be derived from the logs (duration of download / encoding / ingest, error alerts, etc.). We do that on the recent Airbus satellite world rendering project and on the vector tile cloud rendering of OpenStreetMap as well. It will become very important if we run more machines for ingestion (#55).

mzeinstra commented 8 years ago

@klokan good idea. I'll make a new improvement ticket out of it.

@o1da I understand that some sources do not have proper documents behind them. Frankly, I am surprised so many worked in the first place. When we get a 'Failed to Download' error, I understand that this is something I need to pick up. I am wondering whether our infrastructure still has a problem with the 'Interaction with S3 failed' error, or is that also external?

o1da commented 8 years ago

I have just looked at the problems with 'Interaction with S3 failed'. I checked a few items with this error and all of them are JPG images. They can be successfully converted to TIF, but the conversion to JPEG2000 is failing. However, it fails without an error status code; it only prints this info:

Error in Kakadu File Format Support:
Did not find a valid curv' data type in the embedded ICC profile's tone
reproduction curve tag.

The process then continues to the S3 upload, which fails with an error code (and this is the error which gets saved), because a file with 0 B size can't be uploaded.
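
A sketch of a guard against this failure mode (paths, flags and helper names are assumptions, not the actual ingest code): fail the conversion step itself when the JPEG2000 output comes out empty, instead of letting the S3 upload be the first thing that errors.

import os
import subprocess

def convert_to_jp2(tif_path, jp2_path):
    # kdu_compress exits with status 0 here even though it only prints the
    # ICC 'curv' warning, so the exit code alone isn't enough.
    subprocess.call(['kdu_compress', '-i', tif_path, '-o', jp2_path])
    if not os.path.exists(jp2_path) or os.path.getsize(jp2_path) == 0:
        raise RuntimeError('JPEG2000 conversion produced an empty file: %s' % jp2_path)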

I'm going to create a new issue with this problem.