Ingest produces error without message

mzeinstra commented 9 years ago

I've been testing the limits of the ingestion API with the largest image I could find online (~220 MB): https://commons.wikimedia.org/wiki/File:The_Garden_of_Earthly_Delights_by_Bosch_High_Resolution.jpg

I've put that into a batch:

[
    {
        "id": "Museo_Nacional_del_Prado_P02823",
        "institution": "Museo Nacional del Prado, Madrid",
        "institution_link": "https://www.museodelprado.es/en/the-collection/online-gallery/on-line-gallery/obra/the-garden-of-earthly-delights/",
        "url": [
            "https://upload.wikimedia.org/wikipedia/commons/6/6d/The_Garden_of_Earthly_Delights_by_Bosch_High_Resolution.jpg"
        ],
        "license": "http://creativecommons.org/publicdomain/zero/1.0/",
        "source": "https://www.museodelprado.es/en/the-collection/online-gallery/on-line-gallery/obra/the-garden-of-earthly-delights/",
        "title": "The Garden of Earthly Delights",
        "creator": "Hieronymus Bosch",
        "description": "The open triptych shows three scenes. The left panel is dedicated to Paradise, with the creation of Eve and the fountain of life, while the right panel shows hell. The central panel gives its name to the entire piece, representing a garden of life’s delights or pleasures. Between paradise and hell, these delights are nothing more than allusions to sin, showing humankind dedicated to diverse worldly pleasures. There are clear and strongly erotic representations of lust, along with others, whose meanings are more enigmatic. The fleeting beauty of flowers and the sweetness of fruit transmit a message of fragility and the ephemeral character of happiness and enjoyment. This seems to be corroborated by certain groups, such as the couple enclosed in a crystal ball on the left, which probably alludes to the popular Flemish saying: \"happiness is like glass, it soon breaks\"."
    }
]

This produces an error after ~ 30 minutes:

[
  {
    "status": "error",
    "id": "Museo_Nacional_del_Prado_P02823"
  }
]

According to the wiki I should get a message that details the error: https://github.com/klokantech/hawk/wiki/C.Ingest ("error": The image can not be downloaded or transcoded. The "message" field will specify details.)

o1da commented 9 years ago

This info on the wiki is inaccurate. It dates from the very start of the project and I forgot to update it. We haven't discussed the possibility of showing exact error info on the /ingest page. The error detail isn't stored in the database; it is available via docker-compose logs if needed.

It took approx. 30 minutes because each image is tried 5 times (in the default configuration) before the error is written, with a random wait of a few minutes between attempts.
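For readers who want to picture the retry behaviour described above, here is a minimal sketch in Python. It is not the actual hawk code: the function name `ingest_with_retries`, the `process` callable and the 60 to 300 second wait range are assumptions for illustration; only the "5 attempts, random wait of a few minutes in between" part comes from this thread.

```python
import random
import time


def ingest_with_retries(process, max_attempts=5, min_wait=60.0, max_wait=300.0,
                        sleep=time.sleep):
    """Call process() up to max_attempts times.

    Returns (True, None) on the first success, or (False, last_error_text)
    once every attempt has failed. The wait bounds are hypothetical values.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            process()
            return True, None
        except Exception as exc:
            last_error = str(exc)
            # Wait a random time before the next attempt, but not after the last one.
            if attempt < max_attempts:
                sleep(random.uniform(min_wait, max_wait))
    return False, last_error
```

With five attempts there are four waits, so waits of a few minutes each plus the time spent downloading and converting a ~220 MB file on every attempt would plausibly add up to about half an hour.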

The log says that this big image can't be properly identified and converted. I'm going to test it locally to see whether the image is, e.g., corrupted.

o1da commented 9 years ago

Oh I see, your link isn't a link to the image; it is a link to an HTML page with a thumbnail and some info. So the HTML page was downloaded and couldn't be identified as an image. The correct link for the huge image is https://upload.wikimedia.org/wikipedia/commons/6/6d/The_Garden_of_Earthly_Delights_by_Bosch_High_Resolution.jpg

mzeinstra commented 9 years ago

Isn't that the link I provided?

        "url": [
            "https://upload.wikimedia.org/wikipedia/commons/6/6d/The_Garden_of_Earthly_Delights_by_Bosch_High_Resolution.jpg"
        ],

o1da commented 9 years ago

I'm sorry, I was looking at a different one. I'm going to try it.

mzeinstra commented 9 years ago

Regardless, what are we going to do about the error message? If it is in the original plan, I would like the API to output it as described in the wiki.

I don't think I have access to the docker-compose logs.

o1da commented 9 years ago

These logs are available on the media.embedr.eu server itself. The server can be accessed directly via ssh with an AWS-generated keypair registered with the EC2 instance. I can set up other ssh approaches, but the one with a certificate is the most secure.

We should discuss better error message display with @klokan

I have just found the problem with the huge image too. The conversion from jpg to tif (needed before compression to jp2) keeps getting killed because it runs out of memory. It tries to make a 1.5 GB tif from the 233 MB jpg and needs more than 4 GB of memory (which media.embedr.eu has, but it is shared across the whole server). Ingesting this file works without problems on my local computer with 16 GB of memory.

So if we need to process images this big, we have to buy a more powerful (and more expensive) EC2 instance.
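To see why a 233 MB jpg can turn into a roughly 1.5 GB tif, a rough estimate helps: an uncompressed image stores every sample of every pixel. The sketch below is only arithmetic; the pixel dimensions are a hypothetical example chosen to land near 1.5 GB, not measured values for this file, and the helper name is made up.

```python
def uncompressed_size_bytes(width, height, channels=3, bytes_per_sample=1):
    """Size of the raw pixel data for an image, ignoring file headers."""
    return width * height * channels * bytes_per_sample


# Hypothetical dimensions for a very large 8-bit RGB scan (assumption).
width, height = 30000, 17000
size = uncompressed_size_bytes(width, height)
print(f"{size / 1e9:.2f} GB")  # prints "1.53 GB"
```

A converter typically needs at least one decoded copy of the image in memory, and often working buffers on top of that, which would be consistent with the process needing more than 4 GB here.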

klokan commented 9 years ago

This will be implemented (probably in the first half of next week).

It will run for newly ingested batches. Probably only two error messages will be present:

"Download failed"
"Encoding into JPEG2000 failed"

klokan commented 9 years ago

Aha. Looking back...

There is again a complication because of the introduced sequences: each image in the sequence may have a different error message...

This is the reason why we removed this from the code. For sequences, should the message be a list of all "error" messages? I expect no message field in case of success.
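One possible shape for the "list of all error messages" idea, as a minimal Python sketch. This is not the implemented API: the helper name `build_item_result`, the `"ok"` status value and the list-valued `"message"` field are assumptions; only the two message strings and the "no message field on success" rule come from this thread.

```python
DOWNLOAD_FAILED = "Download failed"
ENCODING_FAILED = "Encoding into JPEG2000 failed"


def build_item_result(item_id, image_errors):
    """Build the status entry for one item.

    image_errors holds one entry per image in the sequence:
    an error message string, or None if that image succeeded.
    """
    messages = [err for err in image_errors if err is not None]
    if not messages:
        # Success: no "message" field at all ("ok" is an assumed status value).
        return {"id": item_id, "status": "ok"}
    return {"id": item_id, "status": "error", "message": messages}
```

For a three-image sequence where the second image fails to encode, `build_item_result("seq1", [None, ENCODING_FAILED, None])` gives `{"id": "seq1", "status": "error", "message": ["Encoding into JPEG2000 failed"]}`.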

@mzeinstra please comment on whether this really brings value to the scripts that will use the API, and whether we should implement it. @mzeinstra please also comment on whether you want a larger EC2 instance for media.embedr.eu given the existing situation.

Also, we asked you to deliver samples with large and varied expected images in ticket #20 back in April... If similar samples had been provided at that time, we would have developed against them during the BETA phase. Now that the project has been switched to production, any change is slightly more complicated...

klokan commented 9 years ago

@mzeinstra please comment.

mzeinstra commented 9 years ago

On 1. It does bring value to the API to know why something failed. For example, in the case above we could not tell whether encoding or downloading had failed.

On 2. No, I do not require a larger EC2 instance to solve this issue with large images. I was pushing the limits of the service and expected to hit a ceiling somewhere. It is good to know that there is a ceiling. If one of the institutions we are ingesting has images like this, it will become a new problem. I don't expect that to happen this year, as I really had to search for an image this large.

I know this can be frustrating, as we could have avoided this back in April, but at that time we didn't have access to these kinds of records.
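As an illustration of point 1, here is a minimal sketch of how a client script might group failed items by reason once a "message" field is present. The field may turn out to be a string or a list (neither is settled in this thread), so the sketch accepts both; the function name and the "unknown" fallback label are assumptions.

```python
from collections import defaultdict


def failures_by_reason(batch_status):
    """Map each error message to the ids of the items that reported it."""
    reasons = defaultdict(list)
    for item in batch_status:
        if item.get("status") != "error":
            continue
        message = item.get("message", "unknown")  # older batches carry no message
        messages = message if isinstance(message, list) else [message]
        # dict.fromkeys drops repeated messages while keeping their order.
        for reason in dict.fromkeys(messages):
            reasons[reason].append(item["id"])
    return dict(reasons)
```

Applied to the response quoted at the top of this issue, the item would land under "unknown", which is exactly the situation that made the failure hard to diagnose.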

o1da commented 9 years ago

A message with more error info has been added; newly ingested items will show it.

klokantech / embedr

Ingest produces error without message #62