ucsdlib / damspas

UC San Diego DAMS Hydra Head

Run header check for images _2.jpg derivatives #325

Closed gamontoya closed 7 years ago

gamontoya commented 7 years ago

Descriptive summary

Create a report: Run a script to crawl _2.jpg derivatives and look for those served with Content-Type: application/octet-stream.

Re-generate those derivatives to be Content-Type: image/jpeg

Note: Report should include the Collection Name so that Gabriela will know which collections to re-harvest at Calisphere.
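The crawl step described above can be sketched as a small script. The derivative URL pattern and the (ark, collection) input rows are assumptions based on the URLs quoted later in this thread, not a confirmed interface:

```python
# Hypothetical header-check sketch. The URL pattern and the
# (ark, collection) report rows are assumptions, not confirmed paths.
from urllib.request import Request, urlopen

DERIVATIVE_URL = "https://library.ucsd.edu/dc/object/{ark}/_2.jpg"

def has_wrong_mime_type(content_type):
    # The expected header for a JPEG derivative is image/jpeg.
    return content_type.split(";")[0].strip().lower() != "image/jpeg"

def served_content_type(ark):
    # HEAD request so only the headers are read, not the image bytes.
    req = Request(DERIVATIVE_URL.format(ark=ark), method="HEAD")
    with urlopen(req) as resp:
        return resp.headers.get("Content-Type", "")

def report(rows):
    # rows: iterable of (ark, collection). Keeping the collection name
    # in the output shows which collections need re-harvesting.
    return [(ark, coll) for ark, coll in rows
            if has_wrong_mime_type(served_content_type(ark))]
```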

Rationale

Calisphere will not ingest/harvest our collections if it finds that an image has the wrong mime type (Content-Type: application/octet-stream)

HTTP/1.1 200 OK
Date: Thu, 15 Jun 2017 23:51:32 GMT
Server: Apache/2.2.15 (Red Hat)
Cache-Control: max-age=0, private, must-revalidate
Content-Disposition: inline; filename=bb0478952m_2.jpg
X-Powered-By: Phusion Passenger 5.0.30
Content-Length: 350401
Status: 200 OK
Content-Type: application/octet-stream
mcritchlow commented 7 years ago

Please work with @ucsdlib/operations as needed.

rstanonik commented 7 years ago

What URL/app does Calisphere use to harvest from us? The Content-Type headers aren't in the _2.jpg file, but are added based on metadata from somewhere (the triplestore or Solr).

hweng commented 7 years ago

@gamontoya Do you use https://library.ucsd.edu/dc/object/bb0717941z/_2.jpg/download to harvest at Calisphere?

@rstanonik In HTTP, the server sends the MIME type in the Content-Type header at the beginning of each response. I found that in damspas, file_controller.rb sets headers['Content-Type'] = 'application/octet-stream' when rendering the file:

https://github.com/ucsdlib/damspas/blob/develop/app/controllers/file_controller.rb#L58

which could be changed to Content-Type: image/jpeg.
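The actual fix would be in the Ruby controller, but the idea — derive the header from the file extension instead of falling back to a hardcoded application/octet-stream — can be sketched in Python:

```python
# Sketch only: maps a derivative filename to a Content-Type, keeping
# octet-stream purely as a last-resort fallback for unknown extensions.
import mimetypes

def content_type_for(filename, fallback="application/octet-stream"):
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or fallback
```

With this mapping, a file like bb0478952m_2.jpg would be served as image/jpeg, and only genuinely unknown files would fall back to the generic type.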

gamontoya commented 7 years ago

@hweng We use: https://library.ucsd.edu/dc/object/bb0717941z/_2.jpg

Do you need anything else for this ticket?

hweng commented 7 years ago

@gamontoya Does the _2.jpg image itself look fine to you? I checked several _2.jpg images with the "application/octet-stream" MIME type and they look fine to me. The _2.jpg file itself doesn't include a MIME type; the MIME type is determined by JHOVE and inserted into the triplestore. If the root problem is caused not by derivative generation with ImageMagick but by JHOVE, then we could try re-running JHOVE to fix the "application/octet-stream" values in the triplestore.
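If the bad value was written by JHOVE rather than being a problem with the image itself, the scope of the re-run can be found by scanning the stored mimeType values. A minimal sketch, assuming the triplestore records are available as (ark, mime_type) pairs (the actual predicates aren't shown in this thread):

```python
def arks_needing_jhove_rerun(records):
    # records: iterable of (ark, mime_type) values from the triplestore.
    # The files themselves look fine, so only entries carrying the
    # generic octet-stream type need a JHOVE re-run.
    return [ark for ark, mime in records
            if mime == "application/octet-stream"]
```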

gamontoya commented 7 years ago

@hweng The _2.jpg looks fine. The problem is that the server reports Content-Type: application/octet-stream instead of Content-Type: image/jpeg, so your suggestion to re-run JHOVE might solve the issue.

hweng commented 7 years ago

@gamontoya I tried both ways for the Dr. Seuss Political Cartoons collection on QA:

But it doesn't fix the problem: the mimeType is still "application/octet-stream" for _2.jpg in the triplestore, for example bb2185136v in the Dr. Seuss Political Cartoons collection: http://libraryqa.ucsd.edu/dc/collection/bb2185136v/data#tview

Could you run the above example collection through Damsmanager and see what you get? If the mimeType metadata cannot be fixed this way, I will figure out another way to fix it.

gamontoya commented 7 years ago

@hweng Okay, I'm on it.

gamontoya commented 7 years ago

@hweng I tried both methods listed above and I still got the application/octet-stream.

Note: When Longshou recently recreated a couple of _2 derivatives for me, it fixed the problem. What did he do differently?

hweng commented 7 years ago

@gamontoya Here is a list of ARKs, with their collections, whose .jpg derivatives have Content-Type: application/octet-stream: jpg_with_wrong_mime_type.txt

gamontoya commented 7 years ago

@hweng Not bad, only 89 total to fix.

hweng commented 7 years ago

@gamontoya However, they are distributed over 12 collections, which have to be run one collection at a time.

hweng commented 7 years ago

@gamontoya The "Content-Type: application/octet-stream" problem doesn't exist only in _2.jpg; it is also found in other derivatives. I've been re-generating derivatives and re-running JHOVE for the problem derivatives, which are distributed over 12 collections.

We don't yet have a function that takes a list of ARKs (for the problem files) and runs derivative creation and JHOVE on just those. Our available procedure is to go collection by collection: delete the existing files, then run derivative creation for the whole collection, which is not very efficient. Matt and I talked about adding this function to DAMS 5.
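Since the available procedure runs per collection, one way to keep the 89 fixes down to 12 runs is to group the problem ARKs by collection first. A sketch, assuming the report rows are (ark, collection) pairs:

```python
from collections import defaultdict

def group_by_collection(rows):
    # rows: iterable of (ark, collection) from the problem-file report.
    # Each collection then needs only one derivative-creation run.
    grouped = defaultdict(list)
    for ark, collection in rows:
        grouped[collection].append(ark)
    return dict(grouped)
```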

So far, 11 collections are done; the SIO collection is still processing. I will keep you updated.

hweng commented 7 years ago

@gamontoya All problem .jpg files have been re-created and replaced. I ran a query against the Solr index and found 0 results for "Content-Type: application/octet-stream" associated with .jpg, and I also looked at some randomly picked files to make sure they are good. Would you check and let me know if it looks fine to you?

And https://library.ucsd.edu/dc/object/bb0717941z/_2.jpg will be used to re-harvest at Calisphere, correct?
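The Solr verification can be sketched as building the check query; the core name and field names below are assumptions — the thread only shows the Solr host and port:

```python
from urllib.parse import urlencode

# The host appears elsewhere in this thread; the core name and field
# names are hypothetical placeholders, not the real damspas schema.
SOLR_SELECT = "http://lib-metadata:8983/solr/dams/select"

def octet_stream_check_url(base=SOLR_SELECT):
    params = {
        "q": 'mime_type_tesim:"application/octet-stream" AND file_id_tesim:*.jpg',
        "rows": 0,   # only numFound matters for the zero-results check
        "wt": "json",
    }
    return base + "?" + urlencode(params)
```

A numFound of 0 for this query is the "0 results" check described above.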

gamontoya commented 7 years ago

@hweng Excellent. I'll check now and if I don't find any, I will close this ticket.

Yes, Calisphere will use the _2.jpg file version for harvesting.

hweng commented 7 years ago

@gamontoya There is one last step: syncing lib-hydratail-prod:8983/solr to lib-metadata:8983/solr. I have Ron running the sync now. After the sync is done, I will run a query against lib-metadata:8983/solr to make sure it is fine.

hweng commented 7 years ago

@gamontoya The sync is done. I ran a query against lib-metadata:8983/solr and it looks good to me now.

gamontoya commented 7 years ago

@hweng The sync was done to prod?

hweng commented 7 years ago

@gamontoya Yes. Both lib-hydratail-prod:8983/solr and lib-metadata:8983/solr are prod. The latter serves the public.

gamontoya commented 7 years ago

@hweng Perfect! Great work Huawei.

hweng commented 7 years ago

@gamontoya Derivatives have also been re-generated and JHOVE re-extracted for the 283 items from the Lambert collection that were missing technique metadata. I just checked them in Solr and they look fine now. Could you try again? Thanks!

gamontoya commented 7 years ago

@hweng Thank you. I'll try the re-harvest now.

gamontoya commented 7 years ago

@hweng All images successfully harvested. Thank you Huawei for your patience!