Open thomasegense opened 2 years ago
Ah, the code was not falling back on the original served content type when format ID returned application/octet-stream
(only when format ID explicitly failed and returned an empty string). I've added a test that I think reproduced the behaviour, and modified the code to fix the issue.
Is it possible for you to verify it's fixed when running on real data?
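For readers following along, the fallback described above can be sketched roughly as follows. This is only an illustration and not the indexer's actual code: the function name `resolve_content_type` and the `GENERIC_TYPES` set are hypothetical, and it assumes both the identified type and the served type are available as plain strings.

```python
# Hypothetical sketch of the fallback rule; not the real indexer code.
# Types that carry no useful information about the payload.
GENERIC_TYPES = {"", "application/octet-stream"}


def resolve_content_type(identified, served):
    """Prefer the identified type, but fall back on the served type
    when identification failed (empty) or was generic (octet-stream)."""
    ident = (identified or "").strip().lower()
    if ident in GENERIC_TYPES and served:
        return served.strip().lower()
    return ident


# The case from this issue: identification yields octet-stream,
# but the server declared video/mp4.
print(resolve_content_type("application/octet-stream", "video/mp4"))
```

The point of the sketch is that both the empty string and `application/octet-stream` are treated as "no answer", so either one triggers the fallback.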
Thanks, I will try testing with the latest master branch.
@anjackson Sorry, but the bug is still here. I built the latest version of master with your fix.
Here is a small WARC that has the video (and a few other resources): https://drive.google.com/file/d/1s7NUo0BntJgThdnwh953KLfUIKzs6zcw/view?usp=sharing
Hm, weird. Just indexed that WARC and got:
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"q":"content_type_served:\"video/mp4\"",
"indent":"on",
"fl":"url,content_type*",
"rows":"100",
"wt":"json",
"_":"1659521636272"}},
"response":{"numFound":1,"start":0,"docs":[
{
"content_type_ext":"mp4",
"content_type_served":"video/mp4",
"content_type":["video/mp4"],
"content_type_droid":"application/mp4",
"content_type_tika":"video/mp4",
"content_type_full":"video/mp4",
"content_type_norm":"video",
"url":"https://sommansiger.nu/img/SomManSiger_full.mp4"}]
}}
I mean, there were other problems, but that bit seemed to work.
I tried again and still got the same result. See the Solr reply below. I am using this commit:
commit 81acb31b110549dc624e39098b71af1b6320a5e3 (HEAD -> master, origin/master, origin/HEAD)
Add test for #289 and fall-back on the served content type when format ID fails.
Can you assign this to Toke? He will also try to test it (probably tomorrow).
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"q":"content_type_served:\"video/mp4\"",
"fl":"id,content_type_norm,url",
"_":"1659522744689"}},
"response":{"numFound":1,"start":0,"docs":[
{
"content_type_norm":"other",
"id":"20220803065503/lSIHRjv/b3vWGR6zGjQLMw==",
"url":"https://sommansiger.nu/img/SomManSiger_full.mp4"}]
}}
The video is still available at this live URL:
https://sommansiger.nu/img/SomManSiger_full.mp4
Here are some of the fields from Solr. The last two should have been 'video' instead.
content_type_served : "video/mp4"
content_type_full : "application/octet-stream"
content_type_ext : "mp4"
type : "Other"
content_type_norm : "other"
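A quick way to spot such records in a batch of Solr results might look like the sketch below. It assumes the documents have already been fetched and parsed into Python dicts carrying the field names shown in this thread. The helper `is_suspect` is a hypothetical name, and the two sample documents simply reuse values reported above.

```python
# Hypothetical diagnostic; field names are taken from the Solr output in this thread.
def is_suspect(doc):
    """True when the server declared a video type but the
    normalised type did not end up as 'video'."""
    served = doc.get("content_type_served", "")
    norm = doc.get("content_type_norm", "")
    return served.startswith("video/") and norm != "video"


docs = [
    # Values as reported by the correctly indexed run.
    {"content_type_served": "video/mp4",
     "content_type_full": "video/mp4",
     "content_type_norm": "video"},
    # Values as reported by the failing run.
    {"content_type_served": "video/mp4",
     "content_type_full": "application/octet-stream",
     "content_type_norm": "other"},
]

suspects = [d for d in docs if is_suspect(d)]
print(len(suspects), "suspect document(s)")
```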
It seems about 14% of mp4 videos are classified incorrectly. In the Danish archive, this query:
content_type_ext:mp4 AND content_type_norm:(other OR video)
gives: Video: 2,860,146; Other: 462,163
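As a sanity check on the proportion, the share of mp4 records that ended up as "other" can be computed directly from the two counts above. The variable names are illustrative only.

```python
# Counts reported for content_type_ext:mp4 in the Danish archive.
video = 2_860_146
other = 462_163

# Share of mp4 records whose normalised type is "other".
share = other / (video + other)
print(f"{share:.1%}")  # prints 13.9%
```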