ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

warc-indexer. video mp4 file classified as "other" #289

Open thomasegense opened 2 years ago

thomasegense commented 2 years ago

The video is still available at this live URL:

https://sommansiger.nu/img/SomManSiger_full.mp4

Here are some of the fields from Solr. The last two should have been 'video' instead.

content_type_served : "video/mp4"
content_type_full : "application/octet-stream"
content_type_ext : "mp4"
type : "Other"
content_type_norm : "other"

It seems about 14% of mp4 videos are classified incorrectly. In the Danish archive, this query:

content_type_ext:mp4 AND content_type_norm:(other OR video)

gives: Video: 2,860,146, Other: 462,163
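For reference, the share of mp4 documents landing in "other" can be checked directly from the two facet counts above. This is only a small arithmetic sketch; the variable names are illustrative and not part of warc-indexer or Solr.

```python
# Facet counts copied from the query result reported above.
video_count = 2_860_146
other_count = 462_163

# All documents with content_type_ext:mp4 that fell into either bucket.
total = video_count + other_count

# Fraction of mp4 documents normalised to "other" instead of "video".
other_share = other_count / total

print(f"total mp4 documents: {total:,}")
print(f"classified as other: {other_share:.1%}")
```

With these counts the share works out to roughly 13.9% of 3,322,309 documents.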

anjackson commented 2 years ago

Ah, the code was not falling back on the original served content type when format ID returned application/octet-stream (only when format ID explicitly failed and returned an empty string). I've added a test that I think reproduced the behaviour, and modified the code to fix the issue.
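The difference between the old and the fixed behaviour described here can be sketched roughly as follows. This is a minimal illustration in Python, not the actual warc-indexer code (which is Java); the function names and the `GENERIC_TYPES` set are hypothetical and only model the fallback rule as described in this comment.

```python
# Hypothetical set of "no useful answer" results from format identification.
GENERIC_TYPES = {"", "application/octet-stream"}


def resolve_before_fix(identified: str, served: str) -> str:
    """Old behaviour as described: fall back only on an empty result."""
    if identified == "" and served:
        return served
    return identified


def resolve_after_fix(identified: str, served: str) -> str:
    """Fixed behaviour as described: also fall back when format ID
    only returns the generic application/octet-stream type."""
    if identified in GENERIC_TYPES and served:
        return served
    return identified


# The case from this issue: format ID gives the generic type,
# while the server declared the resource as video/mp4.
print(resolve_before_fix("application/octet-stream", "video/mp4"))
print(resolve_after_fix("application/octet-stream", "video/mp4"))
```

Under these assumptions the first call keeps the generic type, which would then normalise to "other", while the second returns the served type.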

Is it possible for you to verify it's fixed when running on real data?

thomasegense commented 2 years ago

Thanks, I will try testing with the latest master branch.

thomasegense commented 2 years ago

@anjackson Sorry, but the bug is still here. I built the latest version of master with your fix.

Here is a small WARC that has the video (and a few other resources): https://drive.google.com/file/d/1s7NUo0BntJgThdnwh953KLfUIKzs6zcw/view?usp=sharing

anjackson commented 2 years ago

Hm, weird. Just indexed that WARC and got:

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"content_type_served:\"video/mp4\"",
      "indent":"on",
      "fl":"url,content_type*",
      "rows":"100",
      "wt":"json",
      "_":"1659521636272"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "content_type_ext":"mp4",
        "content_type_served":"video/mp4",
        "content_type":["video/mp4"],
        "content_type_droid":"application/mp4",
        "content_type_tika":"video/mp4",
        "content_type_full":"video/mp4",
        "content_type_norm":"video",
        "url":"https://sommansiger.nu/img/SomManSiger_full.mp4"}]
  }}

I mean, there were other problems, but that bit seemed to work.

thomasegense commented 2 years ago

I tried again and still got the same result. See the Solr reply below. I am using this commit:

commit 81acb31b110549dc624e39098b71af1b6320a5e3 (HEAD -> master, origin/master, origin/HEAD)
Add test for #289 and fall-back on the served content type when format ID fails.

Can you assign this to Toke? He will also try to test it (probably tomorrow).

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"content_type_served:\"video/mp4\"",
      "fl":"id,content_type_norm,url",
      "_":"1659522744689"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "content_type_norm":"other",
        "id":"20220803065503/lSIHRjv/b3vWGR6zGjQLMw==",
        "url":"https://sommansiger.nu/img/SomManSiger_full.mp4"}]
  }}
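A reply like this can also be checked mechanically, which may help when comparing results between two setups. The sketch below is only an illustration: `misclassified_urls` is a hypothetical helper, and the JSON is a trimmed copy of the `response` part of the reply above.

```python
import json


def misclassified_urls(solr_response: dict, expected_norm: str = "video") -> list:
    """Return the URLs of documents whose content_type_norm differs
    from the expected normalised type."""
    docs = solr_response.get("response", {}).get("docs", [])
    return [
        doc.get("url")
        for doc in docs
        if doc.get("content_type_norm") != expected_norm
    ]


# Trimmed copy of the Solr reply above (responseHeader omitted).
reply = json.loads("""
{
  "response": {"numFound": 1, "start": 0, "docs": [
    {
      "content_type_norm": "other",
      "id": "20220803065503/lSIHRjv/b3vWGR6zGjQLMw==",
      "url": "https://sommansiger.nu/img/SomManSiger_full.mp4"
    }
  ]}
}
""")

for url in misclassified_urls(reply):
    print("not classified as video:", url)
```

An empty result from the helper would indicate that every returned document was normalised to "video".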