ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Support for Arabic language in warc-indexer -> Solr fields #291

Open thomasegense opened 2 years ago

thomasegense commented 2 years ago

I am not sure if this is a duplicate of an existing issue.

When you harvest this URL: https://www.youtube.com/watch?v=Hnrdfb6HiK0

The title field in Solr is: "title":"سيدة الصبر - المرأة العراقية - كريم العراقي - احمد الثرواني - YouTube",

Other fields, such as keywords, have the same issue.

anjackson commented 2 years ago

This appears to be a problem with Apache Tika, as I get the same results using that directly...

12:39 $ tika watch_v_Hnrdfb6HiK0.html
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB">
<head>
<link rel="shortcut icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon.ico" type="image/x-icon"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_32x32.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_48x48.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_96x96.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_144x144.png"/>
<link rel="stylesheet" href="//fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&amp;family=YouTube+Sans:wght@300..900&amp;display=swap"/>
<link rel="stylesheet" href="/s/player/7a7465f5/www-player.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-main-desktop-watch-page-skeleton.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-main-desktop-player-skeleton.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-onepick.css"/>
<link rel="search" type="application/opensearchdescription+xml" href="https://www.youtube.com/opensearch?locale=en_GB"/>
<link rel="manifest" href="/manifest.webmanifest"/>
<link rel="canonical" href="https://www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" media="handheld" href="https://m.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="shortlinkUrl" href="https://youtu.be/Hnrdfb6HiK0"/>
<link rel="alternate" href="android-app://com.google.android.youtube/http/www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" href="ios-app://544007664/vnd.youtube/www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" type="application/json+oembed" href="https://www.youtube.com/oembed?format=json&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHnrdfb6HiK0"/>
<link rel="alternate" type="text/xml+oembed" href="https://www.youtube.com/oembed?format=xml&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHnrdfb6HiK0"/>
<link rel="image_src" href="https://i.ytimg.com/vi/Hnrdfb6HiK0/maxresdefault.jpg"/>

<link href="https://www.youtube.com/embed/Hnrdfb6HiK0"/>

<meta name="og:image" content="https://i.ytimg.com/vi/Hnrdfb6HiK0/maxresdefault.jpg"/>
<meta name="og:image:width" content="1280"/>
<meta name="twitter:card" content="player"/>
<meta name="og:site_name" content="YouTube"/>
<meta name="keywords" content="احمد الثرواني, كريم العراقي, سيدة الصبر, المرأة, شعر, كاظم الساهر, العراق, شعر �صيح, عمل شعري, شعر عربي, Mbc, الشرقية, العراقية, مامون النطاح, شعر شعبي, الثرواني, العراقي, الصبر, سيدة, نساء, جمال, احمد, كريم, شعر جميل, الثرواني احمد, العراقي كريم, الثرواني الثرواني, Bbc, Cnn, دبي, بغداد, مشاهير, صابر الرباعي, ام كلثوم, �يروز, ماجدة الرومي, الشعر, حب, الحب, العشق, عشق"/>
<meta name="twitter:url" content="https://www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<meta name="twitter:app:url:ipad" content="vnd.youtube://www.youtube.com/watch?v=Hnrdfb6HiK0&amp;feature=applinks"/>
<meta name="og:description" content="عمل شعري للمرأة العراقية - الشاعر كريم العراقي و الشاعر احمد الثرواني. تصويرChris Goslig Hans-Ole KirkGotYouBack ApSمونتاجغيث سلمان �كرة وتن�يذ هدى علوانالمو..."/>
<meta name="twitter:player" content="https://www.youtube.com/embed/Hnrdfb6HiK0"/>
<meta name="dc:title" content="سيدة الصبر - المرأة العراقية - كريم العراقي - احمد الثرواني - YouTube"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
...

Note that the reported Content-Encoding (ISO-8859-1) is wrong: the page is actually UTF-8.
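To illustrate why a wrong encoding guess matters for fields like title and keywords, here is a minimal sketch using only the JDK. It is not Tika or warc-indexer code; the class and method names (`MojibakeDemo`, `misdecode`, `repair`) are made up for this example. It decodes UTF-8 bytes as ISO-8859-1, which turns each Arabic letter into two unrelated Latin-1 characters, and then shows that the damage is reversible as long as no bytes were dropped along the way.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decode UTF-8 bytes with the wrong charset, as would happen if the
    // detected encoding is ISO-8859-1 but the page is really UTF-8.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the original bytes via ISO-8859-1,
    // then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First word of the title, written with Unicode escapes so the
        // source file itself is encoding-independent.
        String title = "\u0633\u064A\u062F\u0629 - YouTube";
        String garbled = misdecode(title);
        System.out.println(garbled.equals(title));          // false
        System.out.println(repair(garbled).equals(title));  // true
    }
}
```

The round trip works here because Java's ISO-8859-1 maps every byte value to a character. A decoder that leaves some byte values undefined would lose those bytes and make the repair impossible.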

anjackson commented 2 years ago

I suspect this is down to the buffer size used by the CharsetDetector. There's a lot of gumpf before the UTF-8 shows up.

Not sure if this is really a bug or if we should find a way to configure a larger buffer/markLimit.

Linking the example HTML file: https://gist.github.com/anjackson/5bf6945b8b557ace07f5cd1d64cbcc4f
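The buffer-size suspicion can be sketched with a toy detector, again using only the JDK. This is a hypothetical stand-in, not the algorithm used by Tika or its CharsetDetector: `PrefixDetectDemo`, `guess`, `samplePage`, and the limits 8192 and 65536 are all assumptions chosen for illustration. The detector inspects only the first `limit` bytes; if that prefix is pure ASCII it has no evidence for UTF-8 and falls back to ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PrefixDetectDemo {
    // Toy detector: looks only at the first `limit` bytes of the input.
    static Charset guess(byte[] data, int limit) {
        byte[] prefix = Arrays.copyOf(data, Math.min(limit, data.length));
        boolean nonAscii = false;
        for (byte b : prefix) {
            if (b < 0) { nonAscii = true; break; }  // high bit set
        }
        if (!nonAscii) {
            // An all-ASCII prefix cannot distinguish UTF-8 from Latin-1,
            // so fall back to a default.
            return StandardCharsets.ISO_8859_1;
        }
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(prefix.length);
        // endOfInput=false: a multi-byte character cut off by the limit
        // is treated as incomplete input, not as an error.
        CoderResult r = dec.decode(ByteBuffer.wrap(prefix), out, false);
        return r.isError() ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_8;
    }

    // Builds a UTF-8 page: ASCII-only padding followed by Arabic text.
    static byte[] samplePage(int asciiPadding) {
        char[] pad = new char[asciiPadding];
        Arrays.fill(pad, ' ');
        String page = new String(pad) + "\u0633\u064A\u062F\u0629";
        return page.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = samplePage(20000);
        System.out.println(guess(page, 8192));   // ISO-8859-1
        System.out.println(guess(page, 65536));  // UTF-8
    }
}
```

With 20000 bytes of ASCII before the first Arabic character, the small limit never reaches any non-ASCII bytes, while the larger limit does. If the same mechanism is at work in the real detector, raising the buffer/markLimit would be expected to help.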