Open thomasegense opened 2 years ago
This appears to be a problem with Apache Tika, as I get the same results using that directly...
12:39 $ tika watch_v_Hnrdfb6HiK0.html
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB">
<head>
<link rel="shortcut icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon.ico" type="image/x-icon"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_32x32.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_48x48.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_96x96.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_144x144.png"/>
<link rel="stylesheet" href="//fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&family=YouTube+Sans:wght@300..900&display=swap"/>
<link rel="stylesheet" href="/s/player/7a7465f5/www-player.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-main-desktop-watch-page-skeleton.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-main-desktop-player-skeleton.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-onepick.css"/>
<link rel="search" type="application/opensearchdescription+xml" href="https://www.youtube.com/opensearch?locale=en_GB"/>
<link rel="manifest" href="/manifest.webmanifest"/>
<link rel="canonical" href="https://www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" media="handheld" href="https://m.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="shortlinkUrl" href="https://youtu.be/Hnrdfb6HiK0"/>
<link rel="alternate" href="android-app://com.google.android.youtube/http/www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" href="ios-app://544007664/vnd.youtube/www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" type="application/json+oembed" href="https://www.youtube.com/oembed?format=json&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHnrdfb6HiK0"/>
<link rel="alternate" type="text/xml+oembed" href="https://www.youtube.com/oembed?format=xml&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHnrdfb6HiK0"/>
<link rel="image_src" href="https://i.ytimg.com/vi/Hnrdfb6HiK0/maxresdefault.jpg"/>
<link href="https://www.youtube.com/embed/Hnrdfb6HiK0"/>
<meta name="og:image" content="https://i.ytimg.com/vi/Hnrdfb6HiK0/maxresdefault.jpg"/>
<meta name="og:image:width" content="1280"/>
<meta name="twitter:card" content="player"/>
<meta name="og:site_name" content="YouTube"/>
<meta name="keywords" content="اØمد الثرواني, كريم العراقي, سيدة الصبر, المرأة, شعر, كاظم الساهر, العراق, شعر Ù�صيØ, عمل شعري, شعر عربي, Mbc, الشرقية, العراقية, مامون النطاØ, شعر شعبي, الثرواني, العراقي, الصبر, سيدة, نساء, جمال, اØمد, كريم, شعر جميل, الثرواني اØمد, العراقي كريم, الثرواني الثرواني, Bbc, Cnn, دبي, بغداد, مشاهير, صابر الرباعي, ام كلثوم, Ù�يروز, ماجدة الرومي, الشعر, Øب, الØب, العشق, عشق"/>
<meta name="twitter:url" content="https://www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<meta name="twitter:app:url:ipad" content="vnd.youtube://www.youtube.com/watch?v=Hnrdfb6HiK0&feature=applinks"/>
<meta name="og:description" content="عمل شعري للمرأة العراقية - الشاعر كريم العراقي Ùˆ الشاعر اØمد الثرواني. تصويرChris Goslig Hans-Ole KirkGotYouBack ApSمونتاجغيث سلمان Ù�كرة وتنÙ�يذ هدى علوانالمو..."/>
<meta name="twitter:player" content="https://www.youtube.com/embed/Hnrdfb6HiK0"/>
<meta name="dc:title" content="سيدة الصبر - المرأة العراقية - كريم العراقي - اØمد الثرواني - YouTube"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
...
Note that the reported Content-Encoding (ISO-8859-1) is wrong: the page is actually UTF-8.
I suspect this is down to the buffer size used by the CharsetDetector. There's a lot of gumpf before the UTF-8 shows up.
Not sure if this is really a bug, or if we should find a way to configure a larger buffer/markLimit.
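To illustrate the suspicion, here is a toy sketch in Python. It is not Tika's actual detector: `guess_charset` and `mark_limit` are hypothetical names, and falling back to ISO-8859-1 for an all-ASCII sample is an assumption made for the demo. It only shows how a detector that inspects a limited prefix could miss UTF-8 content that appears later in the page.

```python
import codecs


def guess_charset(data: bytes, mark_limit: int) -> str:
    """Toy detector (hypothetical): inspects only the first mark_limit bytes."""
    sample = data[:mark_limit]
    if all(b < 0x80 for b in sample):
        # Pure ASCII sample: nothing distinguishes UTF-8 from ISO-8859-1,
        # so this sketch assumes a single-byte fallback.
        return "ISO-8859-1"
    # An incremental decoder tolerates a multi-byte sequence cut off at the
    # end of the sample, but still rejects genuinely invalid UTF-8.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return "ISO-8859-1"
    return "UTF-8"


# A long ASCII-only <head> preamble followed by a UTF-8 encoded Arabic title.
head = b"<html><head>" + b'<link rel="icon" href="/favicon.ico"/>' * 500
page = head + "<title>سيدة الصبر</title></head></html>".encode("utf-8")

print(guess_charset(page, 8192))       # sample is all ASCII
print(guess_charset(page, len(page)))  # sample reaches the Arabic text
```

With the small limit the sample never reaches the first non-ASCII byte, so the toy detector has no evidence for UTF-8. Whether Tika behaves this way for this file would need to be confirmed against its real detectors.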
Linking the example HTML file: https://gist.github.com/anjackson/5bf6945b8b557ace07f5cd1d64cbcc4f
I am not sure if this is a duplicate of an existing issue.
When you harvest this URL: https://www.youtube.com/watch?v=Hnrdfb6HiK0
The title field in solr is: title":"سيدة الصبر - المرأة العراقية - كريم العراقي - اØمد الثرواني - YouTube",
Other fields, such as keywords, have the same issue.
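The garbling in these fields is consistent with UTF-8 bytes being decoded with a single-byte Latin charset. A minimal Python sketch of that mechanism follows; `misdecode` and `repair` are hypothetical helper names, and `latin-1` is used here because it maps every byte to a character, so the round trip is lossless. The output in this report appears to come from a windows-1252 style decoding instead, where some bytes have already been replaced by "�", so the same repair would not fully recover those values.

```python
def misdecode(text: str) -> str:
    """Encode as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo misdecode(): recover the original bytes and decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "كريم العراقي"
garbled = misdecode(original)

# Each Arabic letter is two bytes in UTF-8, so it turns into two characters.
print(repr(garbled))
print(repair(garbled) == original)
```

This only demonstrates the symptom. The fix still belongs at detection time, since data that has already been indexed with replacement characters cannot be repaired this way.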