openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0
288 stars 73 forks

Default scraping of zh.wiki content should be mainland simplified characters #840

Open Popolechien opened 5 years ago

Popolechien commented 5 years ago

There are several versions of Chinese coexisting on zh.wikipedia (mainland simplified, Hong Kong traditional, Macao traditional, Malaysian simplified, Singapore simplified, Taiwan traditional). The wiki switches easily from one to the other, as only characters are affected (not grammar).

I'm not entirely sure how this works on our end, but from the many reports I get, we're not serving people with mainland traditional (大陆简体) content, which is both easier to use and has far more readers. It would therefore be nice to scrape it correctly. I also remember this very issue being raised a while back, but I could not find it anywhere.

This bug was reported a long time ago at https://sourceforge.net/p/kiwix/feature-requests/857/ and https://sourceforge.net/p/kiwix/bugs/733/

kelson42 commented 5 years ago

@Popolechien In which flavour is the wiki code on zh.wikipedia written?

Popolechien commented 5 years ago

If it's anything like German, in the variant used by the initial editor. But as I said, you can change your settings so that it defaults to one script or the other, and you never know who wrote what in which variant. With our Latin alphabet, the closest comparison I can think of would be changing fonts, or automatically replacing the German "ß" with the Swiss "ss".

Popolechien commented 5 years ago

@fantasticfears might be able to help.

kelson42 commented 5 years ago

@Popolechien OK, so we have different flavours in the wiki code; it's autodetected and then displayed the way the user wants?

Popolechien commented 5 years ago

yup.

kelson42 commented 5 years ago

@Popolechien So what is currently the flavour of the WPZH ZIM file? A bug from 2014 reported that it was simply mixed (like the wiki code), but I'm not sure this is still the case. Anyway, the Wikimedia backend does not support this kind of transformation on demand. See https://phabricator.wikimedia.org/T43716

Popolechien commented 5 years ago

Just checked, and the last ZIM we have (zh_all_2019_06) seems to be mainland simplified. At this stage we should simply wait for a native user/speaker to clarify things.

erickguan commented 5 years ago

This is called language variants. The conversion process basically includes script conversion and word conversion. I tried wikipedia_zh_basketball_nopic_2019-06.zim and believe it's rendered from the source script, so it's not what people see online. Does Kiwix grab the source script and render it locally?

erickguan commented 5 years ago

@ISNIT0 Maybe Joe has some recommendations?

kelson42 commented 5 years ago

@fantasticfears As written a long time ago in the other ticket, language variants are not supported by Parsoid. So for now, there is nothing we can do.

erickguan commented 5 years ago

That's sad, but WMF's rendering engine is switching to PHP. How much work would it take for us to use the PHP parser?

kelson42 commented 5 years ago

@fantasticfears If Parsoid/MCS moves to PHP, that will probably require little or no work on our side, as the API is stable. But all of this is in the future and we have no control over it; see the upstream ticket in my earlier comment.

erickguan commented 5 years ago

The original ticket for language variants is here: https://phabricator.wikimedia.org/T43716. The PHP Parsoid port is ongoing, but language variant support will be separate work; we can check that out later when it takes less effort.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

erickguan commented 4 years ago

Happy new year! It's time to restart this topic. Parsoid-PHP is deployed, but that doesn't entirely move things forward for us. I examined the downloader, and I think the visualeditor endpoint won't expose what we need, but mobile-sections does the trick. Accept-Language needs to be set to zh-cn to get our dump.

The returned value, however, is a scramble:

<p data-mw-variant-lang=\"zh-cn\">可<span data-mw-variant-lang=\"zh-hans\" data-mw-variant=\"{&quot;twoway&quot;:[{&quot;l&quot;:&quot;zh-hans&quot;,&quot;t&quot;:&quot;說&quot;},{&quot;l&quot;:&quot;zh-cn&quot;,&quot;t&quot;:&quot;说&quot;}],&quot;rt&quot;:true}\">说</span>是一<span data-mw-variant-lang=\"zh-hans\" data-mw-variant=\"{&quot;twoway&quot;:[{&quot;l&quot;:&quot;zh-hans&quot;,&quot;t&quot;:&quot;種&quot;},{&quot;l&quot;:&quot;zh-cn&quot;,&quot;t&quot;:&quot;种&quot;}],&quot;rt&quot;:true}\">种</span>衍生自<span data-mw-variant-lang=\"zh-hans\" data-mw-variant=\"{&quot;twoway&quot;:[{&quot;l&quot;:&quot;zh-hans&quot;,&quot;t&quot;:&quot;親&quot;},{&quot;l&quot;:&quot;zh-cn&quot;,&quot;t&quot;:&quot;亲&quot;}],&quot;rt&quot;:true}\">亲</span>人之<span data-mw-variant-lang=\"zh-hans\" data-mw-variant=\"{&quot;twoway&quot;:[{&quot;l&quot;:&quot;zh-hans&quot;,&quot;t&quot;:&quot;間&quot;},{&quot;l&quot;:&quot;zh-cn&quot;,&quot;t&quot;:&quot;间&quot;}],&quot;rt&quot;:true}\">间</span>的<span data-mw-variant-lang=\"zh-hans\" data-mw-variant=\"{&quot;twoway&quot;:[{&quot;l&quot;:&quot;zh-hans&quot;,&quot;t&quot;:&quot;強&quot;},{&quot;l&quot;:&quot;zh-cn&quot;,&quot;t&quot;:&quot;强&quot;}],&quot;rt&quot;:true}\">强</span>烈<span data-mw-variant-lang=\"zh-hans\" data-mw-variant=\"{&quot;twoway&quot;:[{&quot;l&quot;:&quot;zh-hans&quot;,&quot;t&quot;:&quot;關愛&quot;},{&quot;l&quot;:&quot;zh-cn&quot;,&quot;t&quot;:&quot;关爱&quot;}],&quot;rt&quot;:true}\">关爱</span>、<a href=\"/wiki/忠诚\" title=\"忠诚\" class=\"mw-redirect\" data-mw-variant-orig=\"忠誠\">忠<span data-mw-variant-lang=\"zh-hans\" data-mw-variant=\"{&quot;twoway&quot;:[{&quot;l&quot;:&quot;zh-hans&quot;,&quot;t&quot;:&quot;誠&quot;},{&quot;l&quot;:&quot;zh-cn&quot;,&quot;t&quot;:&quot;诚&quot;}],&quot;rt&quot;:true}\">诚</span></a>及善意的情感

Would you mind a PR with some cleanup code in the downloader? I would also like to hear some tips and tricks about how you deploy mwoffliner for scraping (especially the downloader config part).
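
For illustration, a minimal sketch of such cleanup, assuming it runs on the raw (unescaped) HTML string coming back from the API; the function name is invented and this is not mwoffliner's actual downloader code:

// Hypothetical cleanup sketch, not mwoffliner's actual code. Once a concrete
// variant (e.g. zh-cn) has been requested via Accept-Language, the visible
// text is already converted, so the per-character variant markup is redundant
// in a static dump. This unwraps the variant <span>s (keeping their
// already-converted inner text) and drops leftover variant attributes on
// other elements such as links. Assumes non-nested variant spans, as in the
// snippet above.
function stripVariantMarkup(html: string): string {
  return html
    .replace(/<span[^>]*\bdata-mw-variant[^>]*>(.*?)<\/span>/g, '$1')
    .replace(/\s+data-mw-variant(?:-lang|-orig)?="[^"]*"/g, '');
}

Applied to the snippet above, this would leave just the plain converted sentence (可说是一种衍生自亲人之间的强烈关爱、忠诚及善意的情感) with an ordinary link.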

danielzgtg commented 4 years ago

It is not a good idea to just convert everything to Simplified Chinese.

First of all, such conversion is lossy and irreversible. For example, both 發 (fa1, to send) and 髮 (fa3, hair) simplify to 发. It would be hard to recover this information if we do this.

Simplification is also a very sensitive political topic, and it would be better to not risk offending or excluding anybody. By leaving the characters as-is, we can tell anyone who complains to look at the source article, no matter which side they support.

I recommend sticking to the original suggestion in the SourceForge ticket: do not do any conversion in mwoffliner, but add a converter to each client. This would be consistent with what the online zh.wikipedia.org is doing. It would give users the freedom to choose and not impose a decision on them.

erickguan commented 4 years ago

> I recommend sticking to the original suggestion in the SourceForge ticket: do not do any conversion in mwoffliner, but add a converter to each client. This would be consistent with what the online zh.wikipedia.org is doing. It would give users the freedom to choose and not impose a decision on them.

No, that's not how zh.wikipedia.org works. zhwiki_p works with huge sets of manually defined rules and a prefix-matching algorithm to convert original articles into the expected language variants. This is not perfect and mistakes sometimes happen. Nevertheless, it has worked for more than a decade now.

> First of all, such conversion is lossy and irreversible. For example, both 發 (fa1, to send) and 髮 (fa3, hair) simplify to 发. It would be hard to recover this information if we do this.

As you can see from the snippet, the conversion rule is defined in data-mw-variant. There are rules for dealing with this type of error; they are, however, manually maintained on zhwiki_p.
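
As a toy illustration of how such prefix-matching rules can resolve the ambiguity danielzgtg raised (the table and names below are invented for the example; zhwiki's real rule sets are far larger and hand-maintained):

// Toy longest-prefix-match converter; illustrative only, not zhwiki's code.
// Word-level rules resolve ambiguous characters: 发 maps back to either
// 發 or 髮 depending on the surrounding word.
const toTraditional: Record<string, string> = {
  '头发': '頭髮', // word rule: 发 here means "hair" (髮)
  '发生': '發生', // word rule: 发 here means "to happen" (發)
  '发': '發',     // single-character fallback
};

function convert(text: string, rules: Record<string, string>): string {
  const maxLen = Math.max(...Object.keys(rules).map((r) => r.length));
  let out = '';
  let i = 0;
  while (i < text.length) {
    let matched = false;
    // Try the longest possible match first so word rules beat character rules.
    for (let len = Math.min(maxLen, text.length - i); len > 0; len--) {
      const chunk = text.slice(i, i + len);
      if (chunk in rules) {
        out += rules[chunk];
        i += len;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out += text[i]; // pass unknown characters through unchanged
      i += 1;
    }
  }
  return out;
}

console.log(convert('头发发生', toTraditional)); // 頭髮發生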

danielzgtg commented 4 years ago

> Nevertheless, it has worked for more than a decade now.

OK, if it works. I'm fine with this as long as I can still read Kiwix in Traditional Chinese (臺灣正體).

erickguan commented 4 years ago

What you are reading now in Kiwix is probably the source text. With the feature I propose implemented, you should be able to read the variant you intend, for example Traditional Chinese. But I notice that the issue's title differs from what I propose: what I want to achieve is at least two language-variant dumps for zhwiki.

Popolechien commented 4 years ago

@fantasticfears Feel free to rename the issue if you feel that your solution solves it AND doesn't throw Traditional Chinese out the window (which would definitely be best). Generally speaking, it would probably be better to simply rename it to "zh.wikipedia should also be accessible in Simplified Chinese" or something similar.

danielzgtg commented 4 years ago

@Popolechien Is there a typo in the issue body?

The title is like this:

> Default scraping of zh.wiki content should be mainland simplified characters

Do you mean simplified instead of traditional here (traditional is discouraged in the mainland)?

> […] not serving people with mainland traditional […]

The Hanzi say "mainland simplified":

> 大陆简体

You are saying simplified in ⅔ of the places, but traditional in ⅓ of the places.

Popolechien commented 4 years ago

@danielzgtg To clarify: I opened the ticket with the intent of having articles available in 简体字 (simplified) instead of 繁体字 (traditional) everywhere, which seems to be the current default setting.

erickguan commented 4 years ago

@danielzgtg I can add some context as well. People in the Mainland read Simplified Chinese via a variant setting (mainland simplified, as you said). zhwiki works by providing six localized variants based on regions. The source text can be either simplified or traditional, or even a combination. These six variants use different localized words, so they are manually mapped by a set of rules. The pipeline roughly works like this: 1) convert to simplified/traditional based on the variant setting, 2) map words based on the dictionary (a toy sketch follows below).

My intent would be to provide simplified and traditional (based on the Taiwan variant) dumps. If readers ask for additional variants, adding them would be as easy as adding config. So this differs from what @Popolechien intends, but my implementation should cover what you want. I don't want to make the simplified version the default but to provide an option for people who read simplified. IMO, there isn't a need to make anything the default; it's not hard for readers to choose what they want.
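
A standalone toy sketch of that two-step pipeline (the tables and names below are invented examples; the real rule sets are large and hand-maintained):

// Toy two-step variant pipeline: 1) script conversion, 2) regional word
// mapping. Example: produce the zh-cn variant from traditional-script
// source text that uses Taiwan wording.
const tradToSimpChars = new Map<string, string>([
  ['網', '网'],
  ['軟', '软'],
  ['體', '体'],
]);
const cnWords = new Map<string, string>([
  ['网路', '网络'], // Taiwan word for "network" -> mainland word
  ['软体', '软件'], // Taiwan word for "software" -> mainland word
]);

function applyTable(text: string, table: Map<string, string>): string {
  let out = text;
  for (const [from, to] of table) {
    out = out.split(from).join(to); // naive global replace; fine for a sketch
  }
  return out;
}

const source = '網路軟體';                          // zh-tw source, traditional script
const step1 = applyTable(source, tradToSimpChars);  // '网路软体' (script converted)
const step2 = applyTable(step1, cnWords);           // '网络软件' (zh-cn wording)
console.log(step2);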

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

dsd commented 4 years ago

Just curious: what is the current state of the ZIM file? From reading the above, I expected it to be in Traditional Chinese for Taiwan (which is incidentally what I'm looking for right now), with future plans to make mainland/simplified the default, plus different regional variants available.

However, I downloaded wikipedia_zh_top_mini_2020-05.zim and opened it in WebArchives. It looks like it is mainland/simplified; e.g. the page on England says

英国 大不列颠及北爱尔兰联合王国...

So that's using simplified characters, e.g. 国 and 颠 (unlike the zh-tw live version).

Popolechien commented 4 years ago

I'm looking at https://library.kiwix.org/wikipedia_zh_top_mini_2020-05; looking up Germany, the US, and other countries, I see the titles in simplified (德国, 美国, etc.) and then the text in traditional (德意志聯邦共和, 美利堅合眾). Also, all articles on vehicles use 車 (trad.), not 车 (simpl.).

Seems like there's quite a mishmash of both scripts.

dsd commented 4 years ago

You're right. Thanks for clarifying. I didn't spot the mishmash earlier due to my low familiarity with simplified characters.

dsd commented 4 years ago

Trying to iterate on @erickguan's suggestion, I see that Parsoid is working well:

wget --header="Accept-Language: zh-cn" "https://zh.wikipedia.org/api/rest_v1/page/html/%E8%8B%B1%E5%9B%BD"

This gets a mainland Chinese version (simplified), and if I change to Accept-Language: zh-tw I get a Taiwan version (traditional); I checked by opening the downloaded HTML in a browser. (The <title> is using simplified in both cases, but the content is good.)

So is the next step here as simple as adding an --accept-language parameter to mwoffliner, then inserting the requested language into the HTTP headers in Downloader.ts getRequestOptionsFromUrl? (A sketch of this appears at the end of the thread.)

Popolechien commented 4 years ago

@kelson42 @bakshiutkarsha this is way out of my league, but would the above fix make sense to you?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stevenling commented 4 years ago

I think we can support both Simplified Chinese and Traditional Chinese.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
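
As a sketch of dsd's proposal above (hypothetical code: the real getRequestOptionsFromUrl in mwoffliner's Downloader.ts may have a different signature, and --accept-language is the flag proposed in the thread, not an existing one):

// Hypothetical sketch of the proposed change, not mwoffliner's actual code.
interface RequestOptions {
  url: string;
  method: 'GET';
  headers: Record<string, string>;
}

class Downloader {
  // acceptLanguage would come from the proposed --accept-language CLI flag,
  // e.g. 'zh-cn' (mainland simplified) or 'zh-tw' (Taiwan traditional).
  constructor(private readonly acceptLanguage?: string) {}

  getRequestOptionsFromUrl(url: string): RequestOptions {
    const headers: Record<string, string> = {};
    if (this.acceptLanguage) {
      // Parsoid performs the variant conversion server-side when this header
      // names a concrete zh variant (see the wget test above).
      headers['Accept-Language'] = this.acceptLanguage;
    }
    return { url, method: 'GET', headers };
  }
}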