kiwix / kiwix-android

Kiwix for Android
https://android.kiwix.org
GNU General Public License v3.0

Title search appears to be broken with Chinese characters (possibly all UTF-8 multibyte characters) #3587

Closed Jaifroid closed 8 months ago

Jaifroid commented 9 months ago

A user on Reddit has reported that search in Chinese text is no longer working on Android v3.8.1, in both the Google Play and the APK versions.

Assuming this issue can be reproduced (should be easy with a Chinese ZIM, and using search for one of the titles given by the Random button, if that works), then I would suspect that UTF-8 3-byte (most Chinese characters) and UTF-8 4-byte character codes are somehow not being catered for when reading the search field.
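For readers unfamiliar with the encoding details, here is a minimal Python sketch of what "3-byte" and "4-byte" mean. It only illustrates UTF-8 itself and says nothing about how the Android app actually reads the search field; the helper name `utf8_lengths` is made up for this example.

```python
def utf8_lengths(text: str) -> list[tuple[str, int]]:
    """Return each character paired with the number of bytes it occupies in UTF-8."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# Common Chinese characters sit in the Basic Multilingual Plane: 3 bytes each.
print(utf8_lengths("毛泽东"))

# U+20000 (CJK Extension B) lies outside the BMP and needs 4 bytes.
print(utf8_lengths("\U00020000"))

# Cutting the byte string in the middle of a character corrupts the text:
raw = "毛泽东".encode("utf-8")
print(len(raw))                                    # 9
print(raw[:4].decode("utf-8", errors="replace"))   # '毛' followed by U+FFFD
```

Any code that treats the query as a sequence of single bytes, or truncates it at a byte offset, would mangle such text in this way.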

kelson42 commented 9 months ago

Yes, this is a duplicate of https://github.com/openzim/libzim/issues/794

wdscxsj commented 8 months ago

As a Chinese user of kiwix-android v3.9.1, I'm afraid the title search is also broken, so this issue may not be an exact duplicate.

For example, when I look up "毛泽东" (Mao Zedong in English, i.e. Chairman Mao of the PRC) in the Chinese Wikipedia (all-maxi version, 2023-09), there is no match.

If I try again character by character, the first character "毛" triggers a long list of matches. (I suppose "毛泽东" is listed there, but not among the first few dozen.) On my phone the third match is "毛一公", so let's enter "一" after "毛". This time there is again no match.

I believe this is still related to character encoding and text tokenization, as pointed out by @xiaoyifang.
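As an illustration only, the Python sketch below shows one way a tokenization mismatch could produce the symptom described above (one character matches, two characters do not). It assumes a toy index that stores one term per character and a query parser that splits on whitespace only. Both tokenizers and the index are hypothetical; this is not how libzim or Xapian is confirmed to behave.

```python
def char_tokens(text: str) -> list[str]:
    """Hypothetical index-time tokenizer: one term per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def whitespace_tokens(text: str) -> list[str]:
    """Hypothetical query-time tokenizer: split on whitespace only."""
    return text.split()


def build_index(titles: list[str]) -> dict[str, set[str]]:
    """Map every term produced by char_tokens to the titles containing it."""
    index: dict[str, set[str]] = {}
    for title in titles:
        for term in char_tokens(title):
            index.setdefault(term, set()).add(title)
    return index


def search(index: dict[str, set[str]], query: str, tokenizer) -> set[str]:
    """Return the titles that contain every query term (AND semantics)."""
    terms = tokenizer(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(["毛泽东", "毛一公"])

# A single character happens to be a term in the index, so it matches.
print(search(index, "毛", whitespace_tokens))

# "毛一" is looked up as one term, which the index never stored: no match.
print(search(index, "毛一", whitespace_tokens))

# With the same tokenizer on both sides, the lookup succeeds.
print(search(index, "毛一", char_tokens))
```

The point of the sketch is only that indexing and querying must split CJK text the same way, whatever the actual cause turns out to be.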

Jaifroid commented 8 months ago

I corroborate this. I've done a comparison of the Android app, the PWA (on Android) and Kiwix Desktop (on Windows). I used the wikipedia_zh_history_maxi_2023-12.zim ZIM and searched for 第一次世界大戰 (First World War).

For Kiwix Desktop and the PWA on Android, basic title search works for this term, and we get the same two results (first two images). For Kiwix Android, we get no results (third image). This corroborates that title search is also broken for Chinese text on Android, so I'm re-opening.

Only Kiwix Desktop can do a full-text search for this term. Although it indicates at the top of the display that no results were found, it in fact displays various correct results that include snippets. The PWA uses libzim for full-text search, but is unable to do full-text search for Chinese. I suspect there is an issue with how the text is transferred to libzim, and I'll open a new issue for that in the appropriate repo.

[Images: search results in Kiwix Desktop, the PWA on Android, and Kiwix Android]

xiaoyifang commented 8 months ago

https://github.com/openzim/libzim/issues/802

I think ZIMs that contain CJK text and were created with libzim before 8.2.1 should all have this issue.

wdscxsj commented 8 months ago

@xiaoyifang Thanks for the pointer! I wonder if this is related to the missing English characters issue in any way. It's still a big issue in the latest all-maxi Chinese Wikipedia zim.

Jaifroid commented 8 months ago

The ZIM I tested was created in December, whereas that PR was merged back in June. Do we know which libzim is currently being used in MWoffliner?

kelson42 commented 8 months ago

> openzim/libzim#802
>
> I think ZIMs that contain CJK text and were created with libzim before 8.2.1 should all have this issue.

This is fixed, but MWoffliner, the scraper for Wikipedia, still uses an old version of libzim. Everything works fine here. We just need to complete https://github.com/openzim/mwoffliner/pull/1702

Jaifroid commented 8 months ago

But just to point out that this issue relates to title search not working on the Android app with UTF-8 multibyte characters, rather than Xapian search, which is what was fixed by https://github.com/openzim/libzim/issues/802. Or maybe the Android app doesn't have title search any more (which is a shame if so, and a problem for searching any ZIM that doesn't have a Xapian index -- surely that can't be the case)?

Jaifroid commented 8 months ago

I don't want to belabour the point, but I tested title search in wikipedia_zh_medicine-app_maxi_2023-12.zim. NB This ZIM does NOT have a Xapian index. I was unable to search for 心房顫動 (atrial fibrillation) in the Android app, whereas this term is found in other apps. @kelson42 Is this a separate issue, or should we re-open this issue, or have I misunderstood something? https://github.com/openzim/libzim/pull/806 only appears to fix Xapian-based search, from my reading of the code, but I may be wrong.

kelson42 commented 8 months ago

> But just to point out that this issue relates to title search not working on the Android app with UTF-8 multibyte characters, rather than Xapian search, which is what was fixed by openzim/libzim#802. Or maybe the Android app doesn't have title search any more (which is a shame if so, and a problem for searching any ZIM that doesn't have a Xapian index -- surely that can't be the case)?

@Jaifroid Our ZIM files, at Kiwix, have two title indexes, see https://wiki.openzim.org/wiki/Search_indexes. If the Xapian one is there, then the native ZIM one is ignored, which is the right thing to do.

kelson42 commented 8 months ago

> I don't want to belabour the point, but I tested title search in wikipedia_zh_medicine-app_maxi_2023-12.zim.

No problem, but to move forward we need to be very precise about what we do. For example, here you take a special non-public ZIM file (only for apps) which is not among those first reported. By doing so, you fundamentally change the scope of the bug report, and that does not really make things easier.

> NB This ZIM does NOT have a Xapian index.

It does have "a Xapian index": one for title suggestions, but not a full-text Xapian index. This is done on purpose because:

> I was unable to search for 心房顫動 (atrial fibrillation) in the Android app, whereas this term is found in other apps. @kelson42 Is this a separate issue, or should we re-open this issue, or have I misunderstood something? openzim/libzim#806 only appears to fix Xapian-based search, from my reading of the code, but I may be wrong.

I believe I have already given the reason why it does not work. Which other apps have you tested with? Might it be that this is one which works with the ZIM native title index?

Jaifroid commented 8 months ago

Thanks for the further explanations, they help pinpoint the potential scope of this issue. It's clearly a serious problem for Chinese users.

> Which other apps have you tested with? Might it be that this is one which works with the ZIM native title index?

I tested with Kiwix Android, Kiwix Desktop (Windows) 2.3.1-2, and Kiwix PWA. The last two can do title search on the Chinese medicine ZIM. The Android app can't. I chose that ZIM to test because I thought it would narrow down the issue.

However, since it does indeed contain a Xapian non-FT index (something I was unaware of), it seems likely it should work with the Android app once the fix is in production. If not, we can revisit after.

I'm not sure I agree that ignoring binary search of the title index is good behaviour for an app. It should be the last fallback IMHO. At least for KJS apps, searching Xapian indices is very slow, so will always be secondary to binary title search unless we can speed things up.
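To make the idea of a binary title search as a fallback concrete, here is a minimal Python sketch. It assumes the titles are available as a list sorted in code point order, and that the primary search is some callable returning a list of titles. The names `title_prefix_search`, `search_with_fallback` and `primary_search` are hypothetical and do not correspond to any libzim or Kiwix API.

```python
import bisect
from typing import Callable


def title_prefix_search(sorted_titles: list[str], prefix: str, limit: int = 10) -> list[str]:
    """Return up to `limit` titles starting with `prefix` from a sorted list."""
    # Binary search for the first title that is >= prefix; all titles sharing
    # the prefix are contiguous from that position onward.
    start = bisect.bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break
        matches.append(title)
    return matches


def search_with_fallback(
    primary_search: Callable[[str], list[str]],
    sorted_titles: list[str],
    query: str,
) -> list[str]:
    """Try the primary (e.g. index-based) search; fall back to the title list if it finds nothing."""
    results = primary_search(query)
    return results if results else title_prefix_search(sorted_titles, query)


titles = sorted(["心房顫動", "心臟", "第一次世界大戰", "第二次世界大戰", "毛泽东"])
print(title_prefix_search(titles, "第"))
print(search_with_fallback(lambda q: [], titles, "心房"))
```

Because the comparison works on whole strings, multibyte characters need no special handling in this kind of lookup.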

kelson42 commented 8 months ago

> I tested with Kiwix Android, Kiwix Desktop (Windows) 2.3.1-2, and Kiwix PWA. The last two can do title search on the Chinese medicine ZIM.

I don't know about Kiwix Desktop 2.3.1-2, but I tested with the cutting-edge (dev) version of Kiwix Desktop and it does not work, which is normal (I checked just now because your sentence worried me). You should test with a ZIM made with a recent version of libzim, like https://library.kiwix.org/viewer#gutenberg_zh_all_2023-12/ ... and then things work as they should.

Jaifroid commented 8 months ago

OK, sorry to have worried you ☹️. I tried testing just now with kiwix-desktop_x86_64_2024-01-14.appimage, but when I try to search it just gives me the warning "No stemming for language 'zh'" in the console (running from a terminal in my Ubuntu WSL). This is probably due to something missing in WSL, and not something to worry about. So, let's leave this until the next official release of Kiwix Android and re-test then.