itkach / aard2-android

Aard2 for Android, a simple dictionary app
GNU General Public License v3.0

Search is case insensitive #168

Closed 6801318d8d closed 8 months ago

6801318d8d commented 8 months ago

Wiktionary has 2 entries for "lanky": one lowercase (https://en.m.wiktionary.org/wiki/lanky), which means "tall, slim".

One capitalized (https://en.m.wiktionary.org/wiki/Lanky), which means "the dialect of English spoken in Lancashire".

In Aard2 it is impossible to access the lower case version.

itkach commented 8 months ago

In Aard2 it is impossible to access the lower case version.

It is possible. Both "lanky" and "Lanky" are in the result list, closest match first: if the lookup was "Lanky", it comes first and "lanky" is further down the list; if you type "lanky", that comes first, but "Lanky" is also in the list. To open the desired article, either select the entry from the lookup result list, or select any entry and then swipe left or right to navigate through the articles in the result list.
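
The ordering described above can be illustrated with a minimal Python sketch. This is not Aard2's actual lookup code; the function name `lookup` and the use of `str.casefold()` are assumptions made purely for illustration.

```python
def lookup(keys, query):
    """Return keys equal to query ignoring case, closest match first.

    Hypothetical helper for illustration only, not Aard2's implementation.
    """
    folded = query.casefold()
    matches = [k for k in keys if k.casefold() == folded]
    # Exact-case matches sort first (False < True). The sort is stable,
    # so the remaining matches keep their original relative order.
    return sorted(matches, key=lambda k: k != query)


keys = ["Lanky", "lank", "lanky", "lanyard"]
print(lookup(keys, "lanky"))  # ['lanky', 'Lanky']
print(lookup(keys, "Lanky"))  # ['Lanky', 'lanky']
```

In this sketch, if a key is missing from the key list altogether, only the other variant is returned: `lookup(["Lanky"], "lanky")` gives `['Lanky']`.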

6801318d8d commented 8 months ago

It is possible. Both "lanky" and "Lanky" are in the result list

No they aren't

[Screenshots: Screenshot_20231010_175145_Aard 2, Screenshot_20231010_175156_Aard 2]

itkach commented 8 months ago

No they aren't

They are for me with enwiktionary-NS0-20220420-ENTERPRISE-HTML.slob. Which enwiktionary do you have? There were reports at https://groups.google.com/g/aarddict recently that the number of entries in Wikipedia's enterprise dumps decreased for some reason. Perhaps "lanky" is absent from the enwiktionary version you have?

IMAGE 2023-10-10 13:42:49

IMAGE 2023-10-10 13:41:18

6801318d8d commented 8 months ago

No they aren't

Which enwiktionary do you have?

enwiktionary-20230921.slob

itkach commented 8 months ago

enwiktionary-20230921.slob

This one is ~515M and has 2 404 348 items; mine is 2.6G and has 7 103 761. Comparing the 20230920 enterprise HTML dump to the one from 20220420, I see the new one is much smaller: 2.7G instead of 9.7G. @MHBraun I know such a dramatic reduction was noted before for some other wikis and wiktionaries, but perhaps this is the case for all the enterprise dumps now.
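
As a rough sanity check, the two item counts quoted above can be compared directly. The sketch below is only an illustration: the function names and the 0.9 threshold are hypothetical, and the counts are typed in by hand rather than read from the .slob files.

```python
def coverage(new_count, reference_count):
    """Fraction of the reference build's items present in the new build."""
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return new_count / reference_count


def looks_incomplete(new_count, reference_count, threshold=0.9):
    """Flag a build whose item count falls below an assumed threshold."""
    return coverage(new_count, reference_count) < threshold


# Item counts quoted in this thread: 20230921 build vs. 20220420 build.
ratio = coverage(2_404_348, 7_103_761)
print(f"{ratio:.1%}")                          # 33.8%
print(looks_incomplete(2_404_348, 7_103_761))  # True
```

By this measure the newer file holds only about a third of the items in the older one.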

6801318d8d commented 8 months ago

enwiktionary-20230921.slob

This one is ~515M and has 2 404 348 items; mine is 2.6G and has 7 103 761. Comparing the 20230920 enterprise HTML dump to the one from 20220420, I see the new one is much smaller: 2.7G instead of 9.7G. @MHBraun I know such a dramatic reduction was noted before for some other wikis and wiktionaries, but perhaps this is the case for all the enterprise dumps now.

so? Did I download the wrong file? I don't understand

itkach commented 8 months ago

so?

So the issue is with the dictionary file itself, not the aard2-android application.

Did I download the wrong file?

It's not "wrong" in that it is made from the latest data Wikimedia provided and it's a valid dictionary. But for some reason the latest Wikimedia data dump is missing A LOT of entries, so it is perhaps not as good as the older versions of the dictionary if you are looking for the most complete data set. I don't know whether Wikimedia pruned the dumps on purpose, to reduce the size at the expense of articles they deem not good or important enough, or whether it is a technical issue at Wikimedia. Perhaps a technical issue; this looks relevant: https://phabricator.wikimedia.org/T305407

itkach commented 8 months ago

@6801318d8d try this one: https://dl.aarddict.org/enwiktionary-NS0-20220420-ENTERPRISE-HTML.slob

MHBraun commented 8 months ago

This is correct. For some reason for a lot of NS0 dumps since 20230701 there are a lot of missing articles. The version of 20230601 is fine for all languages. For some reason the versions 20231001 seem to be complete. This applies for wikis only. I had to fully scrape the wiktionaries (going back to the old method) in order to generate wiktionaries which have full data set. Hence 202310xx slob files should be fine. Please also see https://groups.google.com/g/aarddict/c/FtETxLIAeV8

I will update on https://groups.google.com/g/aarddict/ with further findings

6801318d8d commented 8 months ago

Let's make order, shall we?

  1. For some reason for a lot of NS0 dumps since 20230701 there are a lot of missing articles.

So not all dumps since 20230701 are broken, just some / a lot of them. Did we cherry pick the broken ones and report them as a bug? How they differ from the non-broken ones?

  2. For some reason the versions 20231001 seem to be complete. This applies for wikis only.

So some/a lot of wikipedias and wiktionary dumps >= 20230701 are broken? Wikipedia dumps were fixed with >= 20231001, while wiktionary dumps are still broken?

  3. I had to fully scrape the wiktionaries (going back to the old method) in order to generate wiktionaries which have full data set. Hence 202310xx slob files should be fine.

So we are not using dumps anymore. We are webscraping and thus versions >= 202310 are fine, right?

MHBraun commented 8 months ago

Let's make order, shall we?

What do you want to order?

For some reason for a lot of NS0 dumps since 20230701 there are a lot of missing articles.

So not all dumps since 20230701 are broken, just some / a lot of them. Did we cherry pick the broken ones and report them as a bug? How they differ from the non-broken ones?

The broken ones have less data. :) The problem is known. See the Phabricator link above.

For some reason the versions 20231001 seem to be complete. This applies for wikis only.

So some/a lot of wikipedias and wiktionary dumps >= 20230701 are broken? Wikipedia dumps were fixed with >= 20231001, while wiktionary dumps are still broken?

See https://groups.google.com/g/aarddict for details. The number of articles in the Wikipedia dumps seems to be correct. The number of articles in the Wiktionary dumps is too low.

I had to fully scrape the wiktionaries (going back to the old method) in order to generate wiktionaries which have full data set. Hence 202310xx slob files should be fine.

So we are not using dumps anymore. We are webscraping and thus versions >= 202310 are fine, right?

Correct. I am now web-scraping the wiktionaries to get all the data. It does not make sense to generate slob files that are incomplete. Whether all 202310 wiktionaries are fine is not yet clear, because the process is not finished; it takes more than a week to generate them. However, the ones that are finished seem to be good.

6801318d8d commented 8 months ago

I have downloaded enwiktionary-20231014.slob. If I search for "lanky", only the capitalized word is shown, not the lowercase one.

So the exact same bug occurs with this dictionary too.

MHBraun commented 8 months ago

Can confirm. So just stay with enwiktionary-20230601.slob

MHBraun commented 7 months ago

Problem solved. enwiktionary-20231027.slob and onwards should be fine. All 7.5 million articles are included, lanky as well :) I moved away from the dumps and scraped Wiktionary directly. This is massively time-consuming, as it takes three weeks to get the data, but it gives correct results.

itkach commented 7 months ago

@MHBraun thank you!

6801318d8d commented 7 months ago

Problem solved. enwiktionary-20231027.slob and onwards should be fine. All 7.5 million articles are included, lanky as well :) I moved away from the dumps and scraped Wiktionary directly. This is massively time-consuming, as it takes three weeks to get the data, but it gives correct results.

Thank you. Any news on the underlying bug in the dumps?

MHBraun commented 7 months ago

Unfortunately not.

It is not clear whether the missing-articles issue reported on Wikimedia's Phabricator is being handled at all.

The latest wiktionary dumps are still missing a substantial number of articles, so they are not usable to generate a wiktionary.slob.

