quran / quran_android

a quran reading application for android
http://android.quran.com
GNU General Public License v3.0

Improving search #453

Closed asyazwan closed 5 years ago

asyazwan commented 10 years ago

In reference to #449 and #447.

Search over the Quran text currently behaves in unclear ways. For example, searching for لل does not return الله but does return لله.

I would like to improve this to be more intuitive as part of my academic project.

Areas to check:

I have not delved into the search code so I can't comment on the technical feasibility yet. Any advice is appreciated, especially on the UI aspect.

ahmedre commented 9 years ago

right now, search uses sqlite full text search, and just matches on search entry* (or %entry% if it's an older schema database). the problem is, sqlite doesn't really have good "text searching" things (like stemming, stop words, etc).

my guess is this work should consist of two things:

i don't think the ui needs too much at the moment - the bigger issues are the actual search issues due to the algorithm, etc.

afarra commented 9 years ago

Another point: exact matches should get higher priority. E.g. if you search for غلا you will get 6 results with غلام before the ayah you're looking for, "وَلَا تَجْعَلْ فِي قُلُوبِنَا غِلًّا لِلَّذِينَ آمَنُوا". Higher priority should be given to this exact match over the %like% ones.

On Oct 7, 2014 4:27 AM, "ahmedre" notifications@github.com wrote:

right now, search uses sqlite full text search, and just matches on search entry* (or %entry% if it's an older schema database). the problem is, sqlite doesn't really have good "text searching" things (like stemming, stop words, etc).

  • with/without diacritics - we search without diacritics by default, but i guess if you were to enter a query with diacritics, it wouldn't match anything.
  • regex - theoretically should be doable. haven't tried it though. not sure what it will buy you though.

my guess is this work should consist of two things:

  1. figuring out what we're doing now and why it is not working well (and for what cases it doesn't work well). this logic lives mostly in DatabaseHandler.java https://github.com/quran/quran_android/blob/master/app/src/main/java/com/quran/labs/androidquran/database/DatabaseHandler.java, but some of it also lives in QuranDataProvider.java https://github.com/quran/quran_android/blob/master/app/src/main/java/com/quran/labs/androidquran/data/QuranDataProvider.java and SearchActivity.java https://github.com/quran/quran_android/blob/master/app/src/main/java/com/quran/labs/androidquran/SearchActivity.java .
  2. decide on what makes sense to do. this may either include adding new tables to search through (maybe with mapping of some expected wrong spellings of words to their correct spellings), or changing the query per whatever makes the most sense.

i don't think the ui needs too much at the moment - the bigger issues are the actual search issues due to the algorithm, etc.


ahmedre commented 9 years ago

another from email: "when we search عيسئ, we find 16 results about him (Jesus). but supposed to get 25 results about Jesus."

ahmedre commented 9 years ago

another example: alakhsaroon.

asyazwan commented 9 years ago

Problem

The core problem is we are using MATCH when available (code).

Match supports only token or token-prefix queries. Taking عيس as example:

-- #1
SELECT sura, ayah, snippet(verses, '<font color="">', '</font>', '<b>...</b>', -1, 64) FROM verses WHERE text MATCH "عيس" LIMIT 150;

-- #2
SELECT sura, ayah, snippet(verses, '<font color="">', '</font>', '<b>...</b>', -1, 64) FROM verses WHERE text MATCH "عيس*" LIMIT 150;

-- #3
SELECT sura, ayah, text FROM verses WHERE text LIKE "%عيس%" LIMIT 150;
  1. Searches for exactly the word عيس, with no prefix or suffix. It returned 0 results; I don't know why yet.
  2. Searches for عيس followed by any suffix. 9 occurrences of عيس in the Qur'an carry a prefix and so are not matched, hence again yielding only 16 results. One example is وعيسى (in 2:136), where the prefix is و. Searching with both a prefix and a suffix wildcard (i.e. *عيس*) is not supported, which is very weird. Workarounds on the net involve storing reversed strings or all substrings in the table, both of which make the tables grow far too much.
  3. For schemas without MATCH support we use LIKE, which is reportedly much slower than MATCH. But it works properly by wildcarding both prefix and suffix, thus getting 25 results.
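The difference between the three query styles can be sketched without SQLite at all. The following is a pure-Python emulation, not the app's code: the helper names are hypothetical, the token list is illustrative sample data rather than verse text, and real FTS tokenization may differ in detail.

```python
# Pure-Python emulation of the three query styles discussed above.
# Helper names are hypothetical; the tokens are sample data, not verse text.

def match_exact(query, tokens):
    """Emulates MATCH 'q': a token must equal the query."""
    return [t for t in tokens if t == query]


def match_prefix(query, tokens):
    """Emulates MATCH 'q*': a token must start with the query."""
    return [t for t in tokens if t.startswith(query)]


def match_substring(query, tokens):
    """Emulates LIKE '%q%': the query may appear anywhere."""
    return [t for t in tokens if query in t]


# Code points are written as escapes to avoid bidi rendering issues.
ISA = "\u0639\u064a\u0633\u0649"   # the name as a bare token
WA_ISA = "\u0648" + ISA            # same token with the waw prefix
QUERY = "\u0639\u064a\u0633"       # the three-letter query from the thread

tokens = [ISA, WA_ISA]
exact = match_exact(QUERY, tokens)          # nothing: no token equals the query
prefix = match_prefix(QUERY, tokens)        # only the unprefixed token
substring = match_substring(QUERY, tokens)  # both tokens
```

Only the substring style finds the prefixed token, which is consistent with the 16 versus 25 result counts reported above.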

Possible solutions

  1. Use LIKE all the time, losing snippet() support. IMO accurate search is worth losing snippet() and accepting slightly slower searches.
  2. Move the actual searching to app code. This will no doubt increase the app size considerably. Is it worth it, considering searching is not the main purpose of the app?
  3. REGEXP is not included by default in SQLite, so it's out of the question. It is also not supported on Android:

    The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named "regexp" is added at run-time, then the "X REGEXP Y" operator will be implemented as a call to "regexp(Y,X)".
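For reference, the mechanism the quoted documentation describes looks like this in Python's standard `sqlite3` module, where `create_function` registers an application-defined function. This is only a desktop sketch with placeholder rows; it says nothing about Android, where the thread concluded REGEXP is not usable.

```python
import re
import sqlite3


def regexp(pattern, value):
    # For "X REGEXP Y" SQLite calls regexp(Y, X): pattern first, then value.
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)

# Placeholder table and rows, for illustration only.
conn.execute("CREATE TABLE verses (sura INTEGER, ayah INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO verses VALUES (?, ?, ?)",
    [(1, 1, "alpha beta"), (1, 2, "walpha gamma"), (1, 3, "delta")],
)

rows = conn.execute(
    "SELECT sura, ayah FROM verses WHERE text REGEXP ? ORDER BY ayah",
    ("alpha",),
).fetchall()
```

Without the `create_function` call, the same query fails with an error about a missing `regexp` function, which is the behaviour the documentation describes.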

Scoring

We need to decide on a solution first before choosing a scoring method. If it's SQL, then most likely we need to run multiple queries, going from exact to wildcard? Other auxiliary functions like offsets() and matchinfo() are useless in our case, both usability-wise (fulltext only) and result-wise.
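One possible shape for "exact first, then wildcard" ranking, done in app code rather than as multiple SQL queries, is sketched below. The tier numbers, function names and Latin placeholder rows are all assumptions made for illustration.

```python
def score(query, text):
    """Lower is better: 0 exact token, 1 token prefix, 2 substring, None no match."""
    tokens = text.split()
    if query in tokens:
        return 0
    if any(t.startswith(query) for t in tokens):
        return 1
    if query in text:
        return 2
    return None


def ranked(query, rows):
    """Return row keys ordered by tier; sorted() is stable, so the
    original order is preserved inside each tier."""
    scored = [(score(query, text), key) for key, text in rows]
    hits = [(s, key) for s, key in scored if s is not None]
    return [key for s, key in sorted(hits, key=lambda pair: pair[0])]


# Placeholder rows standing in for (key, verse text) pairs.
rows = [
    ("a", "ghilam here"),   # prefix match
    ("b", "there ghil"),    # exact token match
    ("c", "xghilx"),        # substring match only
    ("d", "nothing"),       # no match
]
order = ranked("ghil", rows)
```

With this ordering, the exact-token row comes ahead of the prefix rows, which is the priority requested earlier in the thread.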

Thoughts?

ahmedre commented 9 years ago

sorry for the very late reply.

  1. we can do the functionality of snippet in code, so that's not the issue - the issue is performance. if it's practically a non-issue (i.e. the Quran data that we have is small enough such that a search is still pretty quick, even when searching something big like one of the arabic tafaseer (ex ibn kathir)), then we can go with this.
  2. probably best to avoid this if we can. less code means less bugs, etc :)
  3. yes, looks like we can't use regex.

an alternative approach (if we're just concerned about Arabic, since i guess other languages are less likely to have this problem) is what a developer mentions here, which is basically to make an index and search that particular index.

let's try option 1 on a large arabic tafseer. if it works, great; if not, we can perhaps consider this alternative option? also, if option 1 works generally well (on the Quran arabic text, for example) and is just slow on tafaseer, then maybe we can use a hybrid (i.e. use like for smaller databases and match for larger ones).

ahmedre commented 9 years ago

another solution (just for documentation purposes) is bundling our own sqlite with the app with the unicode stemmer enabled (if it works, otherwise we may need our own Arabic stemmer) - http://www.sqlite.org/android/doc/trunk/www/index.wiki.

this solution would only support apis 15+. regardless, seems like a ton of work, and given our small dataset, i'd rather us do something simpler.

maybe we bundle our own stemmer - so we write (or repurpose an existing) arabic stemmer, use it to generate quran.ar.db with stemmed words in a search field - so it will drop 'w', 'al', 'y', tashkeel, etc. we'll port this same stemmer on the java side and run it on the search query. this may end up being our best option...
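A very rough sketch of such a light stemmer, applied identically to the indexed words and to the query, might look like the following. It is written in Python for brevity, and everything in it is an assumption: the function names, the minimum stem length, and the folding of alef maqsura into yeh. The 'y' rule mentioned above is left out because its exact meaning is not spelled out.

```python
import unicodedata

WAW = "\u0648"              # 'w' prefix
ALEF_LAM = "\u0627\u0644"   # 'al' prefix
ALEF_MAQSURA = "\u0649"
YEH = "\u064a"


def strip_tashkeel(word):
    """Drop combining marks (Unicode category Mn), which covers the harakat."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def light_stem(word, min_len=3):
    """Strip tashkeel, fold alef maqsura into yeh, then drop a leading
    waw and a leading alef-lam if at least min_len letters remain."""
    word = strip_tashkeel(word).replace(ALEF_MAQSURA, YEH)
    for prefix in (WAW, ALEF_LAM):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
    return word


# The bare name, the waw-prefixed form, and a vocalized prefixed form.
bare = "\u0639\u064a\u0633\u0649"
prefixed = "\u0648\u0639\u064a\u0633\u0649"
vocalized = "\u0648\u064e\u0639\u0650\u064a\u0633\u064e\u0649"

stems = {light_stem(w) for w in (bare, prefixed, vocalized)}
```

All three forms collapse to one stem here, so a token match on the stemmed column would find them together. A rule this blunt will also over-stem words whose root really starts with waw, so it is a starting point at best.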

some references for "bundling our own sqlite" solution: http://stackoverflow.com/questions/26642797/android-custom-sqlite-build-cannot-open-database http://stackoverflow.com/questions/6132442/android-sqlite-r-tree-how-to-install-module

ahmedre commented 9 years ago

imo these solutions are too complex to be worth the work - i think that the best idea is to use the new quran.com search api when the device is connected.

@mmahalwy do you think alpha's quran search api is stable enough to include in quran android now or in the near future? we'd also need to pass in a "hint" as to what to search based on what languages/translations the user has on their device.

ahmedre commented 9 years ago

see also #427 about some examples of poor search quality.

mmahalwy commented 9 years ago

@ahmedre yeah, it's stable enough now. I think the remaining improvement is the transliteration, which I am waiting on from you :)

ahmedre commented 9 years ago

we should also search sura names.

ahmedre commented 7 years ago

from email: search for any word in basmallah returns the first ayah of most suras

m7mdyahia commented 7 years ago

another solution (just for documentation purposes) is bundling our own sqlite with the app with the unicode stemmer enabled (if it works, otherwise we may need our own Arabic stemmer)

The Unicode stemmer will not work for Arabic as you expect: it will not remove diacritics, and it will not treat characters such as (ي) and (ى) as equivalent. This is because the Unicode definition for Arabic does not consider removing diacritics or matching similar letters to be normalization or canonical equivalence.
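The Unicode part of this claim can be checked on a PC with Python's `unicodedata` module. Note the limits of the sketch: it only exercises the four standard normalization forms and does not test any SQLite tokenizer, so it is a partial confirmation at most.

```python
import unicodedata

ALEF_MAQSURA = "\u0649"
YEH = "\u064a"
LAM = "\u0644"
FATHA = "\u064e"   # a short-vowel diacritic

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Does any normalization form map alef maqsura onto yeh?
yeh_unified = any(
    unicodedata.normalize(form, ALEF_MAQSURA) == YEH for form in FORMS
)

# Does any normalization form drop the fatha from a vocalized letter?
fatha_removed = any(
    FATHA not in unicodedata.normalize(form, LAM + FATHA) for form in FORMS
)
```

Both flags come out `False`, so diacritic removal and letter folding have to be done by explicit, application-specific rules.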

this could be tested using a newer version of sqlite on pc (I would appreciate if someone confirmed this)

so I think that, in general, any Arabic search using SQLite FTS needs a dedicated Arabic tokenizer for SQLite; we would have to write one and then see what can be done to ship it on Android (I know this is a lot of work, but it would be very helpful to many applications)

sneetsher commented 6 years ago

Also #268 is quite related. I agree with @ahmedre: searching Arabic is a vast topic and needs many underlying tools & libraries.

The best option is to rely on another ready-made project that provides such a feature for off-line search or, if there is no perfect solution, on some on-line API, like the quran.com one mentioned above.

I would suggest that you make a generic interface that allows you to switch search provider later for an off-line solution or add another on-line provider.
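A minimal sketch of such a switchable interface is below, in Python for brevity (the app itself is Java). All class names are hypothetical, and the "online" provider is a stub that simply fails, since no real API call is assumed here.

```python
from abc import ABC, abstractmethod


class SearchProvider(ABC):
    """Anything that can turn a query into a list of (sura, ayah) keys."""

    @abstractmethod
    def search(self, query):
        ...


class LocalSubstringProvider(SearchProvider):
    """Off-line provider doing a plain substring scan over a dict."""

    def __init__(self, verses):
        self.verses = verses  # {(sura, ayah): text}

    def search(self, query):
        return [key for key, text in self.verses.items() if query in text]


class UnreachableProvider(SearchProvider):
    """Stand-in for an on-line provider when the device has no network."""

    def search(self, query):
        raise ConnectionError("no network")


class FallbackProvider(SearchProvider):
    """Try the primary provider; on a network error use the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def search(self, query):
        try:
            return self.primary.search(query)
        except OSError:  # ConnectionError is a subclass of OSError
            return self.secondary.search(query)


local = LocalSubstringProvider({(1, 1): "alpha", (1, 2): "beta"})
provider = FallbackProvider(UnreachableProvider(), local)
result = provider.search("alp")
```

The calling code only ever sees `SearchProvider`, so swapping in a different off-line engine or another on-line API later does not touch the UI.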

@m7mdyahia There is a new light Arabic stemmer, http://arabicstemmer.com/, by the same main developer as Alfanous. Not sure whether it helps.

Disclaimer: I am a user of the Android app and also a contributor to the Alfanous project (unfortunately, it's Python based). Also, its main developer Assem is my brother. Alfanous has a public JSON API and provides advanced Quran search, but AFAIK integration between open-source projects does not always work well.

m7mdyahia commented 6 years ago

@sneetsher great work on http://arabicstemmer.com/.

next steps would be trying to port the algorithm to sqlite stemmer

I agree with your suggestion about integrating with an existing api for now

maziio commented 6 years ago

Hi, I am working on the same project and I have the same problem. I want to add another example that shows more clearly why the MATCH option does not give exact results for Arabic text. Example:

SELECT rowid,doc FROM fts WHERE fts match "أعلم"
results = 49 (which is correct)

SELECT rowid,doc FROM fts WHERE fts match "أعلم*"
results = 49 (which is not correct, it must be 83)

SELECT rowid,doc FROM fts WHERE fts.doc LIKE "%أعلم%"
results = 53 (which is not correct, it must be 83)

ahmedre commented 5 years ago

For example: how do I search for فَاحْذَرُوهُ in verse 235 of surah Baqarah? The hamza after ف has a ص on top, which makes the search fail.

ahmedre commented 5 years ago

closing because fixed.