sheltermanager / asm3

Animal Shelter Manager
GNU General Public License v3.0

Fulltext search #107

Open bobintetley opened 7 years ago

bobintetley commented 7 years ago

At the moment, many fields such as comments, hidden comments, etc. are not searched in larger databases.

Add a new "searchindex" table that contains

tablename varchar
linkid integer
words text

Unique composite index on tablename and linkid.
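
The proposed table and index could be sketched like this (a minimal illustration using sqlite3 for brevity; the real schema would use ASM's own DDL conventions, and the index name is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The three columns described above: owning table, row id, ordered word list.
conn.execute("""
    CREATE TABLE searchindex (
        tablename VARCHAR(255) NOT NULL,
        linkid    INTEGER NOT NULL,
        words     TEXT
    )
""")
# Unique composite index on tablename and linkid, so each record has one entry.
conn.execute("CREATE UNIQUE INDEX ux_searchindex ON searchindex (tablename, linkid)")

conn.execute("INSERT INTO searchindex VALUES (?, ?, ?)",
             ("animal", 1, "a01 rover collie"))
print(conn.execute("SELECT words FROM searchindex WHERE tablename=? AND linkid=?",
                   ("animal", 1)).fetchone()[0])
```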

Words could potentially be very large, and indexing text fields is problematic anyway. Since this table is effectively an index itself, a full table scan when querying it is expected.

We don't keep occurrence counts because ASM is not searching documents - more occurrences of the search term does not mean more relevant in our case. Instead, use the order in which words appear in the list: the closer to the front of the list, the more relevant a match is. For an animal record, the index would be built with the code and name fields inserted first, so a match on an early word like this would indicate higher relevance when building results.
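
Position-based relevance could be as simple as this sketch (the function name and scoring formula are illustrative, not a committed design):

```python
def relevance(words: str, term: str) -> float:
    """Return a score in (0, 1]; earlier matches rank higher, 0 means no match."""
    wordlist = words.split()
    for pos, w in enumerate(wordlist):
        if w == term:
            return 1.0 - pos / len(wordlist)
    return 0.0

# Code and name were inserted first, so a match on them scores highest.
entry = "a01 rover collie friendly brown"
print(relevance(entry, "a01"))       # match at the front of the list: highest score
print(relevance(entry, "friendly"))  # match further back: lower score
```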

Saving a record to a bottom level table like animal, owner, animalwaitinglist, animalcontrol, animallost, animalfound, etc creates an entry in the fulltext table. Note that we do search some second level tables, like ownerlicence.

Have a function in search.py that handles creating these entries on insert/update of a record. It accepts a tablename, an id and a list of field contents in the order in which they should be indexed (most important first). It deletes any existing entries for the record first, then extracts the words: removing common English noise words, deduplicating the word list, stripping any punctuation and lower-casing everything.
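
The word-extraction step described above might look something like this (function name and noise word list are illustrative only; the real list would be much larger):

```python
import string

# Illustrative subset of an English noise word list.
NOISE_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def extract_words(fields):
    """Build the ordered, deduplicated word list from field contents,
    given in order of importance (most important field first)."""
    seen, words = set(), []
    for content in fields:
        for raw in (content or "").split():
            # Strip punctuation and lower-case everything.
            w = raw.translate(str.maketrans("", "", string.punctuation)).lower()
            if w and w not in NOISE_WORDS and w not in seen:
                seen.add(w)
                words.append(w)
    return words

# Code and name first, then comments: earlier fields yield earlier words.
print(extract_words(["A01", "Rover", "A friendly dog, good with the kids."]))
# ['a01', 'rover', 'friendly', 'dog', 'good', 'with', 'kids']
```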

The simple find routines should switch to use fulltext (but only search their base table). The search module does all the heavy lifting.

bobintetley commented 6 years ago

Another consideration for this module could be implementing sort collations. One Spanish user reports that they have English speakers and would like á to be treated as equivalent to a. The routine that builds the words could copy words containing accented characters and convert the copies to plain ASCII.
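
Accent folding can be done with the standard library's unicodedata: decompose to NFKD and drop the combining marks. A sketch (the helper names are illustrative):

```python
import unicodedata

def fold_accents(word: str) -> str:
    """Return a plain-ASCII-ish copy with accent marks removed."""
    return "".join(c for c in unicodedata.normalize("NFKD", word)
                   if not unicodedata.combining(c))

def index_forms(word: str):
    """Index the word itself, plus a folded copy when they differ."""
    folded = fold_accents(word)
    return [word] if folded == word else [word, folded]

print(index_forms("canción"))  # ['canción', 'cancion']
print(index_forms("dog"))      # ['dog']
```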

bobintetley commented 5 years ago

Something else to consider is phone number fields - they often contain extraneous characters like # ( ) and - that could be removed before addition to the index (and similarly stripped from the term by the search tool).
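
A sketch of that normalisation, applied to both the indexed value and the search term (the function name is hypothetical):

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip extraneous characters like # ( ) - and spaces, keeping digits and +."""
    return re.sub(r"[^0-9+]", "", raw)

print(normalize_phone("(01234) 567-890"))  # 01234567890
```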

bobintetley commented 4 years ago

We could phase the rollout of this - enable and write the code to start constructing the index, but don't switch the search code to use the index until we have done some performance testing and ironed out any bugs.

bobintetley commented 4 years ago

More to consider: we support many languages other than English, so we would need "noise" word lists for many other languages. This also needs more of a rethink, as a table scan against a text field for words is going to be slow. Ideally, the index should be upside down and based on words->rows, which is difficult to implement cleanly in a relational database. It might be worth considering some kind of very fast to parse filesystem-based index for speed (using chunks and seek), perhaps one file per table.
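
The words->rows shape is just an inverted index; an in-memory sketch of the idea (illustrative only, ignoring the persistence/chunking question raised above):

```python
from collections import defaultdict

# Map each word to the set of (tablename, linkid) rows containing it.
inverted = defaultdict(set)

def index_record(tablename, linkid, words):
    for w in words:
        inverted[w].add((tablename, linkid))

index_record("animal", 1, ["rover", "collie"])
index_record("animal", 2, ["bella", "collie"])

# Lookup is a dictionary hit per word instead of a full table scan.
print(sorted(inverted["collie"]))  # [('animal', 1), ('animal', 2)]
```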

This needs more thought.

bobintetley commented 4 years ago

This is an awful lot of running to basically stand still. The simple search module as it stands is "good enough" in terms of performance and functionality; it operates without a penalty when adding records and does not require a huge conversion of existing databases.

I'm removing the milestone from this as I see no reason to spend development time on this in the near future.

saggineumann commented 4 years ago

@bobintetley have you considered using ElasticSearch for fulltext search? ES already has built-in support for many languages, stop words, and parsing and indexing of numbers (phone numbers, for example), and there are good ways to integrate it with databases so that you wouldn't have to feed it data from within the app but rather just use it in search.

From the tests I ran with the nonprofit I'm trying to implement ASM for, it seems like searching comments is important.

bobintetley commented 4 years ago

Hi, thanks for your comments. I've looked at Lucene in the past. I didn't want to add more deployment/dependency complexity and another technology (Java) to the stack. I also realised that there's a lot of custom and implicit behaviours in the existing search that would be lost.

Searching specific comment fields in larger databases can be done via the advanced find screen.

Both MySQL and PostgreSQL do support their own form of fulltext indexing - that could be an area to investigate and would not require new dependencies. We have database-specific subclasses, so it would be neat to implement.
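
As a rough sketch of how those database-specific subclasses might emit native fulltext predicates (class and method names are illustrative, not ASM's actual API, and real code would bind the term as a parameter rather than interpolating it, to avoid SQL injection):

```python
class Database:
    def fulltext_clause(self, column: str, term: str) -> str:
        # Portable fallback: a plain LIKE scan.
        return f"{column} LIKE '%{term}%'"

class DatabasePostgres(Database):
    def fulltext_clause(self, column, term):
        # PostgreSQL native FTS via tsvector/tsquery.
        return (f"to_tsvector('english', {column}) @@ "
                f"plainto_tsquery('english', '{term}')")

class DatabaseMySQL(Database):
    def fulltext_clause(self, column, term):
        # MySQL FULLTEXT index (requires a FULLTEXT index on the column).
        return f"MATCH({column}) AGAINST('{term}')"

print(DatabasePostgres().fulltext_clause("Comments", "collie"))
```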

saggineumann commented 4 years ago

I think that ElasticSearch (which is simpler to use than Lucene) has a few benefits that aren't available in MySQL/Postgres FTS:

It doesn't require adding more technology to the stack beyond the actual setup of ES itself (i.e., you can query it from within Python easily). And since FTS is not currently used in search, I believe there would be behaviour changes if we switched to MySQL/Postgres FTS too (without the pros mentioned above). Where in the code is search handled? I'll try to investigate further.