hotosm / tasking-manager

Tasking Manager - The tool to team up for mapping in OpenStreetMap
https://wiki.openstreetmap.org/wiki/Tasking_Manager
BSD 2-Clause "Simplified" License
500 stars 270 forks source link

Project Search Full-Text Partial Word Auto-complete #1850

Open thadk opened 5 years ago

thadk commented 5 years ago

The new TM 4.0 design calls for an auto-complete project search field on the "Explore Projects" page. After research, this is commonly implemented using an Elastic Search instance which is not part of TM. A few people have tried to implement it using only PostgreSQL but there are manual steps to increase its usefulness.

image

The PostgreSQL GIN index, which is already set up for the project search, likely needs a new API created to enable the autocomplete in the design spec as it is shown. But there are some choices.

There are two things the autocomplete API could return using PostgreSQL:

  1. query suggestions: suggested words like "Malaria" or "Mozambique" as show in the design mockup. Query suggestions are made up of ngrams. Ngrams are are small sets of words together.
  2. actual projects results: actual project titles matching some of the words or partial word typed so far

There is a third option. It is mostly client-side and technically simpler. We could offer autocomplete for all titles of projects in the active search. We could add the project titles to the project/search API for all pages of the results, along with the id and map location. For 3000 projects, this should be easy to handle on the client side. We could still allow the user to ignore the autocomplete and use the full-text search too.

The fourth option is to leave the text search box as-is and/or to just make the existing project/search search support partial word prefix matching. It could have no new autocomplete popup since it already live-updates shown project cards on typing. It would be a bit slower.

Currently when the project/search is given a text query, the API only responds with projects that contain the matching whole words from the index. Instead, this new API should respond given incomplete words with ngrams or projects the GIN full text search index.

“prefix matching” using a syntax such as to_tsquery('humanit:*') as below for the test & ar... search. It can allow us to do an example search of "ssa ar" to find the results with "SSA" or "arbitrary" in my test database. (note: the main project/search currently replaces spaces with | operator): SELECT * FROM project_info WHERE text_searchable @@ (to_tsquery('ssa | ar:*')) = true;

image

For 1) my research indicates we would need to build an ngram table from the document text which we would return the query suggestions from, based on frequency of the ngrams. We might update it on inserts. I haven't seen any examples of re-using the tsvector data we have in project_info.text_searchable. Here is a in-depth guide with pseudocode: https://towardsdatascience.com/implementing-auto-complete-with-postgres-and-python-e03d34824079

For 2) I think we can mostly use the code we have and just simplify the query from project/search to make it faster with a new endpoint and add the prefix matching feature above. I don't think it would be as useful as query suggestions. This elastic search partial reimplementation might help if the simple prefix matching doesn't work: https://stackoverflow.com/questions/56894979/edge-ngram-search-in-postgresql

I think the main thing is that we not over-engineer this. Based on the research, it could be a time sink to implement this. With only about 3,000 projects, there is not really much text to search. A basic solution probably solves it.

Other resources

bopercival-hot commented 3 years ago

closed as per #1859

Aadesh-Baral commented 2 years ago

As #1859 was not merged this issue is still open. There are too many conflicts on #1859 to merge so need some rework on this.