simonw / simonwillisonblog

The source code behind my blog
https://simonwillison.net/
Apache License 2.0
188 stars 17 forks source link

Comments on "Implementing faceted search with Django and PostgreSQL" #1

Open simonw opened 6 years ago

simonw commented 6 years ago

Comments on https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/

pauloxnet commented 6 years ago

Hi Simon, and thank you for sharing your article.

We used the postgresql's full-text search module in django for some projects at work and I think that striping html tags from your Entry's body is not necessary.

I tried to apply to_tsvector on a snippet of html code from your article, before and after strip_tags, and the resulting vectors are the same:

SQL Fiddle

PostgreSQL 9.6 Schema Setup:

Query 1:

SELECT to_tsvector('<p>I''ve added <a href="https://simonwillison.net/search/">a faceted search engine</a> to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:</p><ul><li><a href="https://simonwillison.net/search/?q=postgresql">All content matching "postgresql"</a></li><li><a href="https://simonwillison.net/search/?q=django&amp;type=quotation">Just quotations matching "django"</a></li><li><a href="https://simonwillison.net/search/?q=python+javascript&amp;tag=mozilla&amp;year=2007">All content matching "python" and "javascript" with the tag "mozilla" posted in 2007</a></li></ul><p>It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.</p>')

Results:

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        to_tsvector |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007':70 'ad':3 'also':72 'appli':89 'back':86 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':104 'count':75 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,74 'feel':98 'filter':26,29,32,93 'general':97 'get':85,95 'javascript':63 'like':24 'mani':81 'match':52,56,60 'mozilla':67 'navig':107 'one':90 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':73 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':82 'search':6,18,20,49 'shape':101 'sql':23 'support':15 'tag':28,66 'tell':79 'text':17 'type':35 've':2 'vs':37,39 |

Query 2:

SELECT to_tsvector('I''ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:All content matching "postgresql"Just quotations matching "django"All content matching "python" and "javascript" with the tag "mozilla" posted in 2007It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.')

Results:

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          to_tsvector |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007it':70 'ad':3 'also':71 'appli':88 'back':85 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':103 'count':74 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,73 'feel':97 'filter':26,29,32,92 'general':96 'get':84,94 'javascript':63 'like':24 'mani':80 'match':52,56,60 'mozilla':67 'navig':106 'one':89 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':72 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':81 'search':6,18,20,49 'shape':100 'sql':23 'support':15 'tag':28,66 'tell':78 'text':17 'type':35 've':2 'vs':37,39 |
simonw commented 6 years ago

You're absolutely right! Thanks, that's really interesting. I'll update the entry to reflect this.

simonw commented 6 years ago

Added this to the post:

Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL ts_vector() function already handles tag removal. Sure enough, executing SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>') using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.

I'm leaving the regexp_replace stuff in there though as it's a nice demonstration of how to run regular expressions inside of PostgreSQL, even though they aren't actually needed here.

pauloxnet commented 6 years ago

Thank you for mentioning me.

I think you have done well to leave the regexp_replace stuff, because it is very interesting.

I think I will use some of the ideas contained in your article in my upcoming talk about full-text search in django with postgresql, obviously quoting you.