Open simonw opened 6 years ago
Hi Simon, and thank you for sharing your article.
We used the postgresql's full-text search module in django for some projects at work and I think that striping html tags from your Entry's body is not necessary.
I tried to apply to_tsvector
on a snippet of html code from your article, before and after strip_tags
, and the resulting vectors are the same:
PostgreSQL 9.6 Schema Setup:
Query 1:
SELECT to_tsvector('<p>I''ve added <a href="https://simonwillison.net/search/">a faceted search engine</a> to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:</p><ul><li><a href="https://simonwillison.net/search/?q=postgresql">All content matching "postgresql"</a></li><li><a href="https://simonwillison.net/search/?q=django&type=quotation">Just quotations matching "django"</a></li><li><a href="https://simonwillison.net/search/?q=python+javascript&tag=mozilla&year=2007">All content matching "python" and "javascript" with the tag "mozilla" posted in 2007</a></li></ul><p>It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.</p>')
| to_tsvector |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007':70 'ad':3 'also':72 'appli':89 'back':86 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':104 'count':75 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,74 'feel':98 'filter':26,29,32,93 'general':97 'get':85,95 'javascript':63 'like':24 'mani':81 'match':52,56,60 'mozilla':67 'navig':107 'one':90 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':73 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':82 'search':6,18,20,49 'shape':101 'sql':23 'support':15 'tag':28,66 'tell':79 'text':17 'type':35 've':2 'vs':37,39 |
Query 2:
SELECT to_tsvector('I''ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:All content matching "postgresql"Just quotations matching "django"All content matching "python" and "javascript" with the tag "mozilla" posted in 2007It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.')
| to_tsvector |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007it':70 'ad':3 'also':71 'appli':88 'back':85 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':103 'count':74 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,73 'feel':97 'filter':26,29,32,92 'general':96 'get':84,94 'javascript':63 'like':24 'mani':80 'match':52,56,60 'mozilla':67 'navig':106 'one':89 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':72 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':81 'search':6,18,20,49 'shape':100 'sql':23 'support':15 'tag':28,66 'tell':78 'text':17 'type':35 've':2 'vs':37,39 |
You're absolutely right! Thanks, that's really interesting. I'll update the entry to reflect this.
Added this to the post:
Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL ts_vector() function already handles tag removal. Sure enough, executing
SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>')
using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.
I'm leaving the regexp_replace stuff in there though as it's a nice demonstration of how to run regular expressions inside of PostgreSQL, even though they aren't actually needed here.
Thank you for mentioning me.
I think you have done well to leave the regexp_replace stuff, because it is very interesting.
I think I will use some of the ideas contained in your article in my upcoming talk about full-text search in django with postgresql, obviously quoting you.
Comments on https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/