mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Use localised stopwords in xapian #1575

Open crowbot opened 10 years ago

crowbot commented 10 years ago

Currently a couple of English stopwords are hardcoded in acts_as_xapian.rb: https://github.com/mysociety/alaveteli/blob/rails-3-develop/lib/acts_as_xapian/acts_as_xapian.rb#L165

We could allow for, or provide, language-specific stopword lists, e.g. https://github.com/johnl/xapian-fu/tree/master/lib/xapian_fu/stopwords
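A minimal sketch of what locale-keyed stopword lists could look like, in plain Ruby. The `STOPWORDS` table, its abbreviated word lists and the `stopwords_for` helper are all hypothetical names for illustration; none of them exist in Alaveteli or xapian-fu.

```ruby
# Hypothetical per-locale stopword lists (abbreviated, for illustration only).
# A real setup might instead load one word-per-line file per language.
STOPWORDS = {
  "en" => %w[and of the for with],
  "es" => %w[y de la el para],
  "fr" => %w[et de la le pour]
}.freeze

# Return the stopword list for a locale, falling back from a regional
# locale ("en_GB") to its base language ("en"), then to an empty list.
def stopwords_for(locale)
  key = locale.to_s
  base = key.split(/[_-]/).first.to_s
  STOPWORDS[key] || STOPWORDS[base] || []
end

stopwords_for("en_GB") # the English list, via the base-language fallback
stopwords_for("cy")    # empty, since this sketch has no Welsh list
```

The words returned for the site's locale could then be fed to the stopper that the linked code currently fills with hardcoded words. The empty-list fallback means a language without a list simply gets no stopword filtering.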

garethrees commented 9 years ago

We could do with improving the English stopwords too. Lots of words that look useless are getting included in the term lists:

Term List for record #82314: Bd_speers C Fmid_staffordshire_nhs_foundation_trust IInfoRequestEvent-72030 Kfollowup_sent Lsuccessful MInfoRequestEvent Raudit_of_accounts_2 Swaiting_response T Vsent Wno Za Zaccount Zadvis Zas Zaudit Zbeliev Zcan Zcarri Zcommiss Zd Zfaith Zfoundat Zhas Zhospit Zi Zin Zis Zlonger Zno Znow Zof Zout Zplace Zpleas Zprocedur Zprocess Zregard Zrole Zspeer Zstatus Zthank Zthe Zthis Ztrust Zwhat Zwho Zwith Zyou Zyour a accounts advise as audit auditing believe can carries commission d faithfully foundation has hospitals i in is longer no now of out place please procedure process regarding role speers status thanks the this trust what who with you yours

For example: "i, in, is, no, now, of, out, who, with, you, yours", etc.

This might well be contributing to #1179, and possibly #2137.

garethrees commented 9 years ago

Ah, actually stopwords just strip the words out of the query. I think the term list should be generated as is.

RichardTaylor commented 2 years ago

A WhatDoTheyKnow user wrote to let us know about a case where it might have been useful if "for" had been stripped from their search terms.

Their search was similar to:

https://www.whatdotheyknow.com/search/DNA%20for%20DVLA/all

All the snippets shown with the results there highlight "for" rather than the acronyms actually being searched for.
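To illustrate the effect being asked for, here is a rough plain-Ruby sketch of query-time stopword stripping. It is not how Xapian's query parser works internally; `ENGLISH_STOPWORDS` and `strip_stopwords` are hypothetical names, and the word list is an assumption chosen only to include "for".

```ruby
# Hypothetical English stopword list; "for" is the word from the report above.
ENGLISH_STOPWORDS = %w[a and for in of the with].freeze

# Drop stopwords from a free-text query, comparing case-insensitively.
# If every word is a stopword, keep the original query so the search
# is not left empty.
def strip_stopwords(query, stopwords = ENGLISH_STOPWORDS)
  words = query.split
  kept = words.reject { |word| stopwords.include?(word.downcase) }
  kept.empty? ? query : kept.join(" ")
end

puts strip_stopwords("DNA for DVLA") # prints "DNA DVLA"
```

With "for" removed from the query, only "DNA" and "DVLA" would be left to match and highlight. The indexed term list stays untouched, which fits the earlier observation that stopwords only affect the query.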