openrightsgroup / cmp-issues

Centralised issue-tracking for the Blocked backend

Expose rules used for frontend search's "exclude adult" to admins #218

Closed alexhaydock closed 5 years ago

alexhaydock commented 5 years ago

When creating lists for this report I've identified that it would be especially helpful if admins were able to edit the rules applied to filter out expected adult content from search results when the box is checked on the frontend.

I'm going through a bunch of lists created and our current attempts at filtering are not ideal as I'm having to manually remove a large number of sites with very common patterns.

It would be helpful to have the following features:

alexhaydock commented 5 years ago

I would welcome opinions from @JimKillock on this. Perhaps a better solution to ensure we don't accidentally exclude too much (and end up falling into the same "blunt filter" problem we're complaining about) is to allow a second-level search, i.e. allow another search to be executed within the results of the first search.

Once a list is created from search results, a user/admin could search within the list, with an option to batch-remove all entries matching certain keywords or patterns (ideally with a tickbox list before final removal, as a sanity check that allows each site to be toggled between removed and not removed). That would also work, though I'm cautious that this sounds like more work programming-wise.

Edit: Alternatively, as a potentially easier solution: if the "Export" option for lists on the Blocked.org.uk site was amended so that the CSV included the page title as displayed on the list page, I could download the CSV, remove unwanted entries locally, and then upload it back into the system. (I would also need a feature that let me create a list by uploading a CSV in the same format.)
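A minimal sketch of the offline purge step described above, in Python. The column names `url` and `title`, the sample rows, and the helper names `split_rows` and `to_csv` are all assumptions for illustration; the actual export format of the site is not shown in this thread.

```python
import csv
import io
import re


def split_rows(rows, patterns, fields=("url", "title")):
    """Split rows into (kept, flagged) using whole-word, case-insensitive matches."""
    if not patterns:
        return list(rows), []
    regex = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in patterns) + r")\b",
        re.IGNORECASE,
    )
    kept, flagged = [], []
    for row in rows:
        # Search the URL and title together; missing fields count as empty.
        text = " ".join(row.get(field) or "" for field in fields)
        (flagged if regex.search(text) else kept).append(row)
    return kept, flagged


def to_csv(rows, fieldnames=("url", "title")):
    """Serialise rows back to CSV text, ready for re-import."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(fieldnames), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample data standing in for an exported list (hypothetical columns).
sample = io.StringIO(
    "url,title\n"
    "http://example.com/weddings,Wedding photography studio\n"
    "http://example.org/escorts,Local escorts directory\n"
)
kept, flagged = split_rows(list(csv.DictReader(sample)), ["escorts"])

# Review the flagged rows (URL and title) before actually dropping them.
for row in flagged:
    print(row["url"], "|", row["title"])

cleaned_csv = to_csv(kept)
```

Keeping the flagged rows in a separate list, rather than deleting them immediately, gives the same sanity check as the tickbox idea: nothing is removed until someone has looked at the URL and title.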

JimKillock commented 5 years ago

I suspect the rules need to be iterated for different searches. Escorts are a common problem with localities, but might be problematic as a term if we were creating a second car sales search.

So I can see sense in having some temporary solutions, or being able to build some options for people.

alexhaydock commented 5 years ago

> I suspect the rules need to be iterated for different searches. Escorts are a common problem with localities, but might be problematic as a term if we were creating a second car sales search.

This is a fair point. In that case, yes, perhaps a temporary solution for the purposes of this report would be the most helpful. Largely I'm just looking to create lists as we have been doing already, by searching via keywords etc., but then be able to batch-remove commonly recurring adult sites that are being caught up in the lists due to the wide keyword approach I'm using. Manually removing all of the adult sites one by one from a list of 3,000 photography sites is proving incredibly time-consuming.

But also ideally this wouldn't be done blindly, so I'd want to be able to see at least the URL and title of each site that I'm removing from the list. Maybe the checkbox approach would be good for that.

My vision would be, if possible:

Hopefully that's not a massive amount of work, but if it is then the CSV export -> offline purge of entries -> CSV import approach might also work.

dantheta commented 5 years ago

Duplicates openrightsgroup/blocked-org-uk#136.

dantheta commented 5 years ago

The adult search terms are now configurable from the system settings menu in the control panel.

Terms can be added, enabled and disabled (enabled means that the term is excluded from searches).

dantheta commented 5 years ago

Although not officially supported or documented, it is possible to combine search terms in the keyword search, like photography* and not nude.

The backend system will automatically suffix a '*' on the search string, so as long as you plan for that (and don't use round brackets at the end of the search string) you should be able to use quite a bit of Elasticsearch's capabilities.
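As a rough illustration of the suffixing behaviour described above (a sketch only; the function names are hypothetical and the real backend code is not shown in this thread), here is how the final query string would change once a '*' is appended:

```python
def build_backend_query(user_input: str) -> str:
    """Mimic the described behaviour: a '*' is appended to the search string."""
    return user_input.strip() + "*"


def ends_with_round_bracket(user_input: str) -> bool:
    """Flag inputs where the appended '*' would land directly after a ')'."""
    return user_input.strip().endswith(")")


for text in ("photography* and not nude", "photography* and not (nude or adult)"):
    print(repr(build_backend_query(text)), ends_with_round_bracket(text))
```

In the first case the suffix simply turns the last term into a prefix match (`nude*`). In the second it produces `...adult)*`, which is the situation the advice about round brackets is meant to avoid.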

alexhaydock commented 5 years ago

> The adult search terms are now configurable from the system settings menu in the control panel.
>
> Terms can be added, enabled and disabled (enabled means that the term is excluded from searches).

Thanks! This is working well and has sped list-generation up considerably!

I'll continue to do housekeeping on the list behind the scenes as more sites crop up since it also has the knock-on effect of providing a better user experience for new users to the main site.

alexhaydock commented 5 years ago

Small question though. What do the adult exclusion terms actually get applied to? It seems like they're being applied to the metadata associated with the site, but perhaps not the URL itself?

Take for example this fairly unambiguously adult term that doesn't really leave much room for over-filtering:

https://www.blocked.org.uk/sites/hentai?exclude_adult=1

Even though I've directly put the actual word in the filter, we still get nearly 200 results for it, many of which actually have the word directly in the URL.

dantheta commented 5 years ago

The search index is created from whole words in the metadata, for the most part. I don't think it ranks substring matches very highly (trying to avoid the Scunthorpe problem).
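To make the whole-word versus substring distinction concrete, here is a small sketch using plain regular expressions. It is not how Elasticsearch analyses text, only an illustration of the trade-off; the function names and example strings are made up.

```python
import re


def substring_match(term: str, text: str) -> bool:
    """Match the term anywhere in the text, even inside a longer word."""
    return term.lower() in text.lower()


def whole_word_match(term: str, text: str) -> bool:
    """Match the term only when it appears as a separate word."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


# Substring matching over-blocks: an innocent place name is caught.
print(substring_match("sex", "Essex County Cricket Club"))
print(whole_word_match("sex", "Essex County Cricket Club"))

# Whole-word matching under-blocks: a term embedded in a hostname is missed.
print(whole_word_match("hentai", "free hentai gallery"))
print(whole_word_match("hentai", "hentaigallery.example"))
```

The last case mirrors the behaviour noticed above: a term that appears inside a longer run of letters in a URL is not a whole word, so a whole-word index will tend to miss it.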

There's also a problem with searching for a term which is on the exclude list - the query that is generated is along the lines of

hentai and (not hentai)

which doesn't give the search engine too much to work with.
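A sketch of how such a query could end up contradicting itself. The helper names are hypothetical and the query syntax simply copies the form quoted above; the real query builder is not shown in this thread.

```python
def with_exclusions(search: str, excluded_terms: list[str]) -> str:
    """Append one '(not term)' clause per enabled exclusion term."""
    parts = [search] + [f"(not {term})" for term in excluded_terms]
    return " and ".join(parts)


def strict_match(words: set[str], search: str, excluded_terms: list[str]) -> bool:
    """Strict boolean reading: the search word is present and no excluded word is."""
    return search in words and not any(term in words for term in excluded_terms)


query = with_exclusions("hentai", ["hentai"])
print(query)

# Under a strict boolean reading, no document can satisfy both halves.
print(strict_match({"hentai", "gallery"}, "hentai", ["hentai"]))
```

This sketch only models strict boolean logic. It does not attempt to explain why the live site still returned results for the example URL.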

The word stemming rules also don't work so well with non-English words.

alexhaydock commented 5 years ago

> trying to avoid the Scunthorpe problem

This makes a lot of sense. I suppose it's a big irony of the project that while compiling evidence to support our conclusions that filters are hard to do properly, I'm ending up experiencing all of the actual problems that make filters so hard to do properly.

I guess the only thing that can really be done for this is to manually sanitise lists for the report. Allowing me to customise the adult terms in the backend has already improved the experience quite a lot though, so thanks for resolving that. :)

dantheta commented 5 years ago

That's cool - thanks!

The search engine has quite a lot of power, but putting the site in front of it and offering users just a single text field hides a lot of the tunables. I also don't know Elasticsearch to expert level (just enough to put data in, do text queries and keep it running in < 1GB of memory), so there is probably more that can be done in the future.