Switching to upstream-style filters

ClearlyClaire commented 6 years ago

Upstream has recently introduced1 custom filters. Those can work both client-side or server-side.

We can mostly translate our existing filters to upstream ones, but not completely: there is no whole_word switch upstream, everything is matched without the whole-word thing.

I'm not too sure how to proceed. I think I'd want to take upstream's implementation but add the whole-word switch to not loose functionality? In this case, the merge commit will edit the DB migration script to migrate the existing glitch-only filters. Does that sound ok?

hannahwhy commented 6 years ago

Hmm, maybe. Tootsuite's implementation does some things I'm not too big on:

It uses a single regex to match; ours uses individual terms, which makes it possible to do things like count matches, provide the last time a keyword mute matched, and do different actions per-mute (e.g. drop from timeline, auto-CW)
AFAICT it runs on the status content, but not explicitly on hashtags
The tootsuite implementation checks the status text without stripping HTML; ours strips HTML while keeping things like URL hrefs, which is handy for filtering on URL components
The spec suite for tootsuite's implementation has only one addition in FeedManager's spec; ours covers many more possibilities

I admit the extra features afforded by the glitchsoc data structure are not implemented, but I don't want to close the door on them.

For these reasons, I'd prefer to keep the glitchsoc implementation, extending it to cover what upstream's does. (I am, however, also biased, since I wrote the thing)

ClearlyClaire commented 6 years ago

Hm, those are fair points, but I think at least some of those (e.g. striping HTML or having more specs) would benefit upstream as well, and I'd prefer not deviating too much from upstream if possible.

Afaict, the things upstream does differently is have more scopes (not only notifications and everything else, but different scopes for Home, Public, and threads), as well as having client-side filtering. We could extend the existing glitch-soc filtering to that, but we would have to modify both flavors to avoid discrepancies between client-side and server-side filtering (striping HTML etc.)

hannahwhy commented 6 years ago

Yeah, I think it'd be fine to PR the differences in the glitch-soc implementation, though it's something I haven't looked much into just yet.

(The two implementations should be independent except for where they both have to hook into FeedManager, so while it is funky to have two implementations of the same thing in a codebase, it should be possible to only use one of them and not be negatively affected by the presence of the other one. It would be nice to eventually have one implementation, of course, but this shouldn't block a merge with upstream.)

ClearlyClaire commented 6 years ago

It's not that easy. If we just select one in feed_manager, we will still have both showing up in the settings, which will be extremely confusing to users. If we chose upstream's, then everything else is fine. If we chose ours, not so much in the vanilla UI. We could use both which is slightly less confusing, but eh…

hannahwhy commented 6 years ago

Oh, right, I forgot about the multiple-flavour thing.

It might be best to just take upstream's implementation. I would like to continue work on the glitchsoc approach, but I've not been motivated to do so -- it didn't seem like very many people (if anyone) was using it. However, I can do that in a private branch.

(Feature flags would be ideal for this situation, but we do not have a feature flag framework in place. For that reason, if this approach is taken, I ask that the glitchsoc keyword mute code remain in place, with the UI hooks for it removed or dummied out. That'd minimize the amount of work I have to do to bring them online again.)

ClearlyClaire commented 6 years ago

Yeah, taking upstream's implementation would be easiest, but we will lose whole-word matching or have to modify both the backend and the front-ends for this to work… as for the other discrepencies, I'm wondering: what good does it do to filter explicitly on hashtags? The only hashtags displayed by Mastodon are those appearing in the contents of the toot.

hannahwhy commented 6 years ago

Hashtags are a separate entity in Mastodon's data model, so it made sense to me to explicitly filter on those. It seems to currently be the case that Status#tags is populated from status content via ProcessHashtagsService, but I didn't want to rely on that connection.

nightpool commented 6 years ago

remote posts may have hashtags that don't show up in the status content, but i think it would be confusing to users if posts were hidden for that reason (it would be impossible to tell in the mastodon front end why it was happening, for example)

ClearlyClaire commented 6 years ago

Here is what I believe glitch-soc does which would not be preserved by switching to upstream's implementation:

Allow a whole-word ~~~case-insensitive~~~ match mode
Strip HTML before searching, using Markdown notation for links
Match hashtags separately
Single VS multiple regexps

As of now, I consider 4. an implementation detail. 2 and 3 are more tricky, I'm open to adopt upstream's behavior but I'm not completely sure about the implications. I think this won't change anything for “reasonable” phrases and posts. But remote posts could use markup to mess with filtering I guess.

Losing 1 would be a real regression, though. That's easy to fix server-side, but it would need to be implemented client-side too for consistency. I'm not sure how easy that is, and it would require changes in the vanilla flavor as well.

Gargron commented 6 years ago

FeedManager is a performance hotspot so the less work you have to do in it, the better, which is why I am combining everything into one regex instead of multiple. (On the client it doesn't matter but consistency is not bad)

Not stripping HTML is very likely a problem in terms of people accidentally matching span tags or br tags or something like that they shouldn't be thinking about, but again, stripping HTML adds perf overhead. (But might be unavoidable)

Some people have asked for a regex option (where they enter the regex themselves) but I am hesitant about actually allowing people to specify a regex to run on the server-side, because you can do a lot of harm with it with super expensive matchers.

However, a whole-word mode sounds nice, although I was wondering if it wouldn't be better to have it as the only mode because do people want to match "ass" in "assassin" pretty much ever?

ClearlyClaire commented 6 years ago

@Gargron Yeah, regexps were also considered but dropped for glitch-soc (#235). As for whole-word matching, hm… there is the case of hashtags for instance, but yeah, non-whole-word matching would probably match a lot more things. ~~~In glitch-soc, whole-word matching is also case-insensitive, but that may be an oversight.~~~

EDIT: Actually misread the code as far as case-sensivity is concerned.

hannahwhy commented 6 years ago

One place non-whole-word matching comes in handy is to match e.g. "pol" in "uspol"; I've used that a few times.

I haven't measured the overhead of word matching; that's something I've been meaning to do (but see above for why I haven't). Keyword matching does need to be fast and lean, but without the numbers my plan was to implement everything I wanted first (counters, per-term behaviors) and then see what was most expensive, as that is usually surprising.

Keyword matching (both whole-word and non-whole-word) is intended to be case-insensitive in glitch-soc; there's an example that demonstrates that (spec/modules/glitch/keyword_mute_spec.rb:62).

nightpool commented 6 years ago

per-word behaviors doesn't conflict with compiling into a single regex for performance. You just do a secondary match to determine why the first match was triggered. it's orders of magnitude less expensive

On Thu, Jul 5, 2018 at 5:42 PM David Yip notifications@github.com wrote:

One place non-whole-word matching comes in handy is to match e.g. "pol" in "uspol"; I've used that a few times.

I haven't measured the overhead of word matching; that's something I've been meaning to do (but see above for why I haven't). Keyword matching does need to be fast and lean, but without the numbers my plan was to implement everything I wanted first (counters, per-term behaviors) and then see what was most expensive, as that is usually surprising.

Keyword matching (both whole-word and non-whole-word) is intended to be case-insensitive in glitch-soc; there's an example that demonstrates that ( spec/modules/glitch/keyword_mute_spec.rb:62).

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/glitch-soc/mastodon/issues/555#issuecomment-402861132, or mute the thread https://github.com/notifications/unsubscribe-auth/AAORV74jW5iR8zCmJHdpdKemYVdhdinDks5uDog_gaJpZM4VEQHz .

hannahwhy commented 6 years ago

Could be. The only point I was making was that I haven't measured anything.

ClearlyClaire commented 6 years ago

You're right about everything being case-sensitive in glitch-soc, I misread it. So, my plan is to add a whole-word matching option upstream, then merge upstream and migrate existing keyword filters to upstream's feature.

There will still be some behavior changes (hashtags, HTML stripping) with the previous glitch-soc implementation, but I think they are pretty minor. Everything else should be easily migrated.

Are you ok with this plan, @yipdw?

EDIT: In the meantime, upstream has been updated to be whole-word matching only… so this migration plan isn't really applicable… maybe add the option to not match whole-word then…?

Gargron commented 6 years ago

I want to understand, is there a real use for not matching whole words? Seems to me like it would lead to a lot of false positives. I understand matching "pol" in "uspol" but you would also match poland, polarbears, polarity, etc, I don't think a mistake like that should be easy to make using the filters. Seems like we should encourage people to just explicitly mute "uspol" instead.

hannahwhy commented 6 years ago

Well, it worked for me.

The impact of a false positive can be mitigated when experimentation is cheap, i.e.

the interface displays the input that triggered filtering
mutes can be quickly added and reverted

This kicks off a feedback loop in which someone can fine-tune their filter settings to achieve the results they want. For this to be practical, the feedback loop must operate on an interactive timescale; this may not be possible now but it seems like it ought to be.

I gather it's important that this move forward. I have a private branch with my original mute code; I don't know if I'll get back to it, but the model does interest me. Given that, I guess it's okay for the glitch-soc code to be deleted.

hannahwhy commented 6 years ago

For a concrete answer to non-whole-word stuff:

As far as I know, I'm the only one who ever used glitch-soc's mutes. (There may be others, but I never found them.) Given that, the feature had utility, but only for one person; therefore, if it turns out it's easier to just drop the option, I guess that's the way it'll have to go.

Gargron commented 6 years ago

17 people are using filters on m.s., it's a lot less popular than I anticipated given how requested the feature seemed to be. (Also I'm noticing people are confusing it with domain blocks, uhh, I don't know how to deal with that one)

ClearlyClaire commented 6 years ago

There is something else I'm worried about: upstream has decided to completely delete the old regexp-based filtering thing, and that's understandable as the new feature has approximately the same scope and is much easier to use. However, as far as I can tell, upstream did not give users a chance to review their existing filters, and I would want to do that for glitch-soc (although I have no idea how).

hannahwhy commented 6 years ago

I guess the existing regex filters could be left alone (I don't think they need to be removed?) and a deprecation notice could be sent out, e.g. as a local-only toot.

(The existing regex filters can be retrieved via Web::Setting, but I don't know how much good that would do)

ClearlyClaire commented 6 years ago

They would clutter the interface for no good reason. That being said, we can postpone their deletion to a later date, giving us time to think of a better migration path for them.

In the meantime, I have written the migration scripts, so I'm almost ready for a merge (the glitch-soc flavour won't support client-side filtering just yet though, but I think I can implement that pretty quickly).

hannahwhy commented 6 years ago

Maintaining user expectations seems like a good enough reason to keep the regex filters around; automatic conversion would probably work for some regexes, but certainly not all.

ClearlyClaire commented 6 years ago

This has been fixed in 402da8065c2b378cca6361f2c7252bd766f25dde (and follow-up commits)

glitch-soc / mastodon

Switching to upstream-style filters #555