ushahidi / SMSSync

SMS gateway for Android powered phones
http://smssync.ushahidi.com
GNU Lesser General Public License v3.0
1.14k stars 492 forks source link

regular expressions for keyword filtering #22

Closed swoertz closed 11 years ago

swoertz commented 12 years ago

it would be nice if you could filter sms messages by regular expressions

eyedol commented 12 years ago

Don't have the bandwidth to implement this feature. I'm leaving it open. Any community member who is willing to implement this feature should let me know. I'll humbly merge any pull request that comes my way on that front.

olliebennett commented 11 years ago

I am playing with implementing this at the moment. My thoughts are:

I propose an alternative, whereby entering regex(YOUR_REGEX_HERE) in the keywords field would cause the filtering to be processed via regular expressions rather than comma-separated keywords.

What are your thoughts on this? I will submit a pull request once I have this implemented and tested, so you can decide whether it's acceptable for inclusion! :)

Notes:

olliebennett commented 11 years ago

What I've got so far. All changes are within ProcessSMS.java, and there's nothing complicated.

Introduced dependency on java.utils.regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Changed name + function call:

Replaced...

String keywords[] = syncUrl.getKeywords().split(",");
if (filterByKeywords(messagesBody, keywords)) {

... with ...

String filterText = syncUrl.getKeywords();
if (matchesFilterText(messagesBody, filterText)) {

Then, changed the function itself by replacing...

public boolean filterByKeywords(String message, String[] keywords) {

... with ...

public boolean matchesFilterText(String message, String filterText) {

    // Handle Regular Expression, if specified by "regex(XYZ)".
    if (filterText.toLowerCase().startsWith("regex(") && filterText.toLowerCase().endsWith(")")) {
        String regEx = filterText.substring(6, filterText.length()-1);
        Pattern p = Pattern.compile(regEx); // TODO - test RegEx validity here?
        Matcher m = p.matcher(message);
        if (m.find()) {
           return true;
        }
        return false;
    }

    // Handle the filter text as comma-separated keywords
    String[] keywords = filterText.split(",");

Feel free to take this, as I don't have more time at the moment - it should work but it's not yet properly tested.

swoertz commented 11 years ago

I would suggest to make a checkbox (regular expression - true, false) instead of implementing it as a hidden feature. Persons who are not familiar either not check the box or look up what it means. All the others can simple use the feature. Further I would also suggest to make a checkbox for case-sensitivity.

olliebennett commented 11 years ago

Okay - thanks for sharing your thoughts. I agree this would be a nice addition, but my concerns remain about making the UI too cluttered, and adding complexity.

Have a look at a mock-up of these new options:

regex-ui-problem

I think you'll agree, it's getting a bit complicated, and there's no longer a nice way to give instructions on comma-separated keywords (currently, the keywords field says: "Enter keyword Eg. Fire, Crime, Help", which is clear and consise).

I'm thinking of a more useful position for the extra options. I think, at least, this feature would require a new full-page screen, that could be used when creating or editing sync URLs.

In my code above, I'm just being lazy, to be honest, and implementing the feature with the fewest changes possible! ;)

I do think your case-sensitivity option is important too. I had considered working this into my "secret" regex activation code - something like regex(YOUR_REGEX)params(CASE_INSENSITIVE) - but had ignored the idea as this just makes it very confusing. With the UI changes, it'd be a great idea. :)

Let's see what @eyedol etc. think?

swoertz commented 11 years ago

An other possibility would be, making two input fields and a checkbox for each to (de)activate (grey out the textfields) the filter mechanisms. Of course this makes the UI more complicated but in my oppinion the benefits of the additional functionality outweighs the increased complexitiy. It would also be possible to omit the checkboxes and check if the input fields are empty or not...

I would further process the message if one of the mechanisms matches (LOGICAL OR).

[x] key-word filtering
+----------------------+
| e.g. crime, fire...  |
+----------------------+
[x] regex filtering
+----------------------+
| [a-z].*              |
+----------------------+
eyedol commented 11 years ago

Perhaps we could use a label saying you can enter regex or CSV to filter for keywords instead of the checkboxes. I agree with @olliebennett the ui can be a bit messy when we add the checkboxes

swoertz commented 11 years ago

I agree with you, but we have to distinguish between keywords and regexs. The 3 ways to do that we mentioned so far are:

olliebennett commented 11 years ago

Of those three options, I think @swoertz's ASCII layout above is best.

I think the lack of case sensitivity toggling is a fair UX compromise, but there are two options for treating the input:

Which do you prefer? My use case for RegEx, would be to case-insensitively match messages that start with a keyword, for instance ^keyword.*$.

@eyedol, is there a fool-proof way of detecting whether the list is a RegEx, or a comma-separated list of keywords?

swoertz commented 11 years ago

is there a fool-proof way of detecting whether the list is a RegEx, or a comma-separated list of keywords?

In general there is no way, I can think of. Because every keyword is a valid regex and vice-versa if I take the regex literally I can use it as keyword. If we allow only [a-zA-Z]* for example for keywords we could detect the special characters like "^$*" and so on, but I think this is confusing.

eyedol commented 11 years ago

@olliebennett no, there is no fool-proof way of detecting whether the list is a RegEx or CSV. I think we should just make the keyword input filed process RegEx. So as a user when I enter, heat,sun,cold,yam as keywords, SMSSync should be able to match them and when I enter ^keyword.*$ it should also be able to match that as well.

olliebennett commented 11 years ago

Okay. I guess we could try use a RegEx of our own to detect whether the filter text was a valid CSV (and if not, treat as RegEx).

This (realistically) means enforcing some limits on the acceptable keywords used in the CSV list. I propose limiting these keywords to be alpha-numeric, with spaces permitted both surrounding the comma (test, keyword) and also within the keyword itself (test keyword,and another).

Are we happy to place these limits on the acceptable keywords? And what should the limits be? I think those I've listed are sufficient and reasonable.

See http://refiddle.com/by/olliebennett/csv-or-regex for a first attempt that seems to work

It would be nice to simply treat everything as a regex, but to use @eyedol's example of heat,sun,cold,yam, we'd need to convert this to (heat|sun|cold|yam) (or something like that) in order to make it have the same effect as the keywords would suggest.

If this is okay with you guys, the change should be trivial, and I'll update the code (and relevant documentation) and submit a pull request.

assuming RegEx in Java behaves in the same way as RegEx in JavaScript...

swoertz commented 11 years ago

This is a nice solution, but would not work for e.g. hashtags as keywords (example: #foo, #bar). Of course if one wants to filter by these tags (#foo)|(#bar) would work. There are other examples beside hash-tags as well. How would you document this, so that the user can recoginze the difference between keywords and reg. expressions while using the app? Many applications which are able to work with regex and "normal" strings use a checkbox (e.g. search-function in text-editors).

olliebennett commented 11 years ago

This is a tricky one!

I agree - there's a problem of letting the user know that what they think are normal keywords are in fact going to be treated as a (probably useless or even invalid) RegEx.

Perhaps we could add a confirmation box that pops up when saving the Sync URL settings, which would only appear if a special character has been entered, saying:

You have entered special characters in the filter field. This will be will now be treated as a Regular Expression. | OK |

The documentation would need to state that:

Keywords can only be made of alpha-numeric characters (and spaces). For any more advanced tags, you will need to use Regular Expressions.

There would also need to be a couple of common RegEx examples, such as "starts with X", or similar.

It's my opinion that in this case, creating extra options in the UI adds more confusion than it does value.

Another option - as an alternative to using regex(abc) - is to require the use of ^ and $ at the start and end, respectively (although that still has problems).

There's always the (inefficient) workaround of handling the RegEx matching on the server, and only taking any action if a match is found. If it doesn't match, simply return success: "true" and then both systems can move on. This of course removes the need for RegEx support altogether (but it's not a nice solution!).

at this stage, we could also check the validity of the RegEx.

swoertz commented 11 years ago

There are so many combinations which are not alphanumeric, which would lead to the message "You have entered special characters...".

The question is what is more confusing, the checkbox or the info-message when one enters "special" characters. If one is not familiar with regex's the default value of the checkbox (FALSE) is just fine. Further I think most of the smssync users are developers anyway (or is that a wrong assumption?).

How should a user who can't handle regex's react on the info message?

eyedol commented 11 years ago

@olliebennett what's the status of this? If you send me a pull request, I'm happy to merge after a code review

olliebennett commented 11 years ago

I'm actually using my own modified version with RegEx in place of keywords as standard. Adding the actual RegEx capability was trivial, however adding a configuration option proved more complicated (creating database info, UI elements, loading/saving of existing configuration etc.) so the version I use has no changes to the UI. :( Sorry

Seem to have lost the code, but can provide the apk if anyone needs - just email.

I'll have another go at implementing it sometime, perhaps.

mandric commented 11 years ago

Maybe provide a branch if it's not too much hassle?

olliebennett commented 11 years ago

I've slightly changed the way this works for me, by first checking whether the filter text matches a keyword, and if not, checking if it matches when treated as a RegEx. I know there could be issues with this approach, but can't think of any that would interfere with the typical, simple use cases.

I'm not comfortable enough with this code to throw a pull request yet, but I'll do some more tests, and consider bulking it out a bit, if needed.

Check out my feature-regex branch and see what you think. If anyone's got any input, please reply here.

eyedol commented 11 years ago

We could document on SMSSync website as, for simple filtering, use CSV as the keywords and for complex filtering, consider entering a valid RegEx in the filter by input field.

olliebennett commented 11 years ago

Agreed. The only possible problem that I foresee is if someone sends a message containing the RegEx itself. This would match as a 'keyword'.

Do you think we should change the 'enter keywords eg fire, help' placeholder text to something mentioning regex too? Or just document it online?

eyedol commented 11 years ago

Yeah, we could make the placeholder text mention RegEx too