Add a basic tokenizer for log messages

willb commented 8 years ago

@erikerlandson if you can give this a quick look I'll merge after your lgtm (thanks!)

erikerlandson commented 8 years ago

My comments below notwithstanding, this LGTM

A few comments:

I'm wondering why leading, trailing, and rejected punctuation are defined separately, since they are all just 'characters that are stripped'.
Not necessary, but it might be nice to support user-configured punctuation to reject, provided it is clean to do and not too much work.
I'd recommend that the scaladoc for tokens include a short end-to-end description of what happens. Something like: "The log message is split into tokens separated by whitespace; anything that isn't alphanumeric, _, or - is stripped; then the function post is applied to each token; then any token not containing at least one letter is filtered out; then pred is applied as a final filter."
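
For reference, here is a minimal sketch of the pipeline that description spells out. It is not the code in this PR: the object name LogTokenizerSketch, the regexes, and the default arguments are assumptions made for illustration, and it strips rejected characters anywhere in a token rather than treating leading and trailing punctuation separately.

```scala
object LogTokenizerSketch {
  // Assumption: everything except ASCII letters, digits, '_' and '-' is stripped.
  private val rejected = "[^A-Za-z0-9_-]".r
  // A token must contain at least one letter to survive.
  private val hasLetter = "[A-Za-z]".r

  /** Splits msg on whitespace, strips rejected characters, applies post,
    * drops tokens without a letter, then applies pred as a final filter.
    */
  def tokens(
      msg: String,
      post: String => String = (s: String) => s,
      pred: String => Boolean = (_: String) => true): Seq[String] =
    msg.split("\\s+").toSeq
      .map(tok => rejected.replaceAllIn(tok, "")) // strip punctuation
      .map(post)                                  // user-supplied transform
      .filter(tok => hasLetter.findFirstIn(tok).isDefined)
      .filter(pred)                               // user-supplied final filter
}
```

With these assumptions, LogTokenizerSketch.tokens("ERROR: disk_usage at 91% on node-7 (pid 4242)") yields Seq("ERROR", "disk_usage", "at", "on", "node-7", "pid"): the colon, percent sign, and parentheses are stripped, and the purely numeric tokens are dropped.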

erikerlandson commented 8 years ago

Oh, one more: it might be good to add a unit test that exercises the various filters all at the same time, e.g. tokens that include both punctuation that should be stripped and punctuation that should be kept. Technically, post and pred should probably be unit tested as well.
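
A rough sketch of what such a combined check could look like, using plain assert so it needs no test framework. The tokens function here is a hypothetical stand-in written from the end-to-end description earlier in this thread, not the implementation under review, and the sample message is made up.

```scala
object TokenizerFilterChecks {
  // Hypothetical stand-in for the tokenizer under review.
  def tokens(
      msg: String,
      post: String => String = (s: String) => s,
      pred: String => Boolean = (_: String) => true): Seq[String] =
    msg.split("\\s+").toSeq
      .map(tok => tok.replaceAll("[^A-Za-z0-9_-]", ""))
      .map(post)
      .filter(tok => tok.exists(_.isLetter))
      .filter(pred)

  // Mixes punctuation that should be stripped ( ) = : ! % with
  // punctuation that should be kept _ and -, plus a numeric-only token.
  val msg = "(pid=4242) user_id: alice-01!! 100%"

  def main(args: Array[String]): Unit = {
    // All filters at once: stripped, kept, and dropped tokens in one message.
    assert(tokens(msg) == Seq("pid4242", "user_id", "alice-01"))

    // post is applied to every surviving token.
    assert(tokens(msg, post = (s: String) => s.toUpperCase) ==
      Seq("PID4242", "USER_ID", "ALICE-01"))

    // post runs before the letter check: removing all letters empties the result.
    assert(tokens(msg, post = (s: String) => s.filter(_.isDigit)).isEmpty)

    // pred is the final filter.
    assert(tokens(msg, pred = (s: String) => s.contains("_")) == Seq("user_id"))
  }
}
```

The third assertion is the one that pins down ordering: it only passes if post is applied before tokens without a letter are filtered out.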

erikerlandson commented 8 years ago

LGTM!

willb commented 8 years ago

@erikerlandson Thanks!

radanalyticsio / silex

Add a basic tokenizer for log messages #39