newsboat / newsboat

An RSS/Atom feed reader for text terminals
https://newsboat.org/
MIT License
3k stars 215 forks source link

Port FilterParser and Matcher to Rust #631

Open Minoru opened 5 years ago

Minoru commented 5 years ago

Newsboat can show and hide items that match certain expressions, like author == "Mark Twain" or title =~ "C++". These expressions are parsed with FilterParser class, and Matcher performs the check.

FilterParser is currently implemented using Coco/R. The syntax of filter expressions is described by filter/filter.atg (docs for the format are in section 1.2 of Coco's user manual). I believe it can be re-implemented with nom crate, which we already use for a number of tasks; see rust/libnewsboat/src/fmtstrformatter/parser.rs for inspiration.

Note that FilterParser's test coverage is a bit incomplete; I'd like to see Parser::Errors::SynErr() tested as well, because we probably use it through FilterParser::get_error() (but you should check and prove me wrong if you can). The tests are in test/filterparser.cpp.

Normally I'd ask to port one class at a time, but in this case it seems summarily faster and easier to simultaneously port Matcher as well. Matcher walks the tree built by FilterParser, and even though it's possible to do this over FFI, it requires writing a lot of bindings; for each of six node types, we'll need two functions. If we port Matcher as well, we can use Matchable over FFI, which is only two functions.

Matcher needs one more test, to trigger this error condition.

If all of the above doesn't sound intimidating enough, here's one more constraint: we need to keep compatibility with current regex syntax, i.e. title =~ "regex" should keep working. I've no idea what syntax we use at the moment, though.

Rough TODO for the implementer (comment here if something's unclear or missing):

tsipinakis commented 5 years ago

we need to keep compatibility with current regex syntax, i.e. title =~ "regex" should keep working. I've no idea what syntax we use at the moment, though.

We're using the POSIX extended regex, that's inconvenient because it looks like rusts regex crate syntax is: "similar to Perl-style regular expressions [...] Much of the syntax and implementation is inspired by RE2.".

I couldn't find anything use-able in rust that implements posix regex. I guess that leaves us to make our own wrapper around reg{comp,exec,error,free}.

Minoru commented 4 years ago

Spent about an hour fiddling with a test for Parser::Errors::SynErr(). My takeway is that it's pointless. A few errors can be triggered in an obvious way, e.g. "closeblock expected" is triggered by (a = "a". Other errors seem to be less obviously connected to the input: e.g. a = "a" xor results in "EOF expected", not "invalid logop" as one might expect. Other errors, like "\"between\" expected" and "??? expected", seem to be some internal stuff, which I have no idea how to trigger.

This means it's not a requirement for the new filter parser to report errors exactly the same way as the existing parser does.

pickfire commented 4 years ago

@Minoru What's next?

Minoru commented 4 years ago

@pickfire The next, and the last, thing is to finish #701. I'm still working on it.

Minoru commented 4 years ago

As my comments in #701 show, existing filter parser and matcher both have interesting corner cases, and I took this rewrite as an opportunity to fix them. Beyond any doubt, the new implementation has its own corner cases, and also bugs. Thus, it would be best if we don't just switch over to the new stuff; instead, there should be some grace period when two implementations co-exist, and their results are compared. Any discrepancies in the results should be reported to us and either fixed or explained. The question is: how do we do it?

We need a solution that:

  1. forces or, at least, strongly encourages users to report problems;
  2. reports a minimum amount of information, so that users are comfortable sharing it, and we aren't overwhelmed with irrelevant or repeating stuff.

@tsipinakis, @noctux, and me discussed this on IRC a while back, and came up with the following ideas:

I had some counter-points to each, but re-thinking it now, I came up with the following:

  1. Matcher class keeps a static std::set<std::string> which is a set of errors. Each error mentions its kind and the filter expression that caused it. There are two possible kinds: Rust and C++ disagreed on compilation (one compiled, the other errored), or Rust and C++ disagreed on the result (one thinks it matches, the other thinks it doesn't)

    The size of that std::set is bounded by how many different filter expression a user has. It's virtually impossible for someone to have so many expressions that errors stop fitting into memory.

  2. If the set of errors in not empty when the program finishes:

    1. write errors to a temporary file
    2. after hiding Newsboat's UI, print out a message to the user asking to submit a bug with the temporary file attached
    3. wait for Enter or EOF

With that, the overall plan for this issue is as follows:

  1. publish Rust implementation of parser and matcher, but make its C++ bindings separate from existing FilterParser and Matcher. Call them RustFilterParser and RustMatcher perhaps?
  2. implement the idea outlined above
  3. make it trust the old parser/matcher by default, i.e. on errors, return what the old parser/matcher returned
  4. make at least one scheduled release. Wait for bug reports, fix/explain them. In the meantime, also work on making the error messages friendly to the users
  5. make it trust the new parser/matcher by default. Users who were ignoring error messages are now forced to report bugs, since Newsboat doesn't evaluate filter expressions the way the users want
  6. make at least one scheduled release. Wait for bug reports, fix/explain them
  7. drop C++ implementation altogether

Stages 4 and 6 might span multiple releases if we can't keep up with the bug reports.

Anyone seeing any holes in this?

Minoru commented 4 years ago

wait for Enter or EOF

Or call libc::isatty() on stdin to see if there's any point in asking for confirmation.

Minoru commented 4 years ago

Steps 4 and 5 also give us time to work on error messages. For my initial PR, the only error-related requirement is for the parser to report existence of errors. To make Rust parser the default one, we'll have to make its messages useful to the users.

Minoru commented 4 years ago

After a bit more thinking and a prod from @AnthonyMikh, I realized that we can use fuzzing and property testing to compare parsers/matchers! This is much more attractive approach: it offers faster feedback; it doesn't inconvenience users; and finally, it's personally interesting to me :)

  1. Use fuzzing to check that parsers react the same way for each possible input. Since our new parser doesn't (or, at least, shouldn't) reject any inputs that the previous one didn't, we should always see both parsers succeeding or failing simultaneously.

    We can't directly compare the resulting parse trees, though, because the new parser "remembers" types of RHS arguments, so e.g. age > 007 effectively becomes age > 7 in the new parser, while the old one stores "007" as-is. I think we could use the information from the new parser's tree to convert the info of the old tree into an appropriately typed value, and compare that.

  2. Use QuickCheck-style property testing to generate well-formed filter expressions and fill a MockMatchable with some data, then test that both matchers return the same result.

    But wait, won't that give us a ton of false positives as the suite finds all the things we intentionally broke, like comparing a string to a number? Well, here's the trick: we'll introduce new errors for that. Right now, MatcherError only contains two possible errors, but we can add "type mismatch" to that list (we'll need it anyway, I just thought I'd add it after comparing the implementation). The test will check for that error, and if it pops up, it won't fail because it's clearly an intended behaviour of new parser.

Minoru commented 4 years ago

900 is merged now, so I updated the TODO in the top post. Two fronts are open right now:

  1. better error messages from FilterParser. This task is pure Rust. Involves figuring out nom's VerboseParseError type and how to translate it into a useful message like "found =!, expected one of: =, ==, !=, =~, !~".
  2. prevent panics on expressions like title = 42. As mentioned in #900, current Rust Matcher assumes that the expression is well-formed and the user is not trying to compare incompatible types (like string and number in title = 42). We can't make this assumption in real life, so we have to:
    1. extend C++ test suite to check for this situations
    2. port these new tests to Rust
    3. introduce a new MatcherError called "type mismatch" or something;
    4. make the aforementioned tests pass (though I expect C++ version to return some "default" value instead of "type mismatch", so tests will have to be updated a bit)

I'm still busy investigating #232, but I do plan to get back to this eventually. But I won't mind if in the meantime someone works on any of this as well :)

Minoru commented 4 years ago

Upon re-reading this issue, it becomes clear to me that it was a mistake to make the new parser store numbers as integers. We should've mimicked the old parser's behaviour, and store them as strings. That would make the results directly comparable, making it almost trivial to check for regressions. I am now working on a PR that undoes this mistake.

I still believe in everything I wrote in https://github.com/newsboat/newsboat/pull/701#issuecomment-593002181, but I think the parser is the wrong place to introduce the types. We should start from the other end, i.e. Matchable, because that allows us to make the change in smaller steps without breaking anything. I filed #1053 about this.

pickfire commented 4 years ago

make Rust FilterParser log how long it took to parse the filter make Rust Matcher log how long it took to check for match

Looks like some sort of performance log, I believe I can look into this. (it may take some time before I have some time to work on this)

Minoru commented 4 years ago

Don't reserve it if you aren't going to work on it just yet ;)