Use angular's parser to analyze expression instead of Regex

yankee42 commented 9 years ago

The extractor currently finds filter expressions using regex. This has several drawbacks:

It leads to false positive (e.g. {{"Hello"|lowercase}} {{"Second"|translate}} will cause Hello"|lowercase}} {{"Second to be found as translatable string
It does not recognize expressions expressions if they are slightly more complex. E.g. {{ foo ? ('foo'|translate) : ('bar'|translate)}} will go unnoticed
It is hard to read. The regex currently says new RegExp(start + '\\s*(\'|"|"|')(.*?)\\1\\s*\\|\\s*translate\\s*(' + end + '|\\|)', 'g'); which is hard to read even though it is still pretty simple. It will get much harder when it is extended with further options such as contexts and comments or when trying to do something about the previously mentioned problems.

Thus I explored other options and had a look at how angular is parsing these expressions. And they are using a small recursive descent parser with the manageable amount of 16 node types. Using their parser would put us in a state where there will be no trouble with comparing syntaxes between gettext's parsers and the original.

Unfortunately the angular guys did not seem to anticipate that someone else but them will use their parser. But with a little bit of hacking it is possible. I created a proof of concept here: https://github.com/yankee42/angular-gettext-tools/blob/extract-using-angular-ast/test/parse.js

It is actually quite easy except of the hacks I needed so that angular runs on nodejs without browser and the hack I needed to get hold of AST and Lexer. I am no nodejs expert so maybe someone can come up with a better idea? Possibly angular's parse.js file could be used stand alone, but this requires figuring out which files are all dependencies and need to be included as well. My current approach at least has the advantage that the parser can be kept up to date with npm.

If you think that this is interesting I'd try to actually integrate this into gettext.

kvantetore commented 9 years ago

I find this approach very interesting. The current regex approach is very limited, and most of the advantage of angular-gettext is the ability to automatically extract translation strings, so improving that aspect seems very important to me.

It's a pretty funky hack to get access to the angular parser. Glancing on parse.js, it looks like it's only dependent on $filter, and even that looks like its using $filter mostly to determine if an expression is constant or not (which we don't really need). So I guess it should be readily mockable.

kvantetore commented 9 years ago

In kvantetore/angular-gettext-tools@7e5e5200e06b9a24fb33206f8f6eea7135dbaec2, I've changed the parser from regex based to the angular expression parser based on the work by @yankee42. I can format a proper pull request if there is interest in folding into the mainline.

Instead of looking for interpolation expressions globally in the html template, the new approach is to parse the attributes and text-tags of each node separately. This has caused a minor regression in the test suite, namely in "extract/Extracts strings from non-delimited attribute", the line number of the second string is reported to be on line 3 instead of 5

rubenv / angular-gettext-tools

Use angular's parser to analyze expression instead of Regex #105