vitaly-t / decomment

:hammer: Removes comments from JSON, JavaScript, CSS, HTML, etc.

options #1

Closed RnbWd closed 8 years ago

RnbWd commented 8 years ago

I can help with the gulp repo, and I think it should potentially merge with the gulp-strip-comments plugin. If we could add options for turning on/off block and inline comments - that'd provide full compatibility with the repo that already exists, but it's not that important.

vitaly-t commented 8 years ago

The priority right now is to cover 100% of regular-expression cases. In all fairness, I don't know yet if that's even possible. Certain specific cases are incredibly complicated, because they offer no unambiguous parsing logic.

My implementation covers at least 99% of all cases, but the remaining 1% is incredibly difficult.

Here's an example:

func(1/2, /"some/*text"/, 3/4)

I am hoping to nail it later on, somehow, not sure how yet...

On the positive side though, the strip-comments library doesn't parse regular expressions at all, and fails on almost all of them. Not sure you'll like knowing that, but it's a fact :) And this is one of the reasons I wrote my own library.
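To make the ambiguity concrete, here is a simpler (and syntactically valid) illustration - a hypothetical sketch, not code from decomment: the "/" character can open a regex literal, act as division, or, doubled, start a comment, and only the surrounding context tells these apart.

```javascript
// The "//" inside this regex literal is part of the pattern, not a comment;
// a naive stripper that cuts every line at "//" would break this statement.
const re = /https?:\/\//; // matches "http://" or "https://"

// Here the very same character is plain division.
const half = 1 / 2;

console.log(re.test('https://example.com'), half); // → true 0.5
```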

Also see: Why I wrote it

vitaly-t commented 8 years ago

I've done tons of work in this area, only to come to a terrifying conclusion - this cannot be done without full expression evaluation, which implies the use of a third-party parser, which wouldn't be welcome. But I don't see what else there is...

Certain regular expressions are simply impossible to parse without evaluating and parsing the whole line.

RnbWd commented 8 years ago

Honestly I just use babel where it's trivial to write a plugin to strip comments:

https://github.com/thejameskyle/babel-plugin-remove-comments

export default function ({Plugin, types: t}) {
  return new Plugin('remove-comments', {
    visitor: {
      Program(node, parent, path, file) {
        file.ast.comments = [];
      },
      enter(node) {
        node.leadingComments = null;
        node.trailingComments = null;
      }
    }
  });
}

I know there are limitations with regular expressions, and I ran into problems with the strip-comments repo before I even published gulp-strip-comments. But it did what I needed it to do at the time, and that's all that mattered. I don't even use the gulp plugin - I forgot it existed - and fast-forward 6-8 months, there are 5000 downloads a month on npm.

Here's my philosophy on the subject - I know regexes won't cover 100% of edge cases, but I don't want to bring in esprima like strip-comments did recently and bloat the library. I want it to be as simple and lightweight as possible - hence regex.
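For reference, a regex-based stripper in that spirit can be tiny. This is a hypothetical sketch, not gulp-strip-comments' actual code, and it shows both the appeal and the flaw: a few lines, dependency-free, but it will mangle string or regex literals that contain comment markers (the edge cases discussed above).

```javascript
// Naive two-pass regex stripper: simple, but unsafe on literals
// that happen to contain "/*" or "//".
function stripComments(code) {
  return code
    .replace(/\/\*[\s\S]*?\*\//g, '') // block comments, non-greedy
    .replace(/\/\/.*$/gm, '');        // line comments, per line
}

console.log(stripComments('var a = 1; /* note */ var b = 2; // tail'));
// → "var a = 1;  var b = 2; "
```

The failure mode is easy to trigger: `stripComments('var s = "//x";')` truncates the string literal to `var s = "`.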

It's also being used by people, probably in production, so legacy support is a big concern. I am really shitty at regex. But the best source code I read the other day is in this repo: https://github.com/mafintosh/is-my-json-valid

//formats.js
exports['date-time'] = /^\d{4}-(?:0[0-9]{1}|1[0-2]{1})-[0-9]{2}[tT ]\d{2}:\d{2}:\d{2}(\.\d+)?([zZ]|[+-]\d{2}:\d{2})$/
exports['date'] = /^\d{4}-(?:0[0-9]{1}|1[0-2]{1})-[0-9]{2}$/
exports['time'] = /^\d{2}:\d{2}:\d{2}$/
exports['email'] = /^\S+@\S+$/
exports['ip-address'] = exports['ipv4'] = /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
exports['ipv6'] = /^\s*((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?\s*$/
exports['uri'] = /^[a-zA-Z][a-zA-Z0-9+-.]*:[^\s]*$/
exports['color'] = /(#?([0-9A-Fa-f]{3,6})\b)|(aqua)|(black)|(blue)|(fuchsia)|(gray)|(green)|(lime)|(maroon)|(navy)|(olive)|(orange)|(purple)|(red)|(silver)|(teal)|(white)|(yellow)|(rgb\(\s*\b([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b\s*,\s*\b([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b\s*,\s*\b([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b\s*\))|(rgb\(\s*(\d?\d%|100%)+\s*,\s*(\d?\d%|100%)+\s*,\s*(\d?\d%|100%)+\s*\))/
exports['hostname'] = /^([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]))*$/
exports['alpha'] = /^[a-zA-Z]+$/
exports['alphanumeric'] = /^[a-zA-Z0-9]+$/
exports['style'] = /\s*(.+?):\s*([^;]+);?/g
exports['phone'] = /^\+(?:[0-9] ?){6,14}[0-9]$/
exports['utc-millisec'] = /^[0-9]+(\.?[0-9]+)?$/

It's possible - it's not easy - but I'm optimistic some crazy hack of a regex set can account for 99+% of comments.

vitaly-t commented 8 years ago

The way my library currently works, it is good for 99% of all regEx cases, but it is not good enough for general public consumption, which requires 100%.

I know that the only way to cover the remaining cases is via full-expression evaluation, which in turn requires an engine built on tokenization, like esprima. In fact, I have looked closely at esprima, and found that it is not usable, because the tokens it generates do not include token positioning within the original document.

Then I switched my attention to UglifyJS, which is very fast, and the best part - its tokens do contain the positioning, so I started digging.

See my recent question on Stack Overflow: Enumerate regular expressions via UglifyJS

At the moment I see UglifyJS as a very viable option for implementing a 100% reliable comment remover.

What do you think of that?

RnbWd commented 8 years ago

Have you played with the AST explorer? It lets you prototype with 11 parsers, including uglify-js.

IMO babel/babylon6 has by far the most potential (in general). During the transition between v5-v6, the maintainer set a goal to make it the most efficient code-minimizer available, and I wouldn't be surprised if it becomes more efficient than uglify-js in ~6 months, although uglify-js is probably the best at the moment. The new babylon parser decoupled all the transform modules from the library itself, which has encouraged an ecosystem of plugins. Babel already has an option for stripping comments - and I assume anyone using babel wouldn't need a 3rd-party library like this for stripping comments.
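For reference, the comment-stripping option mentioned here is Babel's `comments` flag (it defaults to true), which can be set in a .babelrc, for example:

```json
{
  "comments": false
}
```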

The main issues with gulp-strip-comments so far have been related to angular. If your regex handles the edge cases of angular, that's HUGE. It would fix all the issues people have brought up so far. But in the spirit of actually stripping all the comments, let me know if you find something in AST explorer that's useful. I'm curious how acorn / espree compare with uglify. espree is a fork of esprima for eslint, which uses comments for lots of functionality, so there might be something in there; acorn is my personal favorite because of its simplicity (babel is sort of a fork of acorn).

EDIT: I can't get the astexplorer to parse comments - but that's probably a code-mirror setting.

RnbWd commented 8 years ago

Sorry I can't provide more technical help to implement this - I don't have a formal education in CS, but I'm still learning :) I know a lot about the tools and how to glue things together. The growth of this ecosystem has exploded in the last year, and a lot of that growth has centered around tools like transpilers, linters, and minifiers. The formalization of the ES6 specs really catalyzed this explosion. But all these choices can create redundancy in the build chain (it's common to have 3-4 different AST parsers in a gulp workflow), so I like the idea of having a simple regex strip the comments - assuming that down the line (in the workflow) there's another parser that can really take care of everything.

With that being said - there can be two repos. Keeping gulp-strip-comments regex-based is my preferred method, while decomment and gulp-decomment could be more robust and utilize techniques for 100% coverage. What do you think? I'd like to put the regex in gulp-strip-comments itself, not rely on any third-party modules, but also help out with decomment / gulp-decomment utilizing these parsers.

vitaly-t commented 8 years ago

Thank you for so much input on this! The current version 4.2 does handle the entire AngularJS without issues, as far as I can see. And I mention it on the main page: Performance.

Yesterday I finished my rewrite of the library using UglifyJS, apart from some small changes still required. The main issue I have run into so far is its support for ES6 - there is only a development branch for it, one that I haven't tried yet. I'm gonna check it in as a separate branch a little later.

And I will look at that AST explorer also. Cheers!

vitaly-t commented 8 years ago

Having spent some time with AST Explorer, I can see now the differences between various parsers.

And it tells me that UglifyJS isn't a good choice, because its support for ES6 is still not good. They just added support for template strings, but still do not support generators, for example.

I'm afraid I have to look at other parsers, ones that do support ES6 properly.

Damn, this project seemed so easy at first... :) At least, the current 4.2 version is as good as it gets, small and fast.

vitaly-t commented 8 years ago

I went through the list of existing parsers, along with their support and all, and I decided to have another look at esprima.

Shockingly, not only did I discover that I was wrong about it not reporting element positions (it does report them), but it also works almost 2 times faster than UglifyJS, while properly supporting the entire ES6.

That is a nice surprise. You mentioned that you didn't like esprima for some reasons. Could you please tell me why? 'Cos I'm really beginning to like it...

I am now changing my library to a generic AST syntax, so any of those parsers can be connected, relatively simply.

vitaly-t commented 8 years ago

Here's a very useful performance comparison.

Makes me have a look at Acorn also.

UPDATE: Checked out Acorn - it is the fastest, but it has no support for ES6, so it's not interesting.

So far Esprima looks like the best choice all around.

RnbWd commented 8 years ago

When I started using React about 2 years ago, esprima wasn't nearly as active or performant as it is today. Facebook actually maintained a fork until recently, esprima-fb, to parse JSX and some ES6 features. They use babel now for React and FB itself, and they've deprecated the fork of esprima. It's supposedly really good now, but 6to5 (now babel/babylon) already 'took over', at least in my neck of the woods :goat:

vitaly-t commented 8 years ago

I've just run into an issue with esprima - it doesn't report line numbers, it seems...

UglifyJS does, and so does babel...

Lack of line numbers within AST structure complicates its usage considerably.

RnbWd commented 8 years ago

Looking at their website, it's interesting that they support ES6, but also the estree spec. I remember estree being formed when another team building an ES6 parser joined forces with babel, and the community just gravitated towards babel - it happened so fast. But they (everyone) pioneered the standards, like spontaneous open source, not sponsored (well, now it's sponsored by FB, but at the time it was just some random dude's side project after graduation).

vitaly-t commented 8 years ago

My bad on the previous post, esprima does support locations, if option loc is passed in :) so far it's looking very good.

RnbWd commented 8 years ago

Have you looked at the other repo? https://github.com/jonschlinkert/strip-comments I think they use esprima somewhere but it's a separate module now (same github dude)

vitaly-t commented 8 years ago

I looked at the packaging of that from day one - no reference to any parser at all. That's a good part of the reason it breaks regular expressions during parsing :)

RnbWd commented 8 years ago

I still just want to use the regex haha

vitaly-t commented 8 years ago

RegEx itself is incredibly slow; applied against a 30,000-line file, it is 100 times slower than manual analysis. I ran strip-comments against AngularJS 1.5 - it took 4 seconds, with bad output. My parser does that in 40ms, with the correct result. Scale this up to a large project, and you will find that a regEx-based solution is not usable :)
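For illustration, the "manual analysis" idea can be sketched as a single pass over the source. This is a simplified, hypothetical version, not decomment's actual code - a complete implementation would also track regex literals and template strings, not just quoted strings:

```javascript
// Single-pass scanner: copies string literals verbatim (so comment
// markers inside them are preserved) and skips real comments.
function stripCommentsScan(code) {
  let out = '';
  let i = 0;
  while (i < code.length) {
    const ch = code[i];
    const next = code[i + 1];
    if (ch === '"' || ch === "'") {
      // Copy the whole string literal, honoring backslash escapes.
      const quote = ch;
      out += code[i++];
      while (i < code.length && code[i] !== quote) {
        if (code[i] === '\\') out += code[i++]; // keep escape char
        out += code[i++];
      }
      if (i < code.length) out += code[i++];    // closing quote
    } else if (ch === '/' && next === '/') {
      // Line comment: skip to end of line (newline itself is kept).
      while (i < code.length && code[i] !== '\n') i++;
    } else if (ch === '/' && next === '*') {
      // Block comment: skip to the closing "*/".
      i += 2;
      while (i < code.length && !(code[i] === '*' && code[i + 1] === '/')) i++;
      i += 2;
    } else {
      out += code[i++];
    }
  }
  return out;
}

// Keeps the "//" inside the string, drops the real comment:
console.log(stripCommentsScan('var s = "//not a comment"; // real\n'));
```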

RnbWd commented 8 years ago

Sorry I'm confused - how are you stripping these comments?

vitaly-t commented 8 years ago

The parser I referred to is the latest code you can currently see within decomment ;) No use of regEx ;)

vitaly-t commented 8 years ago

In all, we have 2 contenders, it seems - esprima and babel. I have looked through the former; dunno anything about babel yet...

RnbWd commented 8 years ago

that's some crazy fucking code

vitaly-t commented 8 years ago

U mean my current parser? :) Crazy good or crazy bad? :)

vitaly-t commented 8 years ago

Either way, bringing in an AST parser simplifies my code a lot, much easier to understand and to maintain ;)

RnbWd commented 8 years ago

crazy good haha

I think esprima is probably the better choice for this scenario - babel excels at its extensibility and customization, while esprima might be more performant.

vitaly-t commented 8 years ago

I'm gonna run my tests against babel now, will post the results in a moment... just to have the comparison finished.

RnbWd commented 8 years ago

I want to use the current implementation in gulp-strip-comments (which is why I added you as maintainer), but I don't expect much maintenance. Honestly, I created the library for myself, used it for a week, and forgot about it for 6 months, until I started getting emails about Angular-related errors and noticed it was being downloaded like 5000 times a month.

vitaly-t commented 8 years ago

You'd get more downloads, if it worked always as it should ;)

Looking at babel, it seems more complicated than your regular AST. Their basic parser isn't even AST-compliant; that's why AST explorer uses https://www.npmjs.com/package/acorn-to-esprima

That's not encouraging, in terms of possible performance...

vitaly-t commented 8 years ago

Looks to me like Esprima is the best choice, and it gives everything needed.

The bottom line: I have organized my code in such a way that it is very easy to switch from one parser to another. It all boils down to compiling an array of absolute indexes for all regular expressions. Here's my current code...

'use strict';

var uglify = require('uglify-js');

///////////////////////////////////////////////////////////
// Parses the code and returns a dictionary of relative
// coordinates, as `line index`: [{start, end},...], where
// {start, end} are the inclusive indexes of regEx content.
function parseThroughUglifyJS(code) {
    var parsedCode = uglify.parse(code);
    var data = {};
    parsedCode.walk(new uglify.TreeWalker(function (obj) {
        if (obj instanceof uglify.AST_RegExp) {
            var reg = obj.end;
            var name = reg.line - 1;
            if (!data[name]) {
                data[name] = [];
            }
            data[name].push({
                start: reg.col + 1,
                end: reg.endcol - 2
            });
        }
    }));
    return data;
}

function parseThroughEsprima() {
    // needs implementation;
}

module.exports = function (code) {
    return parseThroughUglifyJS(code);
};

I'm just finishing up this driver for Esprima... which already works in my tests... That's all it takes to go from one parser to another. My algorithm needs only the coordinates of all regular expressions.

And once I finish it all, I will have a chance to look at babel closer, and if it is better, the switch will be very easy.

RnbWd commented 8 years ago

I didn't even write the strip comments parser! I'm just the middle man who wrote a 10 line gulp-plugin :stuck_out_tongue_winking_eye:

Babel's goals are far more ambitious and complicated - Esprima is a better choice for this.

I built (modified) my own build system, chronic - it'll be revitalized one day... It's much better than gulp IMO, but thankfully nobody uses it, and I stopped development when io.js forked node because the ecosystem was so unstable. But I wrote an epic description hidden in the wiki. It's compatible with everything in the gulp ecosystem.

Anyways - getting this to stream, instead of buffering, as it's being parsed through something like a gulp plugin, would make a huge difference in how it performs, and that's something I can help with - the I/O wrapper layer.

vitaly-t commented 8 years ago

There is always a catch somewhere... I found out that Esprima does not offer any intelligent traversal system, like UglifyJS does. One can only traverse the structure using generic JavaScript tree walking, which is incredibly slow. See my research about this here: http://stackoverflow.com/questions/34524618/enumerate-regular-expressions-via-uglifyjs

When I try to traverse the AST generated by Esprima, it takes a whole minute for my AngularJS test, which, next to just 140ms of parsing, looks like a shot in the head...

vitaly-t commented 8 years ago

I have finished my implementation with Esprima - it works perfectly with that AngularJS 1.5 huge file.

I have checked in the code into branch AST.

It will need revised tests and some code refactoring before merging into the main branch.

You can see here how it uses both parsers. I will of course remove the UglifyJS before the merge.

I also found and included a very nice library, estraverse, that can quickly and simply traverse Esprima's AST.

vitaly-t commented 8 years ago

@RnbWd I have finished and released new version 0.5.0 of the library that uses esprima.

It can process that huge AngularJS 1.5 (30,000 lines) in about 200ms, which is an excellent result.

Now the library is 100% accurate.

See the documentation. Small protocol update - I had to move CSS processing into a sub-call decomment.css, because it is special: it doesn't use the JavaScript parser (esprima).

It is now ready for you to make a gulp plugin of it ;)

vitaly-t commented 8 years ago

These guys were amazing.

I was able to release the new version 0.5.1 that has 3 times the performance, chewing through that Angular monster in 85ms.

The parsing code also got so much smaller. Now it doesn't even do any parsing, it just does tokenization, that's all. I love that esprima library, it is so good, and support is excellent!

RnbWd commented 8 years ago

that's awesome! I think we can close this issue now, and i'll help out with the plugin. :+1:

ariya commented 8 years ago

but also the estree spec, which I remember being formed out of babel when

Correction: estree is mostly lifted from SpiderMonkey, Firefox's JavaScript engine. See this blog post, perhaps also this video for the AST details. You could also say that SpiderMonkey invented the format (de jure) and Esprima popularized it (de facto). There is a reason why major JavaScript parsers today (including Babel) have copies or derived versions of Esprima's exhaustive unit tests.

vitaly-t commented 8 years ago

@ariya Thank you! :+1:

vitaly-t commented 8 years ago

@ariya Part of the research I had to do in order to figure it all out: Enumerate regular expressions via UglifyJS

See my own answer there that was accepted :) Thanks again! :+1:

RnbWd commented 8 years ago

Yeah, I'm totally aware it's derived from SpiderMonkey - I just remember when that particular repo was created (blog post). It was a joint effort, but the 6to5 people really pushed for the standard. These other parsers and the SpiderMonkey AST had been around for years before there was any effort to standardize the syntax across multiple parsers.