micromatch / glob-fs

file globbing for node.js. speedy and powerful alternative to node-glob. This library is experimental and does not work on windows!
http://jonschlinkert.github.io/glob-fs
MIT License
55 stars 17 forks source link

wishlist #1

Closed jonschlinkert closed 9 years ago

jonschlinkert commented 9 years ago

Wishlist

Some user features that would be nice to have in a node.js file globbing lib:

API

Code

tunnckoCore commented 9 years ago

lol, we are twins we think in the same way!

drwxr-xr-x  9 charlike charlike  4096 Dec 31 10:30 .
drwxr-xr-x 25 charlike charlike  4096 Dec 31 12:53 ..
drwxrwxr-x  5 charlike charlike  4096 Dec 30 19:36 coreflow
drwxrwxr-x  3 charlike charlike  4096 Dec 30 13:09 dotfiles
drwxrwxr-x  5 charlike charlike  4096 Dec 31 11:58 glob-fs
drwxrwxr-x  6 charlike charlike  4096 Dec 31 10:25 kdf
drwxrwxr-x  5 charlike charlike 4096 Dec 31 11:45 benchmark
-rw-rw-r--  1 charlike charlike  594 Dec 31 10:29 .editorconfig
-rw-rw-r--  1 charlike charlike  858 Dec 31 10:29 .gitattributes
-rw-rw-r--  1 charlike charlike  603 Dec 31 10:29 .gitignore
-rw-rw-r--  1 charlike charlike   44 Dec 31 10:29 history.md
-rw-rw-r--  1 charlike charlike 3376 Dec 31 11:58 index.js
-rw-rw-r--  1 charlike charlike 2702 Dec 31 10:29 .jscsrc
-rw-rw-r--  1 charlike charlike  320 Dec 31 10:29 .jshintignore
-rw-rw-r--  1 charlike charlike  701 Dec 31 10:29 .jshintrc
-rw-rw-r--  1 charlike charlike 1101 Dec 31 10:29 license.md
drwxrwxr-x 15 charlike charlike 4096 Dec 31 10:33 node_modules
-rw-rw-r--  1 charlike charlike 1130 Dec 31 10:31 package.json
-rw-rw-r--  1 charlike charlike 1952 Dec 31 10:29 readme.md
-rw-rw-r--  1 charlike charlike  691 Dec 31 11:13 test.js
-rw-rw-r--  1 charlike charlike  238 Dec 31 10:29 .travis.yml

Yesterday I wondered between glob-fs and fs-glob names and thought of this pkg and I made some raw impl xD But the difficult part was implementing benchmarks with your benchmarked xD The concept and idea works in tests, but not in benchmarks, it just skip it, however.

tunnckoCore commented 9 years ago

allow ignore patterns to be passed on the options

Why? When we could define them in string/array approach? Looks needless.

how much flexibility to we want to provide for negation patterns?

Why is this question? We have micromatch? So much as micromatch supports?

is it important to allow users to un-negate previously negated patterns?

No logic for me.

jonschlinkert commented 9 years ago

Why? When we could define them in string/array approach? Looks needless.

1) b/c currently node-glob is a pita to use for negation, and I want to make it easier 2) when implementors (versus end-users) want to define negation patterns dynamically, this offers a more convenient alternative 3) explicitly defining negation patterns is faster than doing a check on every string, replacing ! and shuffling patterns around. probably a lot faster, but it might not be noticeable since the slowest part of the operation is always going to be hitting the fs

how much flexibility to we want to provide for negation patterns? ... We have micromatch? So much as micromatch supports? ... is it important to allow users to un-negate previously negated patterns?

No, not necessarily. There are material differences in how the matching is done in micromatch and what is required for file matching. To explain:

The short version is that we have an opportunity to implement a strategy that either uses the matcher (micromatch/minimatch) for filtering on the fly, or that prevents building up lists of files that shouldn't even be passed to the matcher, etc. I have implemented both strategies, both work, and both are much faster than node-glob.

Here is the longer version if it helps to have more detail.

micromatch

Like minimatch, micromatch has no knowledge of the file system. It's just a pattern matching tool. In other words, when you pass an array of elements to be filtered, it might be an array of files, or keywords, or anything else. micromatch just filters that array wholesale based on the patterns provided.

file system globbing

This is a different challenge. Recursing directories is slow. Regardless of how fast or slow it "feels" to the end-user, or whether it's sync, async or whatever... simply hitting the file system is the slowest part of the globbing operation. The point is that we want to avoid hitting the file system whenever possible, and in a perfect world, files and directories should be filtered on-the-fly to prevent picking up files or recursing into directories that have been negated or to only recurse into directories that match the given patterns.

By comparison, let's say you want to filter files from node_modules. node-glob will recurse all the way down and build up an array of everything in node_modules before filtering. This is a bad strategy if we can avoid it. It's possible to include/exclude as we go, and it's orders of magnitude faster (I'll try to do a POC for comparison).

This is all a long way of saying that:

jonschlinkert commented 9 years ago

cc @contra, @sindresorhus, @UltCombo, @doowb (including everyone from the discussion on https://github.com/wearefractal/glob-stream/pull/27).

tunnckoCore commented 9 years ago

Yea, right. I know the diffs.

Here's the perfect place to mention @thlorenz and his https://github.com/thlorenz/readdirp. IMO it is the fastest implementation I ever seen for recursively readdir and beats every other reading dir lib, including node-glob and glob-stream. But yea, the costs are it not support common used glob patterns which we see in gulp/minimatch and other matching libs. I mean, not support double globstar, defining positive and negative patterns are separated and etc, which is confusing and not so flexible (good looking) in my opinion.

UltCombo commented 9 years ago

I'm no globbing expert, but here are my .02.

About re-inclusion, it would be nice to have this feature built-in in the globbing library, IMHO. But if not, it is relatively easy to workaround -- for instance, globby and glob-stream are packages that wrap node-glob to add support for multiple glob patterns and re-including excluded paths, the same strategy could be applied to glob-fs if it doesn't support re-including excluded paths natively.

By the way, I really like the .gitignore analogy. Perhaps it would be the best approach to handle exclusion and re-inclusion? For instance, !dir/ will never touch dir nor anything inside of it, while !dir/** allows re-including files inside of it. E.g.:

!dir/
dir/foo.txt

Would not match anything, while:

!dir/**
dir/foo.txt

Would match dir/foo.txt. But anyway, as I said before, if this is too complicated then re-inclusion can be done by a wrapper lib.

jonschlinkert commented 9 years ago

@tunnckoCore thanks, yeah that's a nice lib. I've used it on a few projects.

for instance, globby and glob-stream are packages that wrap node-glob to add support for multiple glob patterns

That's a good frame of reference. I did the pr on multimatch to do the exclusion the way it works now. If you look at how micromatch handles arrays of patterns it's similar. So I guess what I was saying earlier is that the strategy used on globby/glob-stream is good if we can get away with not doing re-inclusion. Which would probably be my vote, I'm not sure yet.

if this is too complicated then re-inclusion can be done by a wrapper lib.

It's not that it's too complicated, it's more related to speed. The main performance hit from re-inclusion is that all files need to be stored, regardless of whether or not they have been excluded already.

Or this could be exposed as an option. e.g. "advanced" matching or something. That way most users will get the speed benefits of not doing re-inclusion, but it's there if you need it occasionally.

I had also considered allowing the user to define !! in front of a re-inclusion pattern, so we can do a "peek" or a kind of lookahead on the patterns to see if the user would be re-including. but... IMO that's too complicated. users shouldn't have to read a manual to figure out how this works.

UltCombo commented 9 years ago

@jonschlinkert I still don't quite see where the perf hit is at yet. Can you post an example explaining the extra logic steps it would take?

As far as I can see, taking a simple example such as ['**', '!a/**', 'a/b'], the steps would be:

  1. Glob all files, but don't walk the a directory.
  2. Glob a/b.

This means, start a new globbing process for re-inclusion instead of walking the excluded paths. It is efficient for this use case, but perhaps not as much when you're not excluding whole directories.

UltCombo commented 9 years ago

I assume the logic you have in mind would be to execute only a single glob process and match each path against all glob patterns, is that it? Then I see where the perf hit is.

jonschlinkert commented 9 years ago

Can you post an example explaining the extra logic steps it would take?

sure, no prob. I'll try to get something posted up here in a bit

jonschlinkert commented 9 years ago

okay, so this is way more than most people will want to know, but it helped me think this through to get it written out...

to execute only a single glob process and match each path against all glob patterns

yes, I think it's at least worth a poc. and you have a point about re-inclusion, maybe I'm overthinking that. It depends on details I guess.

Can you post an example explaining the extra logic steps it would take?

I'll preface by saying that none of this is necessarily complex or unfamiliar to devs. It's just that there are so many little issues that they add up to create a problem worth solving.

To put this in perspective, it might help to take a look at how micromatch does brace expansion in micromatch. Brace expansion is actually a really good analog to globbing with multiple patterns (think globby vs. native minimatch). So I'll use this as a convenient example... but this is only one area where we can make material improvements.


The minimatch approach is to take a pattern like **/*.{json,yml,yaml} and:

So in minimatch, the resulting regex looks like this:

/^(?:(?:(?!(?:\/|^)\.).)*?\/(?!\.)(?=.)[^/]*?\.json
  |(?:(?!(?:\/|^)\.).)*?\/(?!\.)(?=.)[^/]*?\.yml
  |(?:(?!(?:\/|^)\.).)*?\/(?!\.)(?=.)[^/]*?\.yaml)$/

We don't need a massive, redundant regex just to match different file extensions. Plus, under the hood

So the perf hit is exponential, or O(2n).

By contrast, micromatch checks the pattern to see if we can confidently get away with converting directly to regex (to stay as close to O(n) as possible). So in micromatch, the same brace pattern results in:

/^(?:(?:(?!(?:\/|^)\.).)*?\/(?!\.)(?=.)[^\/]*?\.(json|yml|yaml))$/

When braces are nested or have characteristics of being more complex, then we resort to full brace expansion. The result is that - with patterns that have braces - micromatch is 5-25x faster (and despite micromatch also supporting a richer feature set for brace expansion). In other areas micromatch is 50-70 time faster.

If we just follow the same reasoning to file globbing, these benefits should accumulate quadratically. We just need to decide on end-user featurese and what the API needs to look like. If I need to make changes in micromatch to accommodate, that's fine. Maybe I'll start adding some code to discuss.

tunnckoCore commented 9 years ago

maybe I'm overthinking that

Yea, Im sure we overthinking it. ;d

Okey, I think we should first create some really good possible real world fixtures and real world complicated patterns to not overthinking the problem and handle it in right way. Because I cant for now realize the big problem. Yea I know the differences between both, know the cost hitting the fs.

Yea, micromatch is awesome. And we will use (or should use) it in one way or another, because part of the idea behind glob-fs/node-glob is exact what matching libs does - testing filepaths (strings) against some patterns (other strings) - comparing strings at all. Another part (main part) is to handle fs.readdir and handle patterns - I think we should separate patterns to negative and positive.

Yea, I have really big need of POC, lol ;d or at least fixtures.

jonschlinkert commented 9 years ago

you're right.

I think we should first create some really good possible real world fixtures and real world complicated patterns to not overthinking the problem and handle it in right way.

agreed, we can use micromatch and brace expansion to create test fixtures :)

tunnckoCore commented 9 years ago

@jonschlinkert Could you create readme.md and I can fork it and then I'll share my idea?

tunnckoCore commented 9 years ago

Hm. I think that glob-fs, if save that name, should handle globbing for more than fs.readdir method like fs.readfile, etc.

tunnckoCore commented 9 years ago

Or we can choose fs-readdir (which I already take) to handle all of sync/async/stream/promise flow.

tunnckoCore commented 9 years ago

it's here btw https://github.com/tunnckoCore/fs-readdir/tree/glob-fs as sub folder + benchmarks + non-glob stream/async impl fs-readdir

jonschlinkert commented 9 years ago

Could you create readme.md and I can fork it and then I'll share my idea?

I can just generate the boilerplate, is that okay? Not much else to say yet until the API comes together.

should handle globbing for more than fs.readdir method like fs.readfile, etc.

Meaning actual file reading? Hmm, I'd rather this focus just on fast globbing. but maybe I misunderstood. Even features like what globule does, I think it would be better to let globule do that and just make this a drop-in replacement for node-glob.

Or we can choose fs-readdir

I'm open to whatever. My main goal is a good user experience, speed and consistency. @doowb also did some async/stream code for this, but I'll tackle that when I feel good about the matching and file path handling. for example, ensuring that returned paths are calculated correctly from what the user "expects" the cwd to be.

I'll just push up what I have so far

tunnckoCore commented 9 years ago

I can just generate the boilerplate, is that okay?

Yea.

Meaning actual file reading?

Mm nope. But nevermind.

and just make this a drop-in replacement for node-glob

Agreed. :+1:

tunnckoCore commented 9 years ago

The job around is interesting, yea. But as we all know we don't want just "next glob lib". So at first we need some awesome readdir async lib, then awesome stream lib and after all that we can pair them with mm to produce the awesome glob lib lol. After that we can replace glob-stream.

tunnckoCore commented 9 years ago

thoughts about this start https://github.com/tunnckoCore/glob-fs?

tunnckoCore commented 9 years ago

it implements almost all of the glob-stream options, actually without root and allowEmpty and have behaving as vinyl-fs src when options.src is true, also supports vinyl-fs since option

tunnckoCore commented 9 years ago

look and try it - comapring and unifying the api between glob-fs, readdirp, readdir-stream, glob-stream

replace #L3 with glob-fs from the repo

tunnckoCore commented 9 years ago

actually, it support also allowEmpty option see here

tunnckoCore commented 9 years ago

sadly, it dont pass few tests and i guest it won't pass them in future, because glob-stream/vinyl-fs/minimatch/glob does strange things and tests strange things for me. like why should emit error when no found (no matches)?

tunnckoCore commented 9 years ago

@jonschlinkert @contra @UltCombo @phated ping-pong :) thoughts when you can ofc :+1: :)

yocontra commented 9 years ago

@tunnckoCore What does this do to replace vinyl-fs/glob-stream that is better and not just a rewrite for the sake of writing your own?

tunnckoCore commented 9 years ago

Oh c'mon, we all can agree that the community need better and faster than node-glob + minimatch combo. We can stuck with them forever. Awesome replacement of minimatch already appeared and we see its results, its time for glob library. And this should be purpose number one, the top of our todo lists, before doing anything else.

We all feel and see the performance hit of them and glob-stream dependents.

Always can better. And one of our purposes as devs is to do the things better and better in future.

Yea, for now its not better than them. I'm not saying anything. Its just my implementation, my partial point of view. Im not against you and your awesome team and work around gulp and its ecosystem - plugins, vinyls, glob-stream, etc and of course upcoming undertaker.

We are here to discuss next glob library, future of file globbing in node, next node-glob. :)

Regards, Charlike.

jonschlinkert commented 9 years ago

@tunnckoCore I have to agree with what @contra is saying.

I think we can all agree that innovation is a good thing, if not for any other reason than to motivate progress. on one hand, developers shouldn't need any more justification to create a library than "I think I can do this better". But to get the community interested and involved, there needs to be a fundamentally different approach that results in much better performance or usability or flexibility or whatever. but your approach just makes it seem like you want to replace other libs with your own

tunnckoCore commented 9 years ago

I think we can all agree that innovation is a good thing, if not for any other reason than to motivate progress

exactly

but your approach just makes it seem like you want to replace other libs with your own

im sorry if it seems like that. just saying that we should work together for this. its just basic and absolutely not-ready implementation and it is a lot faster. im not locking you guys with my vision and first-shot implementation. its not end-product, this is clear.

yocontra commented 9 years ago

@tunnckoCore Why not just submit PRs to existing libraries instead of starting from scratch?

tunnckoCore commented 9 years ago

fundamentally different approach

right, but not in any case. I created my repo with same name, just because there's no fork button here. And im not going to publish it.

We can contribute to the awesome readdir-stream to add some new and needed things, then we would be able integrate it in glob-stream, actually make more sense directly in gulp. Cuz one module like this would dismiss almost completely use of glob-stream and vinyl-fs (if it is done correctly), just because they are based on slow tools and their architecture seems like plugins on top of node-glob stream which adds this performance hit - L40-41, L45, L50, L55 of vinyl.src and we see this isnt needed to accumulate same results. It can be done in core not as plugins on top of node-glob, on top of slow lib.

Im not attacking you, its just fact.

tunnckoCore commented 9 years ago

Mm yep, its possible, I just said it. lol

tunnckoCore commented 9 years ago

@jonschlinkert I cant wait to see your impl :)

tunnckoCore commented 9 years ago

Why not just submit PRs to existing libraries instead of starting from scratch?

and b/c want to try to do it. im not forcing anyone.

tunnckoCore commented 9 years ago

But to get the community interested and involved, there needs to be a fundamentally different approach.

but also to get interested and involved we should support already known features and options of our parents - as you did with micromatch :) this is just a process, yea

jonschlinkert commented 9 years ago

Why not just submit PRs to existing libraries instead of starting from scratch?

Have you ever actually looked at the code from minimatch or node-glob? Despite how bad the code is for both libs, my first approach was to do a pr to minimatch before creating micromatch. It didn't go as I'd hoped, so I created micromatch.

But if it's a matter for debate, the same could have been said for gulp. Why not do a pr to grunt? Or vinyl, why not do a pr to record? I'm not saying you should have, clearly gulp has a fundamentally different signature and approach. But vinyl, not so much. Regardless, I don't think you or I need any more justification to create an open source project than: "I can do better, and I will maintain it".

yocontra commented 9 years ago

@jonschlinkert FWIW vinyl is totally different from record so that isn't a valid comparison. vinyl is as different from record as gulp is to grunt.

jonschlinkert commented 9 years ago

@contra I'm sure you're right, it's been a while since I thought about this, or looked at the record code. I use vinyl quite a bit though. IMHO, the most valuable thing about vinyl is the philosophy behind it and the execution. regardless I'm grateful to have the freedom to take advantage of one or the other, or both if I want. plus innovation is more likely to happen when different libraries attempt to attack the same problem, even if the approach only differs slightly. just my 2c