micromatch / picomatch

Blazing fast and accurate glob matcher written JavaScript, with no dependencies and full support for standard and extended Bash glob features, including braces, extglobs, POSIX brackets, and regular expressions. Used by GraphQL, Jest, Astro, Snowpack, Storybook, bulma, Serverless, fdir, Netlify, AWS Amplify, Revogrid, rollup, routify, open-wc, imba, ava, docusaurus, fast-glob, globby, chokidar, anymatch, cloudflare/miniflare, pts, and more than 5 million projects! Please follow picomatch's author: https://github.com/jonschlinkert
https://github.com/micromatch
MIT License
964 stars 56 forks source link

Globstar in an or pattern seems to behave inconsistently with uses outside or. #88

Closed srdo closed 2 years ago

srdo commented 3 years ago

Hi.

According to the docs, globstar will only match path separators if ** are the only characters in a path segment. The following match returns true as expected, since ** are the only characters in the last path segment.

pm.isMatch("dir1/sub/sub", "dir1/**")

Adding an or to this pattern breaks it, and I'm wondering if it's because the or is being "counted as" another segment?

pm.isMatch("dir1/sub/sub", "dir1/**|dir2/**") -> false
pm.isMatch("dir2/sub/sub", "dir1/**|dir2/**") -> true
pm.isMatch("dir1/sub", "dir1/**|dir2/**") -> true, so the globstar seems to now not match / anymore
conartist6 commented 2 years ago

I just got burned by this too. I need it to work, so I'll see what I can figure out.

conartist6 commented 2 years ago

I think the bug must be in the parser somewhere. My minimal repro is:

assert(isMatch('a/b', '(x|**)'));

That expression is parsed as follows:

[
  { "type": "bos", "value": "", "output": "" },
  { "type": "paren", "value": "(" },
  { "type": "text", "value": "x|", "output": "|" },
  { "type": "star", "value": "*", "output": "[^/]*?" },
  { "type": "star", "value": "*", "output": "" },
  { "type": "paren", "value": ")", "output": ")" }
]
conartist6 commented 2 years ago

Compare that with pm.parse('(**)'):

[
  { "type": "bos", "value": "", "output": "" },
  { "type": "paren", "value": "(" },
  {
    "type": "globstar",
    "value": "**",
    "output": "(?:(?:(?!(?:^|\\\\/)\\\\.).)*?)"
  },
  { "type": "paren", "value": ")", "output": ")" }
]
conartist6 commented 2 years ago

And pm.parse('(x|y)'):

[
  { "type": "bos", "value": "", "output": "" },
  { "type": "paren", "value": "(" },
  { "type": "text", "value": "x|y", "output": "|y" },
  { "type": "paren", "value": ")", "output": ")" }
]

Something else is clearly very odd about this, since the output would appear to be invalid, i.e. no part of the reported output would ever match x. In practice though x can be matched by (x|y).

conartist6 commented 2 years ago

I wonder why | is always treated as type: 'text' by the parser: https://github.com/micromatch/picomatch/blob/5467a5a9638472610de4f30709991b9a56bb5613/lib/parse.js#L616-L622

conartist6 commented 2 years ago

I'm also a bit suspicious about this: https://github.com/micromatch/picomatch/blob/5467a5a9638472610de4f30709991b9a56bb5613/lib/parse.js#L825-L828

conartist6 commented 2 years ago

It looks to me like what should happen is we should fall through a couple cases and end up here: https://github.com/micromatch/picomatch/blob/5467a5a9638472610de4f30709991b9a56bb5613/lib/parse.js#L909

Instead we're falling into this case: https://github.com/micromatch/picomatch/blob/5467a5a9638472610de4f30709991b9a56bb5613/lib/parse.js#L837-L840

I suspect the confusing naming scheme with prev, prior and before has led to confusion about what contains what.

jonschlinkert commented 2 years ago

Adding an or to this pattern breaks it, and I'm wondering if it's because the or is being "counted as" another segment?

This isn't a bug, the pattern you're using isn't a supported glob. It's just creating a regular expression that happens to not match what you want. Try wrapping the segment in parentheses like this:

pm.isMatch('dir1/sub/sub', 'dir1/(**|dir2)/**');
pm.isMatch('dir2/sub/sub', 'dir1/(**|dir2)/**');
pm.isMatch('dir1/sub', 'dir1/(**|dir2)/**');

Or you can use braces, like: dir1/{**,dir2}/**.

conartist6 commented 2 years ago

@jonschlinkert Could you take another look at this? The example pattern you've provided always matches everything, and you seem to be asking me to do exactly what I am already doing that does not work.

Again the failing test case is:

pm.isMatch('a/b', '(x|**)'); // false

and it fails because the ** is interpreted as if it were a * for no reason that I understand.

conartist6 commented 2 years ago

Your parser also has different output for (**|x) and (x|**):

> pm.parse('(**|x)').tokens.map(({prev, ...rest}) => rest)
[
  { type: 'bos', value: '', output: '' },
  { type: 'paren', value: '(' },
  { type: 'star', value: '*', output: '[^/]*?' },
  { type: 'text', value: '|x', output: 'x' },
  { type: 'paren', value: ')', output: ')' }
]
> pm.parse('(x|**)').tokens.map(({prev, ...rest}) => rest)
[
  { type: 'bos', value: '', output: '' },
  { type: 'paren', value: '(' },
  { type: 'text', value: 'x|', output: '|' },
  { type: 'star', value: '*', output: '[^/]*?' },
  { type: 'star', value: '*', output: '' },
  { type: 'paren', value: ')', output: ')' }
]
conartist6 commented 2 years ago

Ah I'm sorry, I guess I've sort of co-opted this issue. You're saying that OPs code would work if they just added parens, and I'm saying that parens or no it's still gonna be at least partly busted (and busted in different ways depending on the order of the ** within the alternative group).

conartist6 commented 2 years ago

Sorry it's a bit off topic, but it's late and I'm having fun. I think I'm going to try my hand at rewriting this parser using a technique I've been building slowly towards. My technique would create a fully streaming (!) n-lookahead parser with top-down control and easy and expressive error handling. An implementation of parserate (to be built on top of the existing forkerate) would allow a parser to be written like so:

function *tokens(iterable, options) {
  const parsr = parserate(iterable);

  while (!parsr.done) {
    if (parsr.consume(/\w+/), 'text') {
      yield {
        type: 'text',
        value: parsr.consumed,
      }
    } else if (parsr.consume(/\d+/), 'digit') {
      yield {
        type: 'digit',
        value: parseInt(parsr.consumed, 10),
      }
    } else if (options.backreferences && parsr.consume(/\{(\d+)\}/, 'backreference')) {
      yield {
        type: 'backreference',
        value: parseInt(parsr.consumedMatch[1], 10),
      }
    } else {
      throw parsr.error(`Unexpected character: '${parsr.value}'`);
      // SyntaxError: Unexpected character: '&'
      // Expected one of: text, digit
      // line: 1,
      // col: 17
    }
  }
}
srdo commented 2 years ago

@jonschlinkert You are right, I was missing some parentheses. Changing the pattern to (dir1/**)|(dir2/**) did the trick to express "paths starting with dir1/ or dir2/". Thanks for the hint.

@conartist6 If your original pattern is intended to express "x or nonempty string", it looks like (x)|(**) does that.

I'll close this issue, I'm hoping this hint also solves your problem Conrad, but either way it's probably better if you raise a separate issue for it :)

conartist6 commented 2 years ago

I just want to make clear for anyone else that happens across this that the problem described by the title of this issue is not currently resolved, and is now tracked in #104.