How to ignore multiple directories? [Performance is slow]

borekb commented 6 years ago

I want to lint our monorepo which requires excluding node_modules and vendor directories (they contain a lot of Markdown files which are not our source code).

So far, I've tried this with no luck:

# All these don't work

# 1
markdownlint -i '**/(node_modules|vendor)/**' .

# 2
markdownlint -i '(node_modules|vendor)' .

# 3
markdownlint -i '**/node_modules/**' -i '**/vendor/**' .

# 4
markdownlint '**/!(node_modules|vendor)/**/*.md'

I especially think that #4 should work.

borekb commented 6 years ago

No, sorry, the fourth one probably shouldn't work, at least I cannot get it working in globtester.

borekb commented 6 years ago

This works but is extremely slow:

markdownlint -i '{**/vendor/**,**/node_modules/**}' .

Without the ignore rules, glob.sync takes under 5 seconds to come up with the list (already quite slow but maybe that's the way things are). With the glob above, it takes over 50 seconds on my machine.

I'm not sure if I'm doing something wrong but my use case doesn't seem to be that strange; either I'm not finding the right glob pattern or the current implementation is quite inefficient.

DavidAnson commented 6 years ago

Do you just want to exclude those directories at the top level, or is it necessary to do so recursively? Is your repo public so we can see the structure and experiment?

borekb commented 6 years ago

The repo is not public but its structure is basically just a collection of folders, so the ignoring needs to be recursive. It looks a bit like this:

- projectA
    - README.md
    - node_modules
- subdir
    - projectB
        - node_modules
        - docs.md
    - projectC
        - node_modules
        - docs
            - a.md
            - b.md
- project X
    - vendor
    - site
        - page1.md
        - page2.md

Actually, versionpress/versionpress is sort of like that as well:

DavidAnson commented 6 years ago

I have some ideas, but need to experiment. I’ll follow up after I get a chance to do that.

DavidAnson commented 6 years ago

Looking only at correctness, your option 3 (the simple, obvious one) should work - and it does for me on Windows. (The others may also work for me, but I tried this one first.) Could you please try it again? Is there maybe something going where your Unix shell is expanding the glob (though I see you've quoted it correctly)?

Here's my scenario:

C:\T\mdlcli44>tree /f /a
+---project X
|   +---site
|   |       page1.md
|   |       page2.md
|   \---vendor
|           BAD.md
+---projectA
|   |   README.md
|   \---node_modules
|           BAD.md
\---subdir
    +---projectB
    |   |   docs.md
    |   \---node_modules
    |           BAD.md
    \---projectC
        +---docs
        |       a.md
        |       b.md
        \---node_modules
                BAD.md

C:\T\mdlcli44>markdownlint .
project X/vendor/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]
projectA/node_modules/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]
subdir/projectB/node_modules/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]
subdir/projectC/node_modules/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]

C:\T\mdlcli44>markdownlint --ignore "**/node_modules/**" .
project X/vendor/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]

C:\T\mdlcli44>markdownlint --ignore "**/node_modules/**" --ignore "**/vendor/**" .

C:\T\mdlcli44>

borekb commented 6 years ago

Thanks, @DavidAnson, you're right that option 3 indeed works. It was just so slow that I assumed it hanged and killed it :)

Interestingly, having a single --ignore option is faster than multiple ones:

$ time markdownlint -i '{**/vendor/**,**/node_modules/**}' .
0.00s user 0.03s system 0% cpu 2:51.09 total

$ time markdownlint -i '**/vendor/**' -i '**/node_modules/**' .
0.01s user 0.01s system 0% cpu 3:35.35 total

40 second difference is quite big. Overall, there is certainly some performance issue as the actual linting is done quickly.

DavidAnson commented 6 years ago

Cool, thanks! Now that we have established it is working as expected, I will look at performance. As currently implemented, each glob runs separately and crawls the entire tree. This makes the code simple, but is worse for performance than combining them somehow. Additionally, each glob remembers all the matching files which is bad for memory. I think if you add “*.md” to the ignore globs so they only remember Markdown files, it will run a bit faster. (But I have not measured it yet.)

DavidAnson commented 6 years ago

Some updates:

I have a plan to remove the extra globs, so there's only ever one. This should make a noticeable difference.

However, there seems to be a bigger bug at play that makes --ignore slower than it should be. I don't know what it is yet, but this is my new top priority. It could be even more significant than removing the extra globs.

Finally, for fun, in a normal (no ignore) run against a directory of about 10 Node apps and dev dependencies, the time breakdown is roughly:

40% globbing for files
38% parsing and linting
3% sorting results
6% printing results

DavidAnson commented 6 years ago

The crazy behavior I was seeing above was because of how I was connecting the profiler and isn't real. So now I can measure the time for the run above, but this time with --ignore "**/node_modules/**" --ignore "**/vendor/**":

97% globbing
2% sorting
1% linting and printing

This suggests getting rid of the extra globs should be very helpful, so I'll continue looking into that.

DavidAnson commented 6 years ago

I think I'll have good news for you tomorrow. Performance of --ignore should be much faster in the next release. 😄 I'll share measurements once I've cleaned up the prototype.

borekb commented 6 years ago

Awesome, I think I was also seeing something like the 97/2/1 split. Looking forward to vnext :)

DavidAnson commented 6 years ago

The commit I'm about to make reduces the time required for the above test scenario (with --ignore) from about 44 seconds down to just under 9 seconds.

I expect a follow-up commit to refactor a bit, but the benefit won't be as significant.

borekb commented 6 years ago

That's a great time reduction, thanks! I unfortunately don't understand globbing / file iterations well enough to help with this but I know tools like ripgrep process ignore rules in a way that you won't even notice them (that tool, specifically, follows the .gitignore rules by default which I think is a slick approach).

DavidAnson commented 6 years ago

I’m hesitant to ignore anything by default, but what do you think about a parameter-less --gitignore option that would do what you suggest? Would that have worked in your case?

borekb commented 6 years ago

Yes, that would be nice. I'd combine it with the --ignore flag as two Markdown files in our repo need to be excluded from linting, but I assume the two filters would work together? Certainly not an essential feature though, I just thought it's nice in ripgrep. (The more important thing from ripgrep, IMO, is the speed.)

DavidAnson commented 6 years ago

Could you please open an issue for this? Yes, both flags should work together. I won’t do it now, but it’s a great candidate for a future release!

borekb commented 6 years ago

👍 , https://github.com/igorshubovych/markdownlint-cli/issues/46.

borekb commented 6 years ago

By the way, do you think I should also open a new performance issue for another order(s) of magnitude improvement? I think that it should be possible to get to under a second (though I don't have actual knowledge on how to achieve that).

DavidAnson commented 6 years ago

With yesterday’s commit, there is now only ever one glob that runs against the file system, so it is optimal in that respect. But it will waste time going into directories that will later be ignored, so that is an opportunity for improvement. I don’t think the glob library supports that, so it may be necessary to find another that does to fix this. It seems like a fine suggestion to capture, but may not be easy.

borekb commented 6 years ago

My feeling is that glob.sync will be the next bottleneck. I'll wait for the new release to test out the current performance improvements, and maybe file an issue then. Thanks!

igorshubovych / markdownlint-cli

How to ignore multiple directories? [Performance is slow] #44