Closed borekb closed 6 years ago
No, sorry, the fourth one probably shouldn't work, at least I cannot get it working in globtester.
This works but is extremely slow:
markdownlint -i '{**/vendor/**,**/node_modules/**}' .
Without the ignore rules, glob.sync
takes under 5 seconds to come up with the list (already quite slow but maybe that's the way things are). With the glob above, it takes over 50 seconds on my machine.
I'm not sure if I'm doing something wrong but my use case doesn't seem to be that strange; either I'm not finding the right glob pattern or the current implementation is quite inefficient.
Do you just want to exclude those directories at the top level, or is it necessary to do so recursively? Is your repo public so we can see the structure and experiment?
The repo is not public but its structure is basically just a collection of folders, so the ignoring needs to be recursive. It looks a bit like this:
- projectA
- README.md
- node_modules
- subdir
- projectB
- node_modules
- docs.md
- projectC
- node_modules
- docs
- a.md
- b.md
- project X
- vendor
- site
- page1.md
- page2.md
Actually, versionpress/versionpress
is sort of like that as well:
I have some ideas, but need to experiment. I’ll follow up after I get a chance to do that.
Looking only at correctness, your option 3 (the simple, obvious one) should work - and it does for me on Windows. (The others may also work for me, but I tried this one first.) Could you please try it again? Is there maybe something going where your Unix shell is expanding the glob (though I see you've quoted it correctly)?
Here's my scenario:
C:\T\mdlcli44>tree /f /a
+---project X
| +---site
| | page1.md
| | page2.md
| \---vendor
| BAD.md
+---projectA
| | README.md
| \---node_modules
| BAD.md
\---subdir
+---projectB
| | docs.md
| \---node_modules
| BAD.md
\---projectC
+---docs
| a.md
| b.md
\---node_modules
BAD.md
C:\T\mdlcli44>markdownlint .
project X/vendor/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]
projectA/node_modules/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]
subdir/projectB/node_modules/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]
subdir/projectC/node_modules/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]
C:\T\mdlcli44>markdownlint --ignore "**/node_modules/**" .
project X/vendor/BAD.md: 1: MD041/first-line-h1 First line in file should be a top level heading [Context: "No heading"]
C:\T\mdlcli44>markdownlint --ignore "**/node_modules/**" --ignore "**/vendor/**" .
C:\T\mdlcli44>
Thanks, @DavidAnson, you're right that option 3 indeed works. It was just so slow that I assumed it hanged and killed it :)
Interestingly, having a single --ignore
option is faster than multiple ones:
$ time markdownlint -i '{**/vendor/**,**/node_modules/**}' .
0.00s user 0.03s system 0% cpu 2:51.09 total
$ time markdownlint -i '**/vendor/**' -i '**/node_modules/**' .
0.01s user 0.01s system 0% cpu 3:35.35 total
40 second difference is quite big. Overall, there is certainly some performance issue as the actual linting is done quickly.
Cool, thanks! Now that we have established it is working as expected, I will look at performance. As currently implemented, each glob runs separately and crawls the entire tree. This makes the code simple, but is worse for performance than combining them somehow. Additionally, each glob remembers all the matching files which is bad for memory. I think if you add “*.md” to the ignore globs so they only remember Markdown files, it will run a bit faster. (But I have not measured it yet.)
Some updates:
I have a plan to remove the extra globs, so there's only ever one. This should make a noticeable difference.
However, there seems to be a bigger bug at play that makes --ignore
slower than it should be. I don't know what it is yet, but this is my new top priority. It could be even more significant than removing the extra globs.
Finally, for fun, in a normal (no ignore) run against a directory of about 10 Node apps and dev dependencies, the time breakdown is roughly:
The crazy behavior I was seeing above was because of how I was connecting the profiler and isn't real. So now I can measure the time for the run above, but this time with --ignore "**/node_modules/**" --ignore "**/vendor/**"
:
This suggests getting rid of the extra globs should be very helpful, so I'll continue looking into that.
I think I'll have good news for you tomorrow. Performance of --ignore
should be much faster in the next release. 😄 I'll share measurements once I've cleaned up the prototype.
Awesome, I think I was also seeing something like the 97/2/1 split. Looking forward to vnext :)
The commit I'm about to make reduces the time required for the above test scenario (with --ignore
) from about 44 seconds down to just under 9 seconds.
I expect a follow-up commit to refactor a bit, but the benefit won't be as significant.
That's a great time reduction, thanks! I unfortunately don't understand globbing / file iterations well enough to help with this but I know tools like ripgrep process ignore rules in a way that you won't even notice them (that tool, specifically, follows the .gitignore
rules by default which I think is a slick approach).
I’m hesitant to ignore anything by default, but what do you think about a parameter-less --gitignore
option that would do what you suggest? Would that have worked in your case?
Yes, that would be nice. I'd combine it with the --ignore
flag as two Markdown files in our repo need to be excluded from linting, but I assume the two filters would work together? Certainly not an essential feature though, I just thought it's nice in ripgrep. (The more important thing from ripgrep, IMO, is the speed.)
Could you please open an issue for this? Yes, both flags should work together. I won’t do it now, but it’s a great candidate for a future release!
By the way, do you think I should also open a new performance issue for another order(s) of magnitude improvement? I think that it should be possible to get to under a second (though I don't have actual knowledge on how to achieve that).
With yesterday’s commit, there is now only ever one glob that runs against the file system, so it is optimal in that respect. But it will waste time going into directories that will later be ignored, so that is an opportunity for improvement. I don’t think the glob
library supports that, so it may be necessary to find another that does to fix this. It seems like a fine suggestion to capture, but may not be easy.
My feeling is that glob.sync
will be the next bottleneck. I'll wait for the new release to test out the current performance improvements, and maybe file an issue then. Thanks!
I want to lint our monorepo which requires excluding
node_modules
andvendor
directories (they contain a lot of Markdown files which are not our source code).So far, I've tried this with no luck:
I especially think that
#4
should work.