remarkjs / remark-lint-no-dead-urls

lint rule to warn when URLs are dead
https://unifiedjs.com
MIT License
78 stars 13 forks source link

Missing files in remark-cli macOS pattern resolution #52

Closed karlhorky closed 2 months ago

karlhorky commented 2 months ago

Initial checklist

Affected packages and versions

remark-cli@12.0.1, remark-lint@10.0.0, remark-lint-no-dead-urls@2.0.0

Link to runnable example

https://github.com/karlhorky/repro-remark-lint-no-dead-urls-folder-patterns

Steps to reproduce

  1. Check out the reproduction at https://github.com/karlhorky/repro-remark-lint-no-dead-urls-folder-patterns
  2. Check out the failing test run on macos-latest (GitHub Actions workflow file) and observe the missed files in patterns remark . ..., remark ./src/pages/ ..., remark src/pages/**/*.mdx ... etc on macOS (only paths to singular files seem to be working on macOS)

This doesn't happen on ubuntu-latest

This didn't happen in versions before remark-lint-no-dead-urls@2.0.0

Runtime: Node.js latest LTS (20.17.0)

Expected behavior

Patterns are resolved to multiple files

Actual behavior

Patterns are resolved to a single file or zero files

Alternatives Considered

Maybe this is a general problem in remark-cli upstream? (eg. maybe this issue should be transferred to remarkjs/remark issues instead?)

Runtime

Other (please specify in steps to reproduce)

Package manager

pnpm

OS

macOS

Build and bundle tools

No response

ChristianMurphy commented 2 months ago

Hey @karlhorky! 👋 remark-cli by default only matches markdown extensions. https://github.com/remarkjs/remark/blob/97e6f149976e486b5008df073d27f0507d0e58a4/packages/remark-cli/cli.js#L32 the list of default extensions is https://github.com/sindresorhus/markdown-extensions/blob/797fc3dda4ecc9aab668bfdd745f570593f1581d/index.js#L4-L11

If you want to include MDX add the --ext mdx option to your matcher commands. Hope this helps!


This doesn't happen on ubuntu-latest

I suspect this is globstar related. Try setting

shopt -s globstar

On your mac shell.


This didn't happen in versions before remark-lint-no-dead-urls@2.0.0

This shouldn't be possible. This plugin isn't path aware. I strongly suspect you updated remark-cli at the same time.


In addition to address potential follow ups:

github-actions[bot] commented 2 months ago

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.

karlhorky commented 2 months ago

This didn't happen in versions before remark-lint-no-dead-urls@2.0.0

This shouldn't be possible. This plugin isn't path aware. I strongly suspect you updated remark-cli at the same time.

Nope, the PR only updated from remark-lint-no-dead-urls@1.1.0 to remark-lint-no-dead-urls@2.0.0

Screenshot 2024-09-28 at 16 58 37

Screenshot 2024-09-28 at 16 57 44

Screenshot 2024-09-28 at 17 01 06

Been working for literally years before the bump to remark-lint-no-dead-urls@2.0.0

Can create a PR to my reproduction repo if you need hard proof.

karlhorky commented 2 months ago

If you want to include MDX add the --ext mdx option to your matcher commands.

Looks like --ext mdx is what is missing, thanks!

https://github.com/karlhorky/repro-remark-lint-no-dead-urls-folder-patterns/pull/1

So it looks like, on macOS (without any further configuration or installing anything):

  1. ., ./src/pages/ and "src/pages/**/*.mdx" resolve all files correctly (incl. nested files) ✅
  2. src/pages/**/*.mdx (without quotes) does not resolve files correctly (misses nested files) ❌

So on macOS, the quotes are important for remark-cli

ChristianMurphy commented 2 months ago

Nope, the PR only updated from remark-lint-no-dead-urls@1.1.0 to remark-lint-no-dead-urls@2.0.0

@karlhorky your screenshot shows 8 files being updated, it's doing a lot more than just updating a dependency.

So on macOS, the quotes are important for remark-cli

When the glob isn't quoted it is handled by your shell. You are using a bash shell without globstar enabled, if you want to use unquoted globs. Enabled globstar

shopt -s globstar
karlhorky commented 2 months ago

your screenshot shows 8 files being updated, it's doing a lot more than just updating a dependency.

Those are other files that I changed myself, also not related to updating any dependencies. It was me trying different options to fix the problem of files being missed. But I tested it first without those other unrelated changes.

ChristianMurphy commented 2 months ago

I still strongly suspect that remark-cli/unified-engine/glob were updated through that PR. Given that the search you did shows remark-cli is in the diff somewhere off screen from what you did a screenshot of, it is still the most likely case.

karlhorky commented 2 months ago

It was in the previous package.json screenshot, that was the 1 occurrence (the line was unchanged)

But maybe something else caused a different remark-lint version to be used, I can't be 100% certain of that.

Not in the diff of the PR though.

karlhorky commented 2 months ago

I also tried out options.extensions to see if I can avoid having --ext mdx in the CLI command, but it looks like that's not possible in a Remark config (maybe I have the wrong types for the Remark config):

remark-lint-no-dead-urls.mjs

/** @type {Omit<import('unified-engine').Options, 'processor'>} */
const config = {
  extensions: ['mdx'],
  plugins: [
    'remark-mdx',
    [
ChristianMurphy commented 2 months ago

Looks like --ext mdx is what is missing, thanks!

I'm glad it works. I'm a bit surprised it ever worked without that option.

I also tried out options.extensions to see if I can avoid having --ext mdx in the CLI command, but it looks like that's not possible in a Remark config (maybe I have the wrong types):

options.extensions is part of creating a CLI https://github.com/remarkjs/remark/blob/97e6f149976e486b5008df073d27f0507d0e58a4/packages/remark-cli/cli.js#L30-L43 It is not part of a configuration file.

The type for a config is Preset https://github.com/unifiedjs/unified-engine/blob/main/readme.md#preset

karlhorky commented 2 months ago

Looks like --ext mdx is what is missing, thanks!

I'm glad it works. I'm a bit surprised it ever worked without that option.

Yeah, maybe what the old version of remark-cli did was read the pattern for .<extension> and add that to the extensions option...

karlhorky commented 2 months ago

The type for a config is Preset unifiedjs/unified-engine@main/readme.md#preset

Thanks! I'll take a look at using this type instead. Does it also have the frail boolean property? It's not documented at that section... But maybe it's somewhere else, eg. under import('unified').Settings

Edit: hm... import('unified').Settings is a very permissive type:

  /**
   * Configuration passed to a Plugin or Processor
   */
  interface Settings {
    [key: string]: unknown
  }

Would be great to have some type checking for typos like fraiil: true 🤔 Also, seems that it's not happy with top-level frail property


/** @type {import('unified-engine').Preset} */
const config = {
  settings: {
    frail: true,
    fraiil: true, // 💥 No error on typo
  },

  frail: true, // 💥 Object literal may only specify known properties, and 'frail' does not exist in type 'PresetSupportingSpecifiers'.

Edit 2: I think that's why I went with the original type of Omit<import('unified-engine').Options, 'processor'> - because it has the frail property and other things too:

https://github.com/unifiedjs/unified-engine/blob/0266c1abe3c54b736bae3f73b9f235083b466736/lib/index.js#L72-L74

So I think I'll stick with Omit<import('unified-engine').Options, 'processor'> for now, after I confirm that frail: true at top level is working...

Edit 3:

It seems that frail: true is not a thing for a Remark config file, doesn't trigger the same behavior as the --frail config flag. So I'll switch back to the import('unified-engine').Preset type.

ChristianMurphy commented 2 months ago

hm... import('unified').Settings is a very permissive type:

Settings are per-pipeline, and depend a lot on the parser and compiler used. There also can be multiple distinct pipelines in a project, so attaching the type to a globally shared setting interface (similar to how mdast and hast can be extended with new types), would be less viable than the node type scenario. I suspect trying to infer these will run into similar issues as https://github.com/unifiedjs/unified/issues/221

karlhorky commented 2 months ago

Weird, so in my reproduction repo (ubuntu-latest), it's finding the files in all cases except src/pages/**/*.mdx without quotes

But then in our private monorepo on GitHub Actions (many more links), it succeeds silently without any output, although I know that there are broken links contained in some of the files:

$ remark "./src/pages/**/*.mdx" --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs
Done in 57.25s.

The broken links show up on my local macOS development environment though 🤔

$ remark "./src/pages/**/*.mdx" --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs
src/pages/x/n/d/thank-u-next/index.mdx
486:3-486:116 warning Link to https://chrome.google.com/webstore/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/ is dead                    no-dead-urls remark-lint

⚠ 24 warnings
error Command failed with exit code 1.

I've tried patterns with ./src/pages/, ./src/pages, all with the --ext mdx ... seems like there's few differences left that I can think of (except argument order in the command, which I'm assuming doesn't matter).

I also thought it could be a path length issue, but that didn't cause the repro to stop working:

Would be cool to be able to show the full list of URLs tested, even if they are passing - I tried remark --verbose for this, but it didn't change anything about the output.

Maybe that's a feature request for remark-lint-no-dead-urls though...

ChristianMurphy commented 2 months ago

it's finding the files in all cases except src/pages/*/.mdx without quotes

Even on ubuntu, this is controlled by your shell. The distro being used may not have globstar enabled. run:

shopt -s globstar

But then in our private monorepo on GitHub Actions (many more links), it succeeds silently without any output, although I know that there are broken links contained in some of the files:

Check how you are running the script, and what the current wording directory cwd is. It sounds like you have the script pointed to the wrong folder.

Would be cool to be able to show the full list of URLs tested

I'm pretty sure the issue is folder related, rather than URL. Adding this feature wouldn't really help you debug the issue you are having.

karlhorky commented 2 months ago

Interesting:

  1. with a single file path (eg. src/a/b/c/d/index.mdx), it works in the monorepo, showing the error message (although it also puts out the file content to stdout, maybe this is a standard behavior of remark-cli with a single file?)

Run time: 5m 20s

$ remark src/pages/tech-fundamentals-foundations-immersive/modules/thank-u-next/index.mdx --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs
***

title: "Thank U, Next"
sortOrder: 11
curriculumModuleCategory: '{"id":2,"title":"Lectures","symbol":"🧑‍🏫"}'
------------------------------------------------------------------------

# Thank U, Next

... more file content here ...

...

486:3-486:116 warning Unexpected dead URL `[https://chrome.google.com/webstore/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/`](https://chrome.google.com/webstore/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/%60), expected live URL                                                                                                                                             no-dead-urls remark-lint
  [cause]:
              error   Unexpected not ok response `404` (`Not Found`) on `[https://chromewebstore.google.com/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/`](https://chromewebstore.google.com/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/%60)                                                                                                                                   dead         dead-or-alive

⚠ 13 warnings
error Command failed with exit code 1.
  1. it starts failing as soon as I remove the index.mdx (eg. src/a/b/c/d/), although I think it's still linting the files, because it still takes a long time, just doesn't show the output and doesn't exit with code 1

Run time: 10m 27s

$ remark src/pages/tech-fundamentals-foundations-immersive/modules/thank-u-next/ --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs
Done in 627.86s.

Not sure why it would take twice as long though, since there is only the index.mdx file in that directory...

ChristianMurphy commented 2 months ago

(although it also puts out the file content to stdout, maybe this is a standard behavior of remark-cli with a single file?)

It is, if you want to turn that off for single files use --no-stdout https://github.com/unifiedjs/unified-args?tab=readme-ov-file#--stdout

Not sure why it would take twice as long though, since there is only the index.mdx file in that directory...

Usually when it takes significantly longer when using globs/patterns. There is a hidden/ignored folder like node_modules, .git, or another tool that has a large number of files that would take time to scan. Taking a long time does not mean it is looking at the folder you are expecting.

karlhorky commented 2 months ago

Update: I have it working with the folder path now (the one like src/a/b/c/d/ in 2 above)

I did this by removing a large amount of the content in the (long) MDX file.

So it appears that it's related to MDX content in the file - maybe something invalid or some link that is causing remark-cli / remark-lint / remark-lint-no-dead-urls to fail silently when there's more than 1 file specified (but does not fail when only 1 is specified)

ChristianMurphy commented 2 months ago

So it appears that it's related to MDX content in the file

What indicates it is the MDX content specifically, over the length of the file?

Another worth while question, how many URLs are in that document? And do any of them have rate limiting or DDoS protection applied?

The functional attempts to access all urls in the document simultaneously. https://github.com/remarkjs/remark-lint-no-dead-urls/blob/c1365e1becd238b82c88804888e8f0ebdcd72abe/lib/index.js#L148-L149 With a sufficiently large document you could be triggering your own rate limiting/DDoS protections.

when there's more than 1 file specified

Again, this plugin can literally only see a single file at a time https://github.com/remarkjs/remark-lint-no-dead-urls/blob/c1365e1becd238b82c88804888e8f0ebdcd72abe/lib/index.js#L57

karlhorky commented 2 months ago

What indicates it is the MDX content specifically, over the length of the file?

I guess it could also be file length (485 lines), but I was not expecting that. To reiterate, the entire file content is not a problem when specifying a path directly to the file eg. remark src/a/b/c/d/index.mdx --ext mdx ...

It appears that the problematic bits are code blocks in the file.

Another worth while question, how many URLs are in that document? And do any of them have rate limiting or DDoS protection applied?

There are 60 URLs in the file, and blocking is not a problem when I use a file path directly to the file.

Again, this plugin can literally only see a single file at a time

Oh maybe I was misunderstanding earlier - are you saying that the directory path pattern I'm using above remark ./src/a/b/c/d/ --ext mdx ... won't work?

Again, this plugin can literally only see a single file at a time

Oh maybe I was misunderstanding earlier - are you saying that the directory path pattern I'm using above remark ./src/a/b/c/d/ --ext mdx ... won't work?

ChristianMurphy commented 2 months ago

It appears that the problematic bits are code blocks in the file.

Do the code blocks contain a url attribute?

https://github.com/remarkjs/remark-lint-no-dead-urls/blob/c1365e1becd238b82c88804888e8f0ebdcd72abe/lib/index.js#L101

Could you share an anonymized/non-proprietary version of the file?

Oh maybe I was misunderstanding earlier - are you saying that the directory path pattern I'm using above remark ./src/a/b/c/d/ --ext mdx ... won't work?

I'm saying that unified-engine handles passing files one at a time to the plugins. The number of files shouldn't have an impact on how this plugin works.

To reiterate, the entire file content is not a problem when specifying a path directly to the file

@wooorm could that change the file meta data? There is one block that depends on the file meta data in this project. https://github.com/remarkjs/remark-lint-no-dead-urls/blob/c1365e1becd238b82c88804888e8f0ebdcd72abe/lib/index.js#L84

@karlhorky are you testing any relative links which would depend on this?

karlhorky commented 2 months ago

Interesting, it seems the behavior is "flaky" - when running the command below, sometimes remark-lint-no-dead-urls exits with no output and a 0 exit code, and then other times, it shows the broken links output.

remark src/pages/tech-fundamentals-foundations-immersive/modules/thank-u-next/index.mdx --ext mdx --frail --rc-path remark-lint-no-dead-urls.mjs

The one Chrome extension link https://chrome.google.com/webstore/detail/wildfire/djhgeeodemlfdpmcccdekfalbhllcoim/ is definitely broken always though, so not sure why it would ever report no output with a 0 exit code.

Here is an example of some of the HTML that we have in the code blocks (unusual HTML 3.2 code, teaching about legacy HTML):


16. Right click on the new `legacy` folder and create a new file called `index.html` with the following content:

    ```html
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <html>
    <head>
      <title>TodoMVC</title>
    </head>
    <body>
    <script>
    function rerender() {
        frames[1].location.reload();
    }

    function click(id) {
        for (var i = 0; i < state.numTodos; i++) {
            if (state.todos[i].id == id) {
                state.todos[i].complete = !state.todos[i].complete;
                rerender();
                return;
            }
        }
    }

    function addTodo(text) {
        var nextId = state.nextId;
        state.nextId++;
        state.todos[state.numTodos] = makeTodo(nextId, text, false);
        state.numTodos++;
        rerender();
    }

    function clearCompleted() {
        var newTodos = new Object();
        var newNum = 0;
        for (var i = 0; i < state.numTodos; i++) {
            if (!state.todos[i].complete) {
                newTodos[newNum] = state.todos[i];
                newNum++;
            }
        }
        state.todos = newTodos;
        state.numTodos = newNum;
        rerender();
    }

    function filter(newFilter) {
        state.filter = newFilter;
        rerender();
    }

    function makeTodo(id, label, complete) {
        var todo = new Object();
        todo.id = id;
        todo.label = label;
        todo.complete = complete;
        return todo;
    }

    var state = new Object();
    state.filter = 'all';
    state.todos = new Object();
    state.numTodos = 0;
    state.nextId = 1;
    </script>
    <FRAMESET rows="75,*" border="0" frameborder="no" framespacing="0">
        <FRAME src="header.html" scrolling="no">
        <FRAME name=content src="todos.html" scrolling="no">
    </FRAMESET></body>
    </html>


Parts of the code are also highlighted red in the HTML highlighting in VS Code, because of the invalid / legacy code:

![Screenshot 2024-09-29 at 00 42 08](https://github.com/user-attachments/assets/432f61c2-614a-45e8-8a3d-b74e053ba08c)

Maybe I should try running it again 5 or 10 times with the direct path to the file, with unchanged file content - maybe `remark-lint-no-dead-urls@2.0.0` just has some generally flaky behavior, and it has nothing to do with the path containing the `/index.mdx` on the end or just a `/` ... 🤔

Edit: Since I saw that there were `undici` paths in the warning / error messages from `remark-lint-no-dead-urls`, I checked into that, and I found that I had a slightly outdated version in my tree (`undici@6.19.7` instead of `undici@6.19.8`). I re-generated my lockfile and am retesting now... first test run is showing failures now. So if this leads somewhere, it opens a new possibility - maybe it is a bug that's already fixed in a transitive dependency...

Edit 2: No, upgrading to `undici@6.19.8` didn't seem to resolve the problem. `remark-lint-no-dead-urls@2.0.0` is still intermittently, silently failing with no stdout (also across different test runs when making no content changes or path changes). I think I'll hold off on upgrading for now, not sure there's much else I can do to debug without diving into the code of `remark-lint-no-dead-urls@2.0.0` vs `remark-lint-no-dead-urls@1.1.0`
karlhorky commented 2 months ago

Ok, that was actually easier than I thought to reproduce - just run the lint command multiple times in the GitHub Actions workflow and it immediately shows the intermittent false negatives / flaky behavior: