mdn / content

The content behind MDN Web Docs
https://developer.mozilla.org
Other
9.17k stars 22.46k forks source link

Broken URL fragments in the repo #25421

Closed OnkarRuikar closed 3 months ago

OnkarRuikar commented 1 year ago

There are many broken URL fragments in mdn/content.\ Almost all the time when a header(## header) gets modified, contributors don't update related fragments all over the repo.

Following is the list of broken fragments data: https://gist.github.com/OnkarRuikar/fa14827682a37f04f23c49189bfa0701

teoli2003 commented 1 year ago

Note that if you group them by broken fragments (rather than by page with a broken link), we can look for one and solve all occurrences, and the list can be resolved more quickly.

OnkarRuikar commented 1 year ago

Note that if you group them by broken fragments (rather than by page with a broken link), we can look for one and solve all occurrences, and the list can be resolved more quickly.

But then we'll get multiple commits on each file :)\ If that is not an issue then I've added sorted list of unique urls to OP.

teoli2003 commented 1 year ago

Note that if you group them by broken fragments (rather than by page with a broken link), we can look for one and solve all occurrences, and the list can be resolved more quickly.

As long as we don't do one commit per flaw… I think this is ok: there is about 20 times the same broken link to HTTP/Cache#varying… If all are fixed in the same PR, it is quick to review, and I don't mind that the other flaws in these files are solved separately.

Josh-Cena commented 3 months ago

@OnkarRuikar: I don't think this issue can be kept up to date. Can you share the script used to generate the gists and every time we try to fix something we can run the script?

OnkarRuikar commented 3 months ago

every time we try to fix something we can run the script?

I don't think this issue can be kept up to date.

We don't have to. The current script prevents new occurrences from happening. The gist still hold true, e.g. content/files/en-us/learn/tools_and_testing/client-side_javascript_frameworks/ember_interactivity_events_state/index.md Learn/JavaScript/Building_blocks/Events#addeventlistener_and_removeeventlistener is still broken.

Josh-Cena commented 3 months ago

What about anchors that happen to be fixed, though? For example, the very first item, "content/files/en-us/glossary/script-supporting_element/index.md Web/HTML/Kinds_of_HTML_content#script-supporting_elements", was fixed in #28000.

Josh-Cena commented 3 months ago

Anyway I'll drop my script used in https://github.com/orgs/mdn/discussions/279:

import fs from "node:fs/promises";
import path from "node:path";
import { load } from "cheerio";

async function* getFiles(dir) {
  const dirents = await fs.readdir(dir, { withFileTypes: true });
  for (const dirent of dirents) {
    const res = path.join(dir, dirent.name);
    if (dirent.isDirectory()) {
      yield* getFiles(res);
    } else if (res.endsWith("plain.html")) {
      yield Promise.all([res, fs.readFile(res, "utf8")]);
    }
  }
}

const files = getFiles("build");
/** @type {Record<string, string[]>} */
const pathToIds = { __proto__: null };
/** @type {Record<string, string[]>} */
const pathToLinks = { __proto__: null };

for await (const [filename, content] of files) {
  const $ = load(content);
  pathToLinks[filename] = $("a")
    .map((i, el) => $(el).attr("href"))
    .get()
    .map(
      (url) =>
        new URL(
          url.startsWith("#")
            ? `${filename.replace(/build\/|\/plain.html/g, "")}${url}`
            : url,
          "https://developer.mozilla.org"
        )
    )
    .filter(
      (url) =>
        url.hostname === "developer.mozilla.org" &&
        url.hash &&
        url.hash !== "#languages-switcher-button"
    );
  pathToIds[filename] = $("[id]")
    .map((i, el) => $(el).attr("id"))
    .get();
}

for (const [filename, urls] of Object.entries(pathToLinks)) {
  const failures = [];
  for (const url of urls) {
    const referencedPath = path.join(
      "build",
      url.pathname
        .replaceAll("*", "_star_")
        .replaceAll("::", "_doublecolon_")
        .replaceAll(":", "_colon_")
        .replaceAll("?", "_question_")
        .toLowerCase(),
      "plain.html"
    );
    const ids = pathToIds[referencedPath];
    if (ids) {
      const target = ids.find((id) => id === decodeURI(url.hash.slice(1)));
      if (!target) {
        failures.push(
          `Anchor not found: ${url
            .toString()
            .replace("https://developer.mozilla.org", "")}`
        );
      }
    } else if (
      !["/", "/en-US/docs/MDN/Contribute/Feedback", "/docs/MDN/About"].includes(
        url.pathname
      )
    ) {
      failures.push(
        `Page not found: ${url
          .toString()
          .replace("https://developer.mozilla.org", "")}`
      );
    }
  }
  if (failures.length) {
    const sourceFile = path.resolve(
      filename
        .replace("build/en-us/docs", "files/en-us")
        .replace("plain.html", "index.md")
    );
    console.log(`In ${sourceFile}:
${failures.map((e) => `  - ${e}`).join("\n")}
`);
  }
}
OnkarRuikar commented 3 months ago

Anyway I'll drop my script used in github.com/orgs/mdn/discussions/279:

The script doesn't seem to handle definition list anchors:

In /home/onkar/GitHub/mdn/yari/client/files/en-us/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.md:
  - Page not found: /client/en-us/docs/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.html#hearing_impairment
  - Page not found: /client/en-us/docs/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.html#visual_impairment
  - Page not found: /client/en-us/docs/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.html#pausing_capacity
  - Page not found: /client/en-us/docs/learn/common_questions/design_and_accessibility/what_is_accessibility/plain.html#keyboard_capacity

I've updated the broken fragment data in OP. As per my script there seems only 617 broken URL fragments left. The script runs on custom platform, which i developed years ago. It'll take some effort to make is stand alone.

@bsmth We've already plugged the hole so there aren't going to be new broken frags. For existing ~617 ones we need to create a project and a online spreadsheet to track them while fixing them.

Josh-Cena commented 3 months ago

The script doesn't seem to handle definition list anchors:

I don't see those problems in my output. Make sure you have built using the latest content.

Also, the pre-commit hook isn't able to catch anchor errors generated from macros.

OnkarRuikar commented 3 months ago

The script doesn't seem to handle definition list anchors:

I don't see those problems in my output. Make sure you have built using the latest content.

I built it in yari. Doing it in content works.

Also, the pre-commit hook isn't able to catch anchor errors generated from macros.

Yes, for macros the entire content needs to be built. Witch won't be favourable before every commit or in GitHub workflows.

Josh-Cena commented 3 months ago

Closing in favor of https://github.com/mdn/mdn/issues/561

bsmth commented 3 months ago

Closing in favor of mdn/mdn#561

good idea, thanks!