psf / gh-migration

This repo is used to manage the migration from bugs.python.org to GitHub.

Message conversion and formatting #3

Closed: ezio-melotti closed this issue 2 years ago

ezio-melotti commented 4 years ago

This issue is about converting and formatting the content (text) of the bpo messages (not the issue metadata) before importing them into GitHub.

bpo messages are raw text with no formatting, whereas GitHub issues use Markdown. If messages are imported directly, special characters in the bpo messages might be wrongly interpreted as Markdown formatting, resulting in erroneous rendering.

Possible solutions:

  1. Import messages within code-block markup, to render it literally:
    • quick and easy solution, but the result looks ugly
    • SymPy used this approach (see e.g. this issue)
  2. Import messages as normal text, but escape special characters
    • can this be done reliably?
    • are there already existing tools that can do it?
  3. Detect and convert to Markdown links, code blocks, lists, etc.
    • can this be done reliably?
    • are there already existing tools that can do it?
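As a rough illustration of options 1 and 2, here is a minimal Python sketch. The function names and the set of characters to escape are assumptions made for this example, not something taken from the migration tool, and the character set is not guaranteed to cover everything GitHub interprets.

```python
import re

# Characters that Markdown may treat as formatting. This set is an assumption
# for illustration, not an exhaustive list of what GitHub interprets.
MD_SPECIAL = re.compile(r'([\\`*_{}\[\]()#+\-.!<>|~])')


def as_code_block(text: str) -> str:
    """Option 1: wrap the whole message in a fence so it renders literally."""
    # Pick a fence longer than any backtick run in the message, so the
    # message content cannot close the fence early.
    longest = max((len(run) for run in re.findall(r'`+', text)), default=0)
    fence = '`' * max(3, longest + 1)
    return f'{fence}\n{text}\n{fence}'


def escape_markdown(text: str) -> str:
    """Option 2: keep normal text but backslash-escape special characters."""
    return MD_SPECIAL.sub(r'\\\1', text)
```

Escaping every punctuation character is safe but noisy in the raw source (for example, every period gets a backslash), so a real implementation would probably escape more selectively.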

Edit: I went with option 3. It's not perfect, but it seems to work well enough.

Other considerations:

TODO:

ammaraskar commented 3 years ago

> 1. Import messages within code-block markup, to render it literally:

One nice aspect of this is that bpo issues are currently displayed in a monospace font. Without adequate conversion of code blocks, as in (3), code snippets and places where alignment is important would look broken. (3) seems quite hard to implement, though: a lot of the formatting is ad hoc (I know I've personally sometimes kept code blocks on the same level, or occasionally indented them with 2 or 4 spaces). It seems like (1) might be the way to go. If we really want (2), there are mature libraries like turndown.

> On bpo, links to other issues, messages, PRs, PEPs, etc. are added at rendering time using regexes.

This is probably less of a concern with GitHub autolinks (https://docs.github.com/en/github/administering-a-repository/managing-repository-settings/configuring-autolinks-to-reference-external-resources), which we already have set up for bpo links on the CPython repo thanks to @mariatta in https://github.com/python/core-workflow/issues/361

> Issue numbers can also be remapped from the bpo to the GH numbers during the same step.

Depending on the resolution of #2, it might be nice to keep the links the same and have the bugs.python.org/issuexxx links redirect you to the right GitHub page. But assuming we make the Roundup instances read-only or mirrored, remapping the issues is probably a good idea.
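For what it's worth, the remapping step could be sketched roughly like this in Python. The mapping, the regex, and the function name are all hypothetical; the reference styles it recognizes ("bpo-1234", "issue 1234", "#1234") are an assumption about how people wrote issue references on bpo.

```python
import re

# Hypothetical reference pattern: "bpo-1234", "issue 1234", "issue #1234", "#1234".
# The lookbehind skips matches inside words and URL paths, so links such as
# bugs.python.org/issue1234 are left untouched for a redirect to handle.
ISSUE_REF = re.compile(r'(?<![\w/])(?:bpo-|issue\s*#?|#)(\d+)\b', re.IGNORECASE)


def remap_issue_refs(text: str, mapping: dict) -> str:
    """Rewrite bpo issue references to GitHub issue numbers where known."""
    def repl(match):
        gh_number = mapping.get(int(match.group(1)))
        if gh_number is None:
            # Unknown bpo issue: keep the original text unchanged.
            return match.group(0)
        return f'#{gh_number}'
    return ISSUE_REF.sub(repl, text)
```

With a made-up mapping of `{1234: 5678}`, the text `See issue 1234 and bpo-1234.` would become `See #5678 and #5678.`, while references to issues missing from the mapping stay as written.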

ezio-melotti commented 2 years ago

We have:

We want:

It seems that:

If we want to implement option 2 from the first message, we could:

If we want to implement option 3, we could parse each paragraph independently and:

I'll do some tests on a real-world sample to see if we can reach a good compromise.
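To make the per-paragraph idea for option 3 a bit more concrete, here is a minimal Python sketch. The heuristic (a paragraph counts as code when every line is indented or starts with a REPL prompt) is purely an assumption for illustration, as are the function names; it is not the heuristic the migration ended up using.

```python
import re


def looks_like_code(paragraph: str) -> bool:
    """Hypothetical heuristic: every line is indented or starts with a REPL prompt."""
    lines = paragraph.splitlines()
    return bool(lines) and all(
        line.startswith(('  ', '\t', '>>> ', '... ')) for line in lines
    )


def convert_message(text: str) -> str:
    """Fence paragraphs that look like code and leave the rest as plain text."""
    fence = '`' * 3
    # Paragraphs are separated by one or more blank lines; the split keeps
    # the leading indentation of each paragraph intact.
    paragraphs = re.split(r'\n\s*\n', text.strip('\n'))
    converted = []
    for paragraph in paragraphs:
        if looks_like_code(paragraph):
            converted.append(f'{fence}\n{paragraph}\n{fence}')
        else:
            converted.append(paragraph)
    return '\n\n'.join(converted)
```

A heuristic this simple will miss unindented snippets and tracebacks (whose first line is not indented), which is the ad hoc formatting problem raised earlier in the thread.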