internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.21k stars 1.37k forks source link

Lone `**` in markdown creates italics evermore. #9041

Open scottbarnes opened 7 months ago

scottbarnes commented 7 months ago

Problem

Having a lone pair of asterisks (**) on a line in an Open Library field that supports markdown (such as the description field of an Edition or Work) causes the following text for the rest of the page to be in italics.

Note: this does not appear to happen with lone asterisks (*). Instead, it the Open Library implementation seems to accurately follow GitHub, insofar as nothing happens and there are no italics.

Evidence / Screenshot

italics start here**
and never end.

image

Relevant URL(s)

Reproducing the bug

  1. Go to an edition description
  2. Do enter ** somewhere on a line and save.

Context

Notes from this Issue's Lead

Proposal & constraints

We appear to follow GitHub Flavored markdown, which doesn't appear to support italics or bold spanning lines. The latter case is one likely way people will run into the bug.

However, having unmatched ** on a line in GitHub (e.g. here in this form) doesn't seem to cause an italicsfest. Instead, it merely does nothing. We should consider following GitHub's behavior.

Related files

Stakeholders

benbdeitch commented 3 months ago

After a little bit of looking into this, I think I might know exactly what is causing this. Could you assign me?

benbdeitch commented 3 months ago

Yeah, I have the exact cause of this right here.

image

This is how the patterns we use for markdown are defined in our code. On line 6679, changing the regex to EMPHASIS_RE = r'\*([^\*]+)\*' # *emphasis* solves the issue on my local version. It seems that this issue occurs whenever the program matches the case where there are no intermediary characters between the two.

If four asterisks are inserted -- the beginning and end of the pattern for bolding, this occurs: image

However, by changing the quantifiers involved to +, mandating one or more characters to be separating the beginning and endpoint, the problem is solved.

Case in point: image

image

This seems to be the case for all patterns of this sort.

jimchamp commented 3 months ago

Noting that this is the file referenced in this comment, and that this code seems to be a much older version of this library. Also noting the the current version of markdown has updated regular expressions for markdown tokens, and the change is roughly equivalent to yours, @benbdeitch (it checks for at least one non-asterisk character between the **). We have a few other server-side markdown issues that deal with markdown not being parsed as expected. I'm wondering if updating Infogami to use the current markdown library would close these issues? I noticed that markdown is included in Infogami's requirements.txt, and that this library was briefly used in macro.py, but that change was rolled back. I'm marking this as https://github.com/internetarchive/openlibrary/labels/State%3A%20Blocked until I can figure out the following: