matthewwithanm / python-markdownify

Convert HTML to Markdown
MIT License

autolink when Goodreads breaks the URL #82

Closed mirabilos closed 9 months ago

mirabilos commented 1 year ago

Goodreads’ RSS feeds hide parts of the link: […]URL: <a href="https://archiveofourown.org/works/21085382" rel="nofollow noopener" target="_blank">https://archiveofourown.org/works/210...</a><br />[…]

Fix:

diff --git a/markdownify/__init__.py b/markdownify/__init__.py
index e15ecd4..36e15e7 100644
--- a/markdownify/__init__.py
+++ b/markdownify/__init__.py
@@ -221,11 +221,17 @@ class MarkdownConverter(object):
         title = el.get('title')
         # For the replacement see #29: text nodes underscores are escaped
         if (self.options['autolinks']
-                and text.replace(r'\_', '_') == href
                 and not title
                 and not self.options['default_title']):
-            # Shortcut syntax
-            return '<%s>' % href
+            rtext = text.replace(r'\_', '_')
+            if rtext.endswith('...') and rtext.startswith('http'):
+                # Goodreads-shortened link?
+                if href.startswith(rtext.rstrip('.')):
+                    # force match
+                    rtext = href
+            if rtext == href:
+                # Shortcut syntax
+                return '<%s>' % href
         if self.options['default_title'] and not title:
             title = href
         title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
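
To try the matching rule from the patch outside the library, here is a minimal standalone sketch. The function name `autolink_target` is hypothetical and not part of markdownify; it only mirrors the condition in the diff above (undo the underscore escaping, then accept link text ending in "..." when the visible part is a prefix of the href).

```python
def autolink_target(text, href):
    """Return href if the link qualifies for <href> shortcut syntax, else None.

    Hypothetical helper that mirrors the condition in the patch; the
    title/default_title checks of the real method are left out.
    """
    # Text nodes have underscores escaped, so undo that before comparing.
    rtext = text.replace(r'\_', '_')
    if rtext.startswith('http') and rtext.endswith('...'):
        # Possibly a shortened link: accept it only when the visible
        # part (without the trailing dots) is a prefix of the target.
        if href.startswith(rtext.rstrip('.')):
            rtext = href
    return href if rtext == href else None


full = 'https://archiveofourown.org/works/21085382'
print(autolink_target('https://archiveofourown.org/works/210...', full))
print(autolink_target('click here', full))
```

The first call returns the full URL, the second returns `None`, so ordinary link text still goes through the regular `[text](href)` path.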

chrispy-snps commented 10 months ago

@mirabilos - it doesn't seem reasonable to patch the public Markdownify package to handle the nuances of a particular web page's content. How about preprocessing the HTML in Beautiful Soup to replace truncated link text with the @href value:

import re

from bs4 import BeautifulSoup

html = """
<p>Link: <a href="https://archiveofourown.org/works/21085382"
     rel="nofollow noopener"
     target="_blank">https://archiveofourown.org/works/210...</a></p>
"""
soup = BeautifulSoup(html, 'lxml')

# Replace link text that ends in "..." with the full href
for a in soup.find_all('a', href=True, string=re.compile(r'\.\.\.$')):
    a.string = a['href']

then converting the soup object:

from markdownify import MarkdownConverter
def md(soup, **options):
    return MarkdownConverter(**options).convert_soup(soup)

which should give you the autolinks you want:

>>> print(md(soup))
Link: <https://archiveofourown.org/works/21085382>

mirabilos commented 10 months ago

Chris Papademetrious dixit:

@mirabilos - it doesn't seem reasonable to patch the public Markdownify package to handle the nuances of a particular web page's content. How

But the package already has an autolinks feature, and this fits in well and would help others with this problem (e.g. some Fedi instances also trim links like that). It also has much less overhead.

about preprocessing the HTML in Beautiful Soup: […]

Thanks for the sample code (I haven’t worked with bs4 myself yet). If you’re not merging this (that is, if the arguments above didn’t persuade you) but are merging #92, my only other diff, I might go that way so that I have no local diff.

bye, //mirabilos -- "Using Lynx is like wearing a really good pair of shades: cuts out the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL." -- Henry Nelson, March 1999

mirabilos commented 9 months ago

Not exactly, but…

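import re

_cleanup_traildots = re.compile('\\.\\.\\.$')
[…]
# html is assumed to be the already-parsed BeautifulSoup document
for e in html.find_all('a', href=True, string=_cleanup_traildots):
    href = str(e['href'])
    # only replace when the visible text is a prefix of the real target
    if href.startswith(str(e.string).rstrip('.')):
        e.string.replace_with(href)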

… will do. (The docs say to never assign directly to .string for example.)

chrispy-snps commented 9 months ago

@mirabilos - nice solution! I like the use of multiple filter criteria.

In my own code, I use raw strings for regex expressions to simplify escaping (r'\.\.\.$') but regular strings work too.
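
As a quick check of that point, the sketch below compares the two spellings of the pattern used in this thread. It uses only the standard `re` module; the sample URLs are the ones from the issue.

```python
import re

raw = r'\.\.\.$'        # raw string: backslashes are kept as written
plain = '\\.\\.\\.$'    # regular string: each backslash is doubled

# Both literals produce the same six-character pattern string.
print(raw == plain)     # True

pattern = re.compile(raw)
print(bool(pattern.search('https://archiveofourown.org/works/210...')))    # True
print(bool(pattern.search('https://archiveofourown.org/works/21085382')))  # False
```

Either form compiles to the same regular expression, so the choice is purely a matter of readability.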

matthewwithanm commented 9 months ago

Don't forget that subclassing is always an option!

mirabilos commented 9 months ago

Chris Papademetrious dixit:

@mirabilos - nice solution! I like the use of multiple filter criteria.

Thanks!

In my own code, I use raw strings for regex expressions to simplify escaping (r'\.\.\.$') but regular strings work too.

I don’t use Python/py3k raw strings because they make escaping more complicated (e.g. it is impossible to write a single quote), and writing strings in Python/py3k is already too hard anyway compared with shell, and I’m used to nesting levels of escaping. (Maybe it shows that I don’t program much in py3k…)

bye, //mirabilos -- “Cool, /usr/share/doc/mksh/examples/uhr.gz is a reason to install mksh on every system.” -- XTaran at OpenRheinRuhr, quite enthusiastic