[Closed] mirabilos closed this issue 9 months ago
@mirabilos - it doesn't seem reasonable to patch the public Markdownify package to handle the nuances of a particular web page's content. How about preprocessing the HTML in Beautiful Soup to replace truncated link text with the href value:
import re
from bs4 import BeautifulSoup

html = """
<p>Link: <a href="https://archiveofourown.org/works/21085382"
rel="nofollow noopener"
target="_blank">https://archiveofourown.org/works/210...</a></p>
"""

soup = BeautifulSoup(html, 'lxml')
# replace each truncated link text (ending in "...") with its full href
for a in soup.find_all('a', href=True, string=re.compile(r'\.\.\.$')):
    a.string = a['href']
then converting the soup object:
from markdownify import MarkdownConverter

def md(soup, **options):
    # convert a BeautifulSoup object directly, without re-parsing the HTML
    return MarkdownConverter(**options).convert_soup(soup)
which should give you the autolinks you want:
>>> print(md(soup))
Link: <https://archiveofourown.org/works/21085382>
Chris Papademetrious dixit:
@mirabilos - it doesn't seem reasonable to patch the public Markdownify package to handle the nuances of a particular web page's content. How […]
But the package already has an autolinks feature, and this fits in well and would help others with this problem (e.g. some Fedi instances also trim links like that). It also has much less overhead.
about preprocessing the HTML in Beautiful Soup: […]
Thanks for giving this sample code (I haven’t worked with bs4 myself yet). If you’re not merging this (if the arguments above didn’t persuade you) but are merging #92, my only other diff, I might go that way, so that I have no local diff.
bye, //mirabilos -- "Using Lynx is like wearing a really good pair of shades: cuts out the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL." -- Henry Nelson, March 1999
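To illustrate the autolinks feature mentioned above (a minimal sketch assuming the markdownify package is installed; the example.com URLs are placeholders, not from the thread):

from markdownify import markdownify

# When the link text exactly equals the href, markdownify's autolinks
# option (on by default) emits an autolink of the form <url>:
full = markdownify('<a href="https://example.com/x">https://example.com/x</a>')

# When the text is truncated, it no longer matches the href, so a
# regular [text](url) link is produced instead of an autolink:
trunc = markdownify('<a href="https://example.com/works/21085382">'
                    'https://example.com/works/210...</a>')

This is why restoring the truncated text (whether by preprocessing or by a patch) is enough to get autolinks back: the existing feature already handles the text == href case.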
Not exactly, but…
_cleanup_traildots = re.compile('\\.\\.\\.$')
[…]
for e in html.find_all('a', href=True, string=_cleanup_traildots):
    href = str(e['href'])
    # only substitute when the truncated text really is a prefix of the href
    if href.startswith(str(e.string).rstrip('.')):
        e.string.replace_with(href)
… will do. (The docs say never to assign directly to .string, for example.)
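The prefix guard above can be shown in isolation (a stdlib-only sketch; restore_href is a hypothetical helper for illustration, not part of the proposed patch):

import re

_cleanup_traildots = re.compile('\\.\\.\\.$')

def restore_href(link_text, href):
    # Hypothetical helper: return the full href only when the link text
    # ends in "..." AND the text (minus the trailing dots) is a prefix
    # of the href; otherwise leave the text untouched.
    if _cleanup_traildots.search(link_text) and href.startswith(link_text.rstrip('.')):
        return href
    return link_text

# A truncated URL is restored to the full href:
restore_href('https://archiveofourown.org/works/210...',
             'https://archiveofourown.org/works/21085382')
# "read more..." is not a prefix of the href, so it is left alone:
restore_href('read more...', 'https://example.com/post')

The startswith() check is what keeps ordinary "read more..." style link text from being clobbered.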
@mirabilos - nice solution! I like the use of multiple filter criteria.
In my own code, I use raw strings for regex expressions to simplify escaping (r'\.\.\.$') but regular strings work too.
Don't forget that subclassing is always an option!
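Subclassing could look like this (a hedged sketch, not the package's actual fix; TruncatedLinkConverter is a hypothetical name, and *args/**kwargs is used because convert_a's exact signature differs between markdownify versions):

from markdownify import MarkdownConverter

class TruncatedLinkConverter(MarkdownConverter):
    # Restore "..."-truncated link text before the normal <a> handling,
    # so the built-in autolinks logic can recognise text == href.
    def convert_a(self, el, text, *args, **kwargs):
        href = el.get('href') or ''
        if text.endswith('...') and href.startswith(text.rstrip('.')):
            text = href  # treat the truncated text as the full URL
        return super().convert_a(el, text, *args, **kwargs)

md = TruncatedLinkConverter().convert(
    '<a href="https://archiveofourown.org/works/21085382">'
    'https://archiveofourown.org/works/210...</a>')

This keeps the tweak local to the caller, without patching the package or preprocessing the soup.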
Chris Papademetrious dixit:
@mirabilos - nice solution! I like the use of multiple filter criteria.
Thanks!
In my own code, I use raw strings for regex expressions to simplify escaping (r'\.\.\.$') but regular strings work too.
I don’t use Python/py3k raw strings because they make escaping more complicated (e.g. the impossibility of writing a single quote), and writing strings for Python/py3k is too hard already anyway, compared with shell, and I’m used to nesting levels of escaping. (Maybe it is visible that I don’t program much in py3k…)
bye, //mirabilos -- “Cool, /usr/share/doc/mksh/examples/uhr.gz really is a reason to install mksh on every system.” -- XTaran at OpenRheinRuhr, quite enthusiastic
Goodreads’ RSS feeds hide parts of the link:
[…]URL: <a href="https://archiveofourown.org/works/21085382" rel="nofollow noopener" target="_blank">https://archiveofourown.org/works/210...</a><br />[…]
Fix: