Closed kartikprabhu closed 6 years ago
+1 on using html5lib by default. The html.parser does very odd things.
also cc @tantek for outside viewpoint
fwiw bridgy has overridden mf2py for years to use lxml since it's significantly faster (actually more like less slow: https://github.com/snarfed/bridgy/issues/578#issuecomment-189296654 ), and at scale i care about that more than minor edge case and compatibility differences.
@snarfed so making html5lib the default will not affect Bridgy's performance, since it is overridden anyway?
aside: Still curious if the implied-name-fix solves your instagram woes.
I remember adding lxml to speed up html5lib, and doing a bit of profiling before, but would the bridgy archive be a good corpus to profile against to see if there are bottlenecks?
On 18 Feb 2018 20:22, "Ryan Barrett" notifications@github.com wrote:
fwiw bridgy has overridden mf2py for years https://github.com/snarfed/bridgy/blob/master/util.py#L632 to use lxml since it's significantly faster https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers (actually more like less slow: snarfed/bridgy#578 (comment) https://github.com/snarfed/bridgy/issues/578#issuecomment-189296654 ), and at scale i care about that more than minor edge case and compatibility differences.
fixed by ^ but keeping open
Currently mf2py defers to BeautifulSoup to choose the HTML parser. BS picks the best installed parser, in the order lxml > html5lib > html.parser (see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use ).
For malformed HTML these parsers give different results (see https://github.com/kartikprabhu/mf2py/issues/58#issuecomment-366536321 ), and html5lib is the closest to browser behaviour.
So we should use html5lib by default unless overridden by the user; if html5lib is not installed, then defer to BS defaults.
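A rough sketch of what that fallback could look like, in case it helps the discussion. The helper names `pick_parser` and `make_soup` and the `html_parser` argument are made up for illustration and are not mf2py's actual API; the only real call assumed is `BeautifulSoup(markup, features=...)`.

```python
import importlib.util


def pick_parser(user_choice=None):
    """Return a parser name for BeautifulSoup, or None to let BS choose.

    Hypothetical helper: an explicit user choice wins, then html5lib if it
    is installed, otherwise None so BeautifulSoup falls back to its own order.
    """
    if user_choice is not None:
        return user_choice
    # find_spec returns None when the package cannot be imported
    if importlib.util.find_spec("html5lib") is not None:
        return "html5lib"
    return None


def make_soup(markup, html_parser=None):
    """Build a soup with the parser chosen by pick_parser (hypothetical)."""
    from bs4 import BeautifulSoup  # imported lazily so pick_parser works without bs4

    features = pick_parser(html_parser)
    if features is None:
        # No preference and no html5lib: defer to BeautifulSoup's defaults
        return BeautifulSoup(markup)
    return BeautifulSoup(markup, features=features)
```

With this shape, a caller like Bridgy that already passes `"lxml"` explicitly would see no change, since the user's choice is checked first.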
related #41
Discuss! vote.
cc @kevinmarks @sknebel @bear @tommorris @snarfed