kartikprabhu / mf2py

mf2 parser in python (this is an experimental fork)

Use html5lib by default #59

Closed kartikprabhu closed 6 years ago

kartikprabhu commented 6 years ago

Currently mf2py defers to BeautifulSoup to choose the correct HTML parser. BS chooses in the order lxml > html5lib > html.parser (see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use).

For malformed HTML these give different results (see https://github.com/kartikprabhu/mf2py/issues/58#issuecomment-366536321). But html5lib is the closest to browser behaviour.

So we should use html5lib by default unless overridden by the user; if html5lib is not installed, then defer to the BS defaults.
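A minimal sketch of that fallback logic, for discussion only. The names `choose_parser`, `html_parser` and `is_installed` are hypothetical and are not mf2py's actual API; the only assumption about BeautifulSoup is that passing `features=None` lets it pick its own default parser.

```python
from importlib.util import find_spec


def choose_parser(html_parser=None, is_installed=None):
    """Return a parser name for BeautifulSoup, or None to use its default.

    Hypothetical helper for illustration; not part of mf2py.
    """
    if is_installed is None:
        # Default check: can the module be imported in this environment?
        is_installed = lambda name: find_spec(name) is not None

    if html_parser is not None:
        # An explicit choice by the user always wins.
        return html_parser
    if is_installed("html5lib"):
        # Prefer html5lib, since it is the closest to browser behaviour.
        return "html5lib"
    # html5lib is missing: return None so BeautifulSoup picks its own default.
    return None
```

A caller would then do something like `BeautifulSoup(doc, features=choose_parser(user_choice))`, so a user who wants lxml for speed can still pass `"lxml"` explicitly.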

related #41

Discuss! vote.

cc @kevinmarks @sknebel @bear @tommorris @snarfed

kevinmarks commented 6 years ago

+1 on using html5lib by default. The html.parser does very odd things.

kartikprabhu commented 6 years ago

also cc @tantek for outside viewpoint

snarfed commented 6 years ago

fwiw bridgy has overridden mf2py for years to use lxml since it's significantly faster (actually more like less slow: https://github.com/snarfed/bridgy/issues/578#issuecomment-189296654 ), and at scale i care about that more than minor edge case and compatibility differences.

kartikprabhu commented 6 years ago

@snarfed so making html5lib the default will not affect Bridgy's performance, since it is overridden anyway?

aside: Still curious if the implied-name-fix solves your instagram woes.

kevinmarks commented 6 years ago

I remember adding lxml to speed up html5lib, and doing a bit of profiling before, but would the bridgy archive be a good corpus to profile against to see if there are bottlenecks?


kartikprabhu commented 6 years ago

fixed by ^ but keeping open