microformats / mf2py

Microformats2 parser written in Python
http://microformats.github.io/mf2py/

backcompat.py throws bs4 warning #105

Closed: wumpus closed this issue 6 years ago

wumpus commented 6 years ago
from bs4 import BeautifulSoup

parser = BeautifulSoup('<data></data>').data

spews:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

Perhaps you want to use the same default as parser.py, 'html5lib'?

Also parser.py has an except FeatureNotFound block that calls BeautifulSoup(doc) and will also generate this ugly warning.
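A minimal sketch of the difference, not code taken from mf2py: naming a parser when constructing the soup avoids the warning. It assumes only that bs4 is installed, and it uses 'html.parser' (which ships with Python) as a stand-in for whichever parser mf2py would choose; `count_warnings` is a hypothetical helper.

```python
import warnings
from bs4 import BeautifulSoup

def count_warnings(build):
    """Run `build` and return how many warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # report every warning, even repeats
        build()
    return len(caught)

# No parser named: bs4 guesses one and warns about it.
implicit = count_warnings(lambda: BeautifulSoup('<data></data>').data)

# Parser named explicitly: nothing to guess, so no warning is expected.
explicit = count_warnings(
    lambda: BeautifulSoup('<data></data>', 'html.parser').data
)

print(implicit, explicit)
```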

kartikprabhu commented 6 years ago

backcompat.py does not do any parsing explicitly.

BeautifulSoup generates those warnings when it does not find the specified parser; you seem to have called BeautifulSoup directly.

mf2py defaults to the user-specified parser or to html5lib. If neither works, it just defers to BeautifulSoup.
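The fallback described here could look roughly like the following. This is a sketch based only on the description in this thread, not the actual parser.py; `parse_document` and its `html_parser` argument are hypothetical names.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_document(doc, html_parser=None):
    """Hypothetical helper: prefer the caller's parser, then html5lib."""
    try:
        return BeautifulSoup(doc, features=html_parser or 'html5lib')
    except FeatureNotFound:
        # The requested parser is not installed. Deferring to bs4 without
        # naming a parser is the step that triggers the UserWarning.
        return BeautifulSoup(doc)

soup = parse_document('<p class="h-card">Jane</p>', html_parser='html.parser')
print(soup.p.get_text())
```

Note that the `except` branch is reached whenever the named parser is missing, so the warning appears exactly on systems without html5lib.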

wumpus commented 6 years ago

I don't understand your comment. I quoted the line in backcompat.py that creates a bs parser. "Deferring to BeautifulSoup" causes the warning. This warning is new; it's intended to get everyone to change their code to specify which parser to use.

kartikprabhu commented 6 years ago

@wumpus Ah! yes sorry I got a bit confused. Will fix in the next update.

Thanks!

wumpus commented 6 years ago

Thank you!

kartikprabhu commented 6 years ago

self-note:

possible resolution: put back the older code from https://github.com/microformats/mf2py/blob/65c3699fc964370ccee65ab54fec8ca9febe65a2/mf2py/backcompat.py#L207 (BS documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring-and-new-tag)

and recheck whether the same HTML parser is then used with this change.

kartikprabhu commented 6 years ago

After a bit more thinking, here are possible ways to fix this, each with some drawbacks. cc: @kevinmarks @sknebel, any suggestions are appreciated.

  1. Use the new_tag method to create the new <data> element. This method exists only on the main BS document object, so it fails if the user passes a BS element to the parser instead. That is why the older code from https://github.com/microformats/mf2py/blob/65c3699fc964370ccee65ab54fec8ca9febe65a2/mf2py/backcompat.py#L207 was changed.

  2. Specify html5lib directly when creating the <data> element. The disadvantage is that the default parser is then declared in multiple locations, which risks them going out of sync. This would also ignore any other parser the user specifies.

Not sure what the way out is.
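To illustrate the limitation in option 1, here is a small sketch (not mf2py code). It assumes bs4 is installed and relies on the behaviour of the bs4 releases I am aware of, where unknown attribute lookups on a Tag are treated as child-element searches; the `hentry` markup is just an example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="hentry"></div>', 'html.parser')
el = soup.div  # a Tag, i.e. what a caller might pass instead of the document

# new_tag is a method of the BeautifulSoup document object...
root_has_factory = callable(getattr(soup, 'new_tag', None))
# ...but on a Tag the same attribute lookup is treated as a search for a
# child element named "new_tag", so nothing callable comes back.
element_has_factory = callable(getattr(el, 'new_tag', None))

# With the document object at hand, option 1 needs no parser at all.
el.append(soup.new_tag('data'))
print(root_has_factory, element_has_factory, soup)
```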

wumpus commented 6 years ago

backcompat never paid any attention to the user-specified parser and it only parses that one string. I don't think adding a default of html5lib will cause any harm.

kartikprabhu commented 6 years ago

@wumpus the trouble with that is that if someone does not have html5lib installed, it will throw an error and stop parsing. Right now it at least only throws a warning and uses whatever parser it can find.

wumpus commented 6 years ago

OK, then put a try/except block around it, similar to your other code. (I have no idea what parsers are installed by default, etc.)
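One way to read that suggestion, as a sketch only: try html5lib first and fall back when it is missing. Falling back to 'html.parser' (bundled with Python) rather than to an unnamed parser is an assumption made here so that the fallback path also stays quiet; `make_data_tag` is a hypothetical name, not a function in backcompat.py.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_data_tag():
    """Hypothetical helper: build an empty <data> element without warnings."""
    try:
        soup = BeautifulSoup('<data></data>', 'html5lib')
    except FeatureNotFound:
        # html5lib is not installed; use the parser bundled with Python.
        soup = BeautifulSoup('<data></data>', 'html.parser')
    return soup.data

tag = make_data_tag()
print(tag.name)
```

This still declares the default parser in a second place, so the out-of-sync concern raised earlier in the thread remains.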