Closed: kylewm closed this 10 years ago
ack! threw an exception parsing http://silencematters.com/
kmahan@orin:~/redwind$ tail -n 40 app.log
File "./redwind/api.py", line 121, in convert_mf2
p = Parser(url=url)
File "mf2py/mf2py/parser.py", line 76, in __init__
self.parse()
File "mf2py/mf2py/parser.py", line 275, in parse
parse_el(self.__doc__, ctx, True)
File "mf2py/mf2py/parser.py", line 271, in parse_el
parse_el(child, ctx)
File "mf2py/mf2py/parser.py", line 271, in parse_el
parse_el(child, ctx)
File "mf2py/mf2py/parser.py", line 271, in parse_el
parse_el(child, ctx)
File "mf2py/mf2py/parser.py", line 271, in parse_el
parse_el(child, ctx)
File "mf2py/mf2py/parser.py", line 266, in parse_el
result = handle_microformat(potential_microformats, el, top_level)
File "mf2py/mf2py/parser.py", line 91, in handle_microformat
child_props, child_children = parse_props(child)
File "mf2py/mf2py/parser.py", line 202, in parse_props
child_properties, child_microformats = parse_props(child)
File "mf2py/mf2py/parser.py", line 202, in parse_props
child_properties, child_microformats = parse_props(child)
File "mf2py/mf2py/parser.py", line 202, in parse_props
child_properties, child_microformats = parse_props(child)
File "mf2py/mf2py/parser.py", line 202, in parse_props
child_properties, child_microformats = parse_props(child)
File "mf2py/mf2py/parser.py", line 151, in parse_props
value = parse_property.text(el)
File "mf2py/mf2py/parse_property.py", line 42, in text
return el.get_text()
File "/home/kmahan/redwind/venv/local/lib/python3.3/dist-packages/bs4/element.py", line 852, in get_text
strip, types=types)])
File "/home/kmahan/redwind/venv/local/lib/python3.3/dist-packages/bs4/element.py", line 851, in <listcomp>
return separator.join([s for s in self._all_strings(
File "/home/kmahan/redwind/venv/local/lib/python3.3/dist-packages/bs4/element.py", line 827, in _all_strings
for descendant in self.descendants:
File "/home/kmahan/redwind/venv/local/lib/python3.3/dist-packages/bs4/element.py", line 1198, in descendants
current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'
Testing on old mf blogs:
post-title
as there is no entry-title
)<a>
tag that is not closed.
I added a "me too" and a pared-down version of the failing HTML to this BeautifulSoup4 bug: https://bugs.launchpad.net/beautifulsoup/+bug/1270611
I'm not sure how to move this forward. As far as I can tell, it's 100% a bug in BS4+html5lib when parsing invalid HTML. I definitely don't want to enforce a particular parser, and I don't really want to work too hard to avoid bugs that aren't in mf2py.
One option would be to fail a little more gracefully, by catching the AttributeError and raising a more descriptive exception.
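A minimal sketch of that option, assuming the failure always surfaces from the `el.get_text()` call seen in the traceback. `MalformedDocumentError` and `safe_get_text` are hypothetical names for illustration, not part of mf2py or BeautifulSoup, and `raise ... from` is Python 3 syntax, so a 2.x-compatible version would need a different way to keep the original error.

```python
class MalformedDocumentError(ValueError):
    """Hypothetical exception type; not part of mf2py or BeautifulSoup."""


def safe_get_text(el, url=None):
    """Return el.get_text(), turning an AttributeError into a clearer error.

    `el` is anything with a get_text() method, such as a BeautifulSoup tag.
    `url` is optional and only used to make the message more helpful.
    """
    try:
        return el.get_text()
    except AttributeError as exc:
        where = " while parsing {0}".format(url) if url else ""
        # Chain the original exception so its traceback is still shown.
        raise MalformedDocumentError(
            "could not extract text{0}; the HTML may be malformed "
            "(original error: {1})".format(where, exc)
        ) from exc
```

One caveat with this design: catching AttributeError this broadly could also hide a genuine programming error in the caller, so the wrapper is probably best kept around the single call that is known to fail.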
Here is my take: this bug with bs4+html5lib is not specific to backward compatibility. It will happen on any site with similarly malformed HTML.
I think the backward-compatibility code can be merged and this bug can be dealt with separately.
I tend to agree. I rebased this branch onto master (where I also made one minor change to setup.py, "BeautifulSoup" => "BeautifulSoup4"), and it should be ready for merging.
Filled in the backward_compat module with class mappings adapted from php-mf2. BeautifulSoup tags are augmented with mf2-style classes before the usual parsing takes place.
I'm happy with this for now, and as of the second commit, it passes Travis CI for Python 2.6, 2.7, and 3.3.
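The class augmentation described above could look roughly like the following. This is a simplified sketch, not the actual backward_compat module: the tables are a small illustrative subset, the function names are hypothetical, and it works on plain lists of class names rather than on BeautifulSoup tags.

```python
# Illustrative subset only; the real tables in mf2py and php-mf2 are larger.
CLASSIC_ROOTS = {"hentry": "h-entry", "vcard": "h-card"}

# Property mappings depend on the enclosing root microformat.
CLASSIC_PROPERTIES = {
    "h-entry": {"entry-title": "p-name", "entry-content": "e-content"},
    "h-card": {"fn": "p-name", "url": "u-url"},
}


def _augment(classes, mapping):
    """Return a copy of classes with mapped mf2 names appended, no duplicates."""
    result = list(classes)
    for name in classes:
        mf2_name = mapping.get(name)
        if mf2_name and mf2_name not in result:
            result.append(mf2_name)
    return result


def augment_root(classes):
    """Add mf2 root classes (e.g. h-entry) for classic roots (e.g. hentry)."""
    return _augment(classes, CLASSIC_ROOTS)


def augment_property(classes, mf2_root):
    """Add mf2 property classes valid inside the given mf2 root."""
    return _augment(classes, CLASSIC_PROPERTIES.get(mf2_root, {}))
```

For example, `augment_root(["post", "hentry"])` returns `["post", "hentry", "h-entry"]`, and `augment_property(["fn"], "h-entry")` leaves the list unchanged because `fn` is only mapped inside `h-card`. The original classes are kept, so the regular mf2 parsing pass can run afterwards without a separate code path for classic microformats.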