kartikprabhu / mf2py

mf2 parser in python (this is an experimental fork)

ported backward compatibility classes and properties from php-mf2 #37

Closed kylewm closed 10 years ago

kylewm commented 10 years ago

Filled in the backward_compat module with class mappings adapted from php-mf2. BeautifulSoup tags are augmented with mf2-style classes before the usual parsing takes place.
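
A rough sketch of the augmentation idea, for readers who have not seen the php-mf2 approach. This is not the code from this PR: the function name `augment_classes`, the `legacy_root` parameter, and the two tables are hypothetical, and the mappings shown are only a small subset of the classic-microformats equivalents. It works on plain lists of class names so it runs without BeautifulSoup; in the parser, the result would presumably be written back to each tag's class attribute before mf2 parsing starts.

```python
# Illustrative subset of legacy-to-mf2 class mappings (assumption: the real
# backward_compat table in this PR is larger and may be organised differently).
ROOT_MAP = {
    "vcard": "h-card",
    "hentry": "h-entry",
    "vevent": "h-event",
}

# Property mappings only apply inside the legacy root they belong to.
PROPERTY_MAP = {
    "vcard": {"fn": "p-name", "url": "u-url", "photo": "u-photo"},
    "hentry": {
        "entry-title": "p-name",
        "entry-content": "e-content",
        "published": "dt-published",
    },
}


def augment_classes(classes, legacy_root=None):
    """Return the class list plus mf2 equivalents, keeping order, no duplicates.

    `legacy_root` is the nearest enclosing legacy root class (for example
    "hentry"), or None when the element is not inside one.
    """
    result = list(classes)
    for name in classes:
        extra = ROOT_MAP.get(name)
        if extra is None and legacy_root is not None:
            extra = PROPERTY_MAP.get(legacy_root, {}).get(name)
        if extra is not None and extra not in result:
            result.append(extra)
    return result
```

With this sketch, `augment_classes(["vcard", "author"])` gains `h-card`, while a bare `fn` outside any legacy root is left untouched, which is why property mappings are keyed by root.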

I'm happy with this for now, and as of the second commit, it passes Travis CI on Python 2.6, 2.7, and 3.3.

kylewm commented 10 years ago

Ack! It threw an exception parsing http://silencematters.com/:

kmahan@orin:~/redwind$ tail -n 40 app.log
  File "./redwind/api.py", line 121, in convert_mf2
    p = Parser(url=url)
  File "mf2py/mf2py/parser.py", line 76, in __init__
    self.parse()
  File "mf2py/mf2py/parser.py", line 275, in parse
    parse_el(self.__doc__, ctx, True)
  File "mf2py/mf2py/parser.py", line 271, in parse_el
    parse_el(child, ctx)
  File "mf2py/mf2py/parser.py", line 271, in parse_el
    parse_el(child, ctx)
  File "mf2py/mf2py/parser.py", line 271, in parse_el
    parse_el(child, ctx)
  File "mf2py/mf2py/parser.py", line 271, in parse_el
    parse_el(child, ctx)
  File "mf2py/mf2py/parser.py", line 266, in parse_el
    result = handle_microformat(potential_microformats, el, top_level)
  File "mf2py/mf2py/parser.py", line 91, in handle_microformat
    child_props, child_children = parse_props(child)
  File "mf2py/mf2py/parser.py", line 202, in parse_props
    child_properties, child_microformats = parse_props(child)
  File "mf2py/mf2py/parser.py", line 202, in parse_props
    child_properties, child_microformats = parse_props(child)
  File "mf2py/mf2py/parser.py", line 202, in parse_props
    child_properties, child_microformats = parse_props(child)
  File "mf2py/mf2py/parser.py", line 202, in parse_props
    child_properties, child_microformats = parse_props(child)
  File "mf2py/mf2py/parser.py", line 151, in parse_props
    value = parse_property.text(el)
  File "mf2py/mf2py/parse_property.py", line 42, in text
    return el.get_text()
  File "/home/kmahan/redwind/venv/local/lib/python3.3/dist-packages/bs4/element.py", line 852, in get_text
    strip, types=types)])
  File "/home/kmahan/redwind/venv/local/lib/python3.3/dist-packages/bs4/element.py", line 851, in <listcomp>
    return separator.join([s for s in self._all_strings(
  File "/home/kmahan/redwind/venv/local/lib/python3.3/dist-packages/bs4/element.py", line 827, in _all_strings
    for descendant in self.descendants:
  File "/home/kmahan/redwind/venv/local/lib/python3.3/dist-packages/bs4/element.py", line 1198, in descendants
    current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'

kartikprabhu commented 10 years ago

Testing on old mf blogs:

Passes:

Fails:

kylewm commented 10 years ago

Added a "me too" and a pared-down version of the failing HTML to this BeautifulSoup4 bug: https://bugs.launchpad.net/beautifulsoup/+bug/1270611

kylewm commented 10 years ago

I'm not sure how to move this forward. As far as I have been able to tell it's 100% a bug in BS4+html5lib when parsing invalid HTML. I definitely don't want to enforce a particular parser, and I don't really want to work too hard to avoid bugs that aren't in mf2py.

One option would be to fail a little bit more gracefully, by catching the AttributeError and throwing a more descriptive one.
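
A minimal sketch of what that could look like, assuming the failure always surfaces as an AttributeError out of `el.get_text()` as in the traceback above. The names `TextExtractionError` and `safe_text` are hypothetical and do not exist in mf2py; the `url` argument is only there to make the message more useful.

```python
class TextExtractionError(Exception):
    """Hypothetical exception type; mf2py does not define this name."""


def safe_text(el, url=None):
    """Return el.get_text(), re-raising tree-traversal failures with context.

    `el` is expected to be a BeautifulSoup element, but anything with a
    get_text() method works, which keeps the sketch testable on its own.
    """
    try:
        return el.get_text()
    except AttributeError as exc:
        where = " while parsing {0}".format(url) if url else ""
        raise TextExtractionError(
            "could not extract text{0}; the HTML parser may have built a "
            "broken tree from malformed markup ({1})".format(where, exc)
        )
```

The trade-off is that catching AttributeError this broadly would also relabel unrelated mistakes, such as passing None for `el`, so the original message is kept in the new one.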

kartikprabhu commented 10 years ago

Here is my take: this bug with bs4+html5lib is not specific to backwards compatibility. It will happen on any site with similarly malformed HTML.

I think the backward compatibility code can be merged and this bug can be dealt with separately.

kylewm commented 10 years ago

I tend to agree. I rebased this branch onto master (to which I also made one minor change in setup.py: "BeautifulSoup" => "BeautifulSoup4"), and it should be ready for merging.