scrapinghub / mdr

A Python library to detect and extract listing data from HTML pages.

Bug in the algorithm implementation: RecordAligner.align misses one of the children #6

Open dportabella opened 7 years ago

dportabella commented 7 years ago
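from lxml import etree
from mdr import RecordAligner, Record

def toString(tree):
    # encoding="unicode" returns text rather than bytes, so it prints cleanly
    return etree.tostring(tree, pretty_print=True, encoding="unicode")

# Same three children under <root>; t1 nests <a1/> in <a>, t2 nests <b1/> in <b>
t1 = etree.XML("""<root><a><a1/></a><b/><c/></root>""")
t2 = etree.XML("""<root><a/><b><b1/></b><c/></root>""")

# Align with t1 first, then with t2 first
seed, mappings = RecordAligner().align([Record(t1), Record(t2)])
print(toString(seed[0]))

seed, mappings = RecordAligner().align([Record(t2), Record(t1)])
print(toString(seed[0]))

# Actual output of the first call:
# <root>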
#  <a/>
#  <b>
#    <b1/>
#  </b>
#  <c/>
# </root>
#
# Actual output of the second call:
# <root>
#  <a>
#    <a1/>
#  </a>
#  <b/>
#  <c/>
# </root>
#
# Shouldn't both calls instead return:
# <root>
#  <a>
#    <a1/>
#  </a>
#  <b>
#    <b1/>
#  </b>
#  <c/>
# </root>

Is this a problem in the algorithm or in the implementation?
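
To make the expected result concrete, here is a minimal sketch of an order-independent merge of the two trees. It is not mdr's alignment algorithm and makes no claim about how `RecordAligner` works internally. `merge_into` is a hypothetical helper written only for this illustration. It assumes children can be matched by tag name alone, and it uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra dependencies.

```python
import copy
import xml.etree.ElementTree as ET


def merge_into(seed, other):
    """Recursively add to `seed` every child of `other` that `seed` lacks.

    Hypothetical helper: children are matched naively by tag name, and the
    first child with a given tag is treated as the match.
    """
    first_by_tag = {}
    for child in seed:
        first_by_tag.setdefault(child.tag, child)

    for child in other:
        match = first_by_tag.get(child.tag)
        if match is None:
            # No counterpart in the seed: copy the whole subtree across.
            new_child = copy.deepcopy(child)
            seed.append(new_child)
            first_by_tag[child.tag] = new_child
        else:
            # Counterpart exists: descend and merge the grandchildren.
            merge_into(match, child)
    return seed


t1 = ET.fromstring("<root><a><a1/></a><b/><c/></root>")
t2 = ET.fromstring("<root><a/><b><b1/></b><c/></root>")

# Merge in both orders, working on copies so t1 and t2 stay untouched.
merged_12 = merge_into(copy.deepcopy(t1), t2)
merged_21 = merge_into(copy.deepcopy(t2), t1)

print(ET.tostring(merged_12, encoding="unicode"))
print(ET.tostring(merged_21, encoding="unicode"))
```

With these two inputs, both orders produce a tree that contains `<a1/>` under `<a>` and `<b1/>` under `<b>`, which is the result I expected from `align` regardless of which record comes first.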