scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

Matching Order in LxmlMicrodataExtractor._extract_property_value #160

Open kelvinso opened 3 years ago

kelvinso commented 3 years ago

I noticed that the matching order in _extract_property_value seems to be inconsistent with https://www.w3.org/TR/microdata/#values. That document lists "If the element has a content attribute" as the second matching case. In LxmlMicrodataExtractor._extract_property_value, however, that check is second to last in the matching order.

Should this case

    elif node.get("content"):
        return node.get("content")

in w3cmicrodata.py be moved before the meta tag check at line 186?
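
To make the effect of the ordering concrete, here is a minimal, self-contained sketch. It is not extruct's actual code: `property_value` and its `content_first` flag are hypothetical names, it uses the standard library's `xml.etree.ElementTree` instead of lxml, and item extraction for `itemscope` is reduced to a placeholder. It also assumes the spec's wording ("has a content attribute") means the attribute may be empty, so the early check uses `is not None`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

SRC_TAGS = {"audio", "embed", "iframe", "img", "source", "track", "video"}
HREF_TAGS = {"a", "area", "link"}


def property_value(node, base_url="", content_first=True):
    """Hypothetical helper illustrating the two matching orders.

    content_first=True  -> `content` is checked right after `itemscope`,
                           as the issue reads the W3C document.
    content_first=False -> `content` is checked second to last,
                           as described for the existing code.
    """
    if node.get("itemscope") is not None:
        return node  # placeholder: real code would extract a nested item
    if content_first and node.get("content") is not None:
        return node.get("content")
    if node.tag == "meta":
        return node.get("content", "")
    if node.tag in SRC_TAGS:
        return urljoin(base_url, node.get("src", ""))
    if node.tag in HREF_TAGS:
        return urljoin(base_url, node.get("href", ""))
    if node.tag == "object":
        return urljoin(base_url, node.get("data", ""))
    if node.tag in ("data", "meter"):
        return node.get("value", "")
    if node.tag == "time":
        return node.get("datetime", "")
    if node.get("content"):
        return node.get("content")  # late fallback position
    return "".join(node.itertext())


# A <time> element carrying both `content` and `datetime`.
node = ET.fromstring(
    '<time itemprop="startDate" content="2014-01-01" '
    'datetime="2013-12-31">New Year</time>'
)
print(property_value(node, content_first=True))   # 2014-01-01
print(property_value(node, content_first=False))  # 2013-12-31
```

The two orders only disagree for elements that have a `content` attribute and also match one of the tag-specific cases (for example `img`, `a`, or `time`); for `meta` and for plain elements such as `span` the result is the same either way.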

Thanks a lot! Kelvin

Gallaecio commented 3 years ago

Yeah, it looks like the changes they’ve made to the specification since 2013 (that code is from 2014) include allowing content on any node, which back in 2013 was non-standard yet supported by extruct.

We should probably review the changes to the standard in general; there may be more surprises.