scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

support pre-parsed lxml.etree and faster json #37

Closed: codinguncut closed this issue 7 years ago

codinguncut commented 7 years ago

Please add functions that accept a pre-parsed lxml.etree instead of an htmlstring. Also, using a library such as "ujson" may significantly speed up processing for JSON-LD.

redapple commented 7 years ago

@codinguncut, although it's not documented (nor explicitly tested), two of the extractors already support passing an lxml document directly (e.g. the result of an lxml parser's .fromstring(), which is how .extract() is implemented).

The RDFa extractor is a bit different, since rdflib is tricked into thinking it is handling an xml.dom tree. The lxml-based parser is available as extruct.rdfa.XmlDomHTMLParser, though, and a method can be added to pass an xml.dom-compatible tree.

Regarding speeding up JSON parsing, is ujson the best option these days? (Honest question, I haven't used it in a long time.)

codinguncut commented 7 years ago

Hi, thank you for sharing this (undocumented) functionality.

I'm not sure if ujson is the "best" option (whatever that means), but it's significantly faster than vanilla json (especially on py2) and I've been using it as a robust drop-in replacement.

http://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/

redapple commented 7 years ago

I meant "best" as in "fastest", since speed is one of your points. PRs that try ujson when it is available, and PRs that document the extractors' methods, are welcome.
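For readers wondering what "trying ujson if available" could look like, here is a minimal sketch of the usual optional-import pattern. It is not code from extruct; the names `json_impl` and `parse_jsonld` are made up for illustration, and the only assumption about ujson is that it exposes a `loads()` function compatible with the standard library's.

```python
import json as stdlib_json

try:
    # ujson is an optional third-party package; use it when installed.
    import ujson as json_impl
except ImportError:
    # Otherwise fall back to the standard library, which is always present.
    json_impl = stdlib_json


def parse_jsonld(text):
    """Parse one JSON-LD payload with whichever JSON backend was imported.

    Hypothetical helper for illustration, not part of extruct.
    """
    return json_impl.loads(text)


item = parse_jsonld('{"@type": "Person", "name": "Ann"}')
```

Because the fallback is decided once at import time, callers never need to know which backend is in use.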

redapple commented 7 years ago

extract_items(document, url, *args, **kwargs) methods have been added to all extractors, taking an lxml-parsed document as input. I've moved the ujson feature request to a separate issue, #49.
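To illustrate the split described in this thread (a string-based extract() that parses and then delegates to a document-based extract_items()), here is a rough sketch. `JsonLdSketchExtractor` is a hypothetical stand-in written for this example, not extruct's actual class, and its extraction logic is a simplification. It only assumes lxml is installed.

```python
import json

import lxml.html


class JsonLdSketchExtractor:
    """Hypothetical extractor mirroring the extract / extract_items split."""

    def extract(self, htmlstring, url=None):
        # Parse the markup, then hand the tree to extract_items().
        document = lxml.html.fromstring(htmlstring)
        return self.extract_items(document, url)

    def extract_items(self, document, url=None):
        # Works on an already-parsed lxml tree, so callers can parse once
        # and reuse the same tree across several extractors.
        items = []
        for script in document.xpath('//script[@type="application/ld+json"]'):
            if script.text and script.text.strip():
                items.append(json.loads(script.text))
        return items


HTML = """<html><head>
<script type="application/ld+json">{"@type": "Person", "name": "Ann"}</script>
</head><body><p>hi</p></body></html>"""

# Parse once, then pass the tree directly instead of the HTML string.
document = lxml.html.fromstring(HTML)
extractor = JsonLdSketchExtractor()
items = extractor.extract_items(document, "http://example.com/")
```

The point of the pattern is that a caller who already holds a parsed tree, for example from a crawler, skips the second parse that extract() would otherwise perform.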