Closed codinguncut closed 7 years ago
@codinguncut,
Although it's not documented (nor explicitly tested), 2 of the extractors already support passing an lxml document directly (e.g. the result of an lxml parser's .fromstring(), which is how it's implemented for .extract()):
extruct.jsonld.JsonLdExtractor.extract_items(document)
extruct.w3cmicrodata.LxmlMicrodataExtractor.extract_items(document, url)
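To make the two calls above concrete, here is a minimal sketch of parsing a page once and handing the same tree to both extractors. The helper names `parse_once` and `extract_from_tree` are made up for illustration, and the use of `lxml.html.fromstring` is an assumption (`.extract()` may configure its parser differently). Since these methods are undocumented, the shape of the return values may vary between versions.

```python
def parse_once(html):
    """Parse an HTML string into an lxml tree (hypothetical helper)."""
    # Imported inside the function so lxml is only required when this is called.
    import lxml.html

    return lxml.html.fromstring(html)


def extract_from_tree(document, url):
    """Run the JSON-LD and microdata extractors on an already-parsed tree.

    Hypothetical helper: the two extract_items() calls mirror the
    signatures quoted in this thread.
    """
    # Imported inside the function so extruct is only required when this is called.
    from extruct.jsonld import JsonLdExtractor
    from extruct.w3cmicrodata import LxmlMicrodataExtractor

    return {
        "json-ld": JsonLdExtractor().extract_items(document),
        "microdata": LxmlMicrodataExtractor().extract_items(document, url),
    }
```

Usage would be along the lines of `extract_from_tree(parse_once(html), "http://example.com/")`, so the HTML is parsed only once however many extractors run on it.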
The RDFa extractor is a bit different since rdflib is tricked into thinking it is handling an xml.dom tree, but the lxml parser is available: extruct.rdfa.XmlDomHTMLParser, and a method can be added to pass an xml.dom compatible tree.
Regarding speeding up JSON parsing, is ujson the best option these days? (honest question, I haven't used it in a long time)
Hi, thank you for sharing this (undocumented) functionality.
I'm not sure if ujson is the "best" option (whatever that means), but it's significantly faster than vanilla json (especially on py2) and I've been using it as a robust drop-in replacement.
http://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/
I meant "best" as "fastest", as this is one of your points. PRs for trying ujson if available, and for documenting the extractors' methods, are welcome.
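For reference, "trying ujson if available" usually comes down to an import fallback like the sketch below. The helper name `parse_jsonld` is hypothetical and this is not how extruct itself is implemented. It only shows the pattern: use ujson when it is installed, otherwise fall back to the standard-library json module.

```python
try:
    # Optional third-party parser, used only when it is installed.
    import ujson as json
except ImportError:
    # Standard-library fallback with the same loads() entry point.
    import json


def parse_jsonld(text):
    """Decode the text of a JSON-LD script block (hypothetical helper)."""
    return json.loads(text)
```

Both modules expose `loads()`, so the rest of the code does not need to know which one was imported.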
extract_items(document, url, *args, **kwargs) methods have been added to all extractors, taking an lxml-parsed document as input.
I've moved the ujson feature request to another issue, #49.
Please add functions to provide a pre-parsed lxml.etree instead of htmlstring. Also, using a library such as "ujson" may significantly speed up processing for jsonld.