Open iaincollins opened 5 years ago
Thanks for creating such a great project!
I ran into a bug parsing microdata content where itemprop contains multiple properties, as in the examples below, and thought I'd share what I found:
<meta data-rh="true" property="article:published" itemprop="datePublished dateCreated" content="2019-07-21T09:00:06.000Z"/>
<span itemProp="publisher copyrightHolder provider sourceOrganization" itemscope="" itemType="http://schema.org/NewsMediaOrganization" itemID="https://www.nytimes.com">
<figure itemprop="associatedMedia image" itemscope itemtype="http://schema.org/ImageObject" data-component="image" class="element element-image img--landscape fig--narrow-caption fig--has-shares " data-media-id="f82028d62b1edd7417d7d3773c4abf0d4fa86174" id="img-3"> <meta itemprop="url" content="https://i.guim.co.uk/img/media/f82028d62b1edd7417d7d3773c4abf0d4fa86174/0_272_6435_3861/master/6435.jpg?width=700&quality=85&auto=format&fit=max&s=016df6a3f33eabe3cbca39eb389a60fb"> </figure>
Markup like this is parsed correctly in Google's Structured Data Testing Tool, but web-auto-extractor does not currently split input based on spaces.
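For context, the HTML microdata specification defines itemprop as an unordered set of unique space-separated tokens, so each token names a separate property with the same value. A minimal sketch of that tokenization follows; parseItemprop is a hypothetical helper name used for illustration and is not part of web-auto-extractor:

```javascript
// Hypothetical helper (not part of web-auto-extractor):
// split an itemprop attribute value into its individual property names.
function parseItemprop(value) {
  // HTML treats space, tab, LF, FF and CR as ASCII whitespace separators
  const tokens = (value || '').split(/[ \t\n\f\r]+/).filter(Boolean)
  // Drop duplicate names while keeping first-seen order
  return [...new Set(tokens)]
}

console.log(parseItemprop('datePublished dateCreated'))
// two names: datePublished and dateCreated
```

Each returned name would then be assigned the same property value, which is how the examples above are expected to be read.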
I resolved this in a project which uses web-auto-extractor by doing this:
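
    const __transformStructuredData = (structuredData) => {
      // Note: this modifies structuredData in place and returns the same object
      const result = structuredData
      // Walk every microdata item, grouped by schema type
      Object.keys(result.microdata || {}).forEach(schema => {
        result.microdata[schema].forEach(object => {
          Object.keys(object).forEach(key => {
            // A key such as "datePublished dateCreated" names several properties
            if (/\s/.test(key)) {
              key.split(/\s+/).filter(Boolean).forEach(newKey => {
                object[newKey] = object[key]
              })
              delete object[key]
            }
          })
        })
      })
      return result
    }
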
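The workaround above only touches top-level keys of each item and edits the result in place. As a rough sketch of a variant that also handles nested items (such as the ImageObject in the figure example) and returns a copy instead, something like the following could work. The function name splitItempropKeys and the sample result shape are assumptions for illustration, inferred from the workaround rather than taken from the web-auto-extractor documentation:

```javascript
// Hypothetical helper, not part of web-auto-extractor.
// Returns a copy of the input in which every whitespace-separated key
// is expanded into one key per property name, at any nesting depth.
function splitItempropKeys(value) {
  if (Array.isArray(value)) return value.map(splitItempropKeys)
  if (value === null || typeof value !== 'object') return value

  const out = {}
  for (const [key, inner] of Object.entries(value)) {
    const cleaned = splitItempropKeys(inner)
    const names = key.split(/\s+/).filter(Boolean)
    // Keys without whitespace yield a single name and pass through unchanged
    for (const name of names.length ? names : [key]) {
      out[name] = cleaned
    }
  }
  return out
}

// Assumed result shape: { microdata: { <type>: [ item, ... ] } }
const sample = {
  microdata: {
    NewsArticle: [{
      'datePublished dateCreated': '2019-07-21T09:00:06.000Z',
      'associatedMedia image': { url: 'https://example.com/a.jpg' }
    }]
  }
}

const fixed = splitItempropKeys(sample)
console.log(Object.keys(fixed.microdata.NewsArticle[0]))
// keys: datePublished, dateCreated, associatedMedia, image
```

As in the original workaround, names that came from the same itemprop share one value, so associatedMedia and image point at the same nested object here.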
I'm aware there are some other open PRs related to whitespace trimming.
If an enhancement like this appeals, I'd be happy to raise a PR.