indix / web-auto-extractor

Automatically extracts structured information from webpages
MIT License

Improve microdata parsing when itemprop contains multiple space-separated properties #26

Open iaincollins opened 5 years ago

iaincollins commented 5 years ago

Thanks for creating such a great project!

I ran into a bug parsing microdata where a single itemprop attribute contains multiple properties, and thought I'd share the details. Some examples:

<meta data-rh="true" property="article:published" itemprop="datePublished dateCreated" content="2019-07-21T09:00:06.000Z"/>
<span itemProp="publisher copyrightHolder provider sourceOrganization" itemscope="" itemType="http://schema.org/NewsMediaOrganization" itemID="https://www.nytimes.com">
<figure itemprop="associatedMedia image" itemscope itemtype="http://schema.org/ImageObject" data-component="image" class="element element-image img--landscape  fig--narrow-caption fig--has-shares " data-media-id="f82028d62b1edd7417d7d3773c4abf0d4fa86174" id="img-3">
  <meta itemprop="url" content="https://i.guim.co.uk/img/media/f82028d62b1edd7417d7d3773c4abf0d4fa86174/0_272_6435_3861/master/6435.jpg?width=700&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=016df6a3f33eabe3cbca39eb389a60fb">
</figure>

Markup like this is parsed correctly by Google's Structured Data Testing Tool, but web-auto-extractor does not currently split the itemprop value on spaces, so the whole string (for example "datePublished dateCreated") ends up as a single key.
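
To illustrate the expected behaviour, here is a minimal sketch of splitting an itemprop value into individual property names. The helper name `splitItemprop` is hypothetical and not part of web-auto-extractor; it assumes the HTML microdata rule that itemprop holds a set of unique tokens separated by ASCII whitespace.

```javascript
// Hypothetical helper (not part of web-auto-extractor): split an itemprop
// attribute value into the individual property names it contains.
function splitItemprop(value) {
  // Split on runs of ASCII whitespace (space, tab, LF, FF, CR) and drop
  // the empty strings produced by leading or trailing whitespace.
  const tokens = String(value).split(/[ \t\n\f\r]+/).filter(Boolean)
  // itemprop is a set of unique tokens, so remove duplicates while
  // keeping the order of first appearance.
  return [...new Set(tokens)]
}

const names = splitItemprop('publisher copyrightHolder provider sourceOrganization')
console.log(names.length) // 4 property names, one per token
```

Each returned name would then receive the same value, which is what the examples above appear to intend.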

I resolved this in a project which uses web-auto-extractor by doing this:

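// Workaround: expand keys such as "datePublished dateCreated" into one key per property.
// Note: this mutates (and returns) the object passed in.
const __transformStructuredData = (structuredData) => {
  const result = structuredData
  Object.keys(result.microdata).forEach(schema => {
    result.microdata[schema].forEach(object => {
      Object.keys(object).forEach(key => {
        // Split on any run of whitespace so repeated spaces don't create empty keys
        const newKeys = key.trim().split(/\s+/)
        if (newKeys.length > 1) {
          newKeys.forEach(newKey => {
            object[newKey] = object[key]
          })
          delete object[key]
        }
      })
    })
  })
  return result
}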
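
The workaround above only looks at the top-level keys of each item and modifies its input in place. As a rough sketch of an alternative, the hypothetical helper below (`expandMultiPropertyKeys`, not part of web-auto-extractor) returns a copy instead and also descends into nested items. The input shape used in the usage example is an assumption modelled on the workaround: `microdata[schemaName]` is an array of plain objects.

```javascript
// Hypothetical helper: returns a copy of `value` in which every key holding
// several whitespace-separated property names is expanded into one key per
// name. Nested objects and arrays are processed recursively.
function expandMultiPropertyKeys(value) {
  if (Array.isArray(value)) {
    return value.map(expandMultiPropertyKeys)
  }
  if (value === null || typeof value !== 'object') {
    return value // strings, numbers, etc. are returned unchanged
  }
  const out = {}
  for (const [key, child] of Object.entries(value)) {
    const expanded = expandMultiPropertyKeys(child)
    const names = key.split(/\s+/).filter(Boolean)
    // Keys with no usable name (e.g. an empty string) are kept as they are.
    for (const name of names.length > 0 ? names : [key]) {
      out[name] = expanded
    }
  }
  return out
}

// Assumed input shape; the values are taken from the examples in this issue.
const parsed = {
  microdata: {
    NewsArticle: [
      {
        'datePublished dateCreated': '2019-07-21T09:00:06.000Z',
        'associatedMedia image': { url: 'https://example.com/6435.jpg' }
      }
    ]
  }
}

const fixed = { ...parsed, microdata: expandMultiPropertyKeys(parsed.microdata) }
console.log(Object.keys(fixed.microdata.NewsArticle[0]))
```

Two caveats with this sketch: names expanded from the same key share one value object rather than separate copies, and a later key silently overwrites an earlier key with the same name.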

I'm aware there are some other open PRs related to whitespace trimming.

If an enhancement like this appeals, I'd be happy to raise a PR.