indix / web-auto-extractor

Automatically extracts structured information from webpages
MIT License
108 stars, 30 forks

Extraction of arrays #6

Closed by dwolters 8 years ago

dwolters commented 8 years ago

I'm using the web auto extractor to retrieve recipes embedded as microdata. Unfortunately, the extractor does not extract the ingredient list as an array. Instead, each ingredient overwrites the previous one in the object, so only the last ingredient remains.

Example:

<ul class="ingredientList">
  <li itemprop="ingredients">&frac12; cup butter</li>
  <li itemprop="ingredients">&frac12; cup powdered sugar</li>
  <li itemprop="ingredients">&frac12; cup chocolate chips</li>
</ul>

The web auto extractor only extracts:
{ ..., ingredients: "&frac12; cup chocolate chips", ... }

instead of:
{ ..., ingredients: ["&frac12; cup butter", "&frac12; cup powdered sugar", "&frac12; cup chocolate chips"], ... }
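To make the expected behavior concrete, here is a minimal sketch of the accumulation rule: the first value for a property is stored as a plain string, and any repeat of the same `itemprop` turns it into an array. It is written in Python with only the standard library, purely for illustration. It is not the web auto extractor's code and not the fix from any PR. The class name `ItempropCollector` is hypothetical, and the sketch assumes flat elements with no tags nested inside an `itemprop` element.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Hypothetical helper: collect the text of elements carrying itemprop."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &frac12; into text.
        super().__init__(convert_charrefs=True)
        self.result = {}
        self._current = None  # itemprop name of the element being read
        self._buffer = []     # text chunks seen inside that element

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name is not None:
            self._current = name
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumption: no nested tags, so any end tag closes the property.
        if self._current is None:
            return
        value = "".join(self._buffer).strip()
        name, self._current = self._current, None
        if name not in self.result:
            # First occurrence: keep a single value.
            self.result[name] = value
        elif isinstance(self.result[name], list):
            # Third and later occurrences: append to the existing array.
            self.result[name].append(value)
        else:
            # Second occurrence: promote the single value to an array.
            self.result[name] = [self.result[name], value]


html = """
<ul class="ingredientList">
  <li itemprop="ingredients">&frac12; cup butter</li>
  <li itemprop="ingredients">&frac12; cup powdered sugar</li>
  <li itemprop="ingredients">&frac12; cup chocolate chips</li>
</ul>
"""

collector = ItempropCollector()
collector.feed(html)
collector.close()
print(collector.result["ingredients"])
```

With the snippet above, `ingredients` ends up as a list of all three entries rather than only the last one, while a property that appears once stays a plain string.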

An example page with recipe microdata can be found here: getmecooking.com

Is this behavior by design?

The official schema.org Recipe example allows ingredients to be declared multiple times in this way.

addnab commented 8 years ago

I assumed all lists would follow the ItemList schema. Looking at the example you've provided, I see that this assumption doesn't hold. This needs to be fixed. I'll review your PR and get back to you. Thanks.

addnab commented 8 years ago

Fixed in https://github.com/ind9/web-auto-extractor/pull/7