indix / web-auto-extractor

Automatically extracts structured information from webpages
MIT License

Parsing microdata strips spaces #21

Open · hkdobrev opened 6 years ago

hkdobrev commented 6 years ago

Given the following HTML:

<div itemscope itemtype="http://schema.org/Product"><h1 itemprop="name"><span>Foo</span> Bar</h1></div>

I would expect the library to extract a Product whose name is "Foo Bar", but it extracts "FooBar" instead, dropping the space between the text inside the span and the text that follows it.

Do you think this would be an easy fix?
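
To make the expectation concrete, here is a minimal Python sketch of the two behaviors. It is not web-auto-extractor's code and makes no claim about where the bug lives in the library. It only assumes that one plausible cause is trimming each text node before joining them. The names ItempropText, extract, and trim_each_node are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class ItempropText(HTMLParser):
    """Collects the text nodes of the first element with a given itemprop.

    Illustration only: void elements written without a closing slash
    (e.g. <br>) are not handled and would confuse the depth counter.
    """

    def __init__(self, prop):
        super().__init__()
        self.prop = prop
        self.depth = 0     # nesting depth inside the matched element
        self.chunks = []   # raw text nodes, whitespace untouched
        self.done = False  # set once the matched element has closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("itemprop") == self.prop:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html, prop, trim_each_node):
    """Return the property's text, joining text nodes in document order."""
    parser = ItempropText(prop)
    parser.feed(html)
    parser.close()
    if trim_each_node:
        # Assumed buggy variant: stripping every node loses the separator.
        return "".join(chunk.strip() for chunk in parser.chunks)
    # Expected variant: join the raw nodes, then trim only the outer edges.
    return "".join(parser.chunks).strip()


HTML = (
    '<div itemscope itemtype="http://schema.org/Product">'
    '<h1 itemprop="name"><span>Foo</span> Bar</h1></div>'
)

print(extract(HTML, "name", trim_each_node=True))
print(extract(HTML, "name", trim_each_node=False))
```

Joining the raw text nodes first and trimming only the result matches how a DOM textContent lookup would treat the h1, which is what the expectation above relies on.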

hkdobrev commented 6 years ago

@Vasanth-Indix @addnab Do you think the above is a valid expectation? Do you think you'd be able to address it or point me in the right direction? Thanks!

Vasanth-Indix commented 6 years ago

Yes @hkdobrev. It's a valid expectation. We will look into it.