scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

Extruct returns incorrectly formatted description property #113

Closed by jakubwasikowski 5 years ago

jakubwasikowski commented 5 years ago

It seems that extruct incorrectly extracts a microdata description property when its content includes HTML tags.

See the description below, extracted from https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens:

>>> import extruct
>>> import requests
>>> from w3lib.html import get_base_url
>>> r = requests.get('https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>> data['microdata'][0]['properties']['description']
"Johnsons 4 Fleas Cats & Kittens - 3 Treatment Pack, 6 Treatment PackFor use with Cats and Kittens over 4 weeks of age between 1 and 11kg.Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.Effects on the fleas may be seen as soon as 15 minutes after administration.Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings."

As can be seen, the formatting is wrong: for example, there is no space between "Pack" and "For", or between "11kg." and "Johnson's".

The problem is not in the content of the description property per se, since it looks correct in the page source:

<p><strong>Johnsons 4 Fleas Cats &amp; Kittens - 3 Treatment Pack, 6 Treatment Pack</strong></p>For use with Cats and Kittens over 4 weeks of age between 1 and 11kg.<br /><br />Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.<br /><br />Effects on the fleas may be seen as soon as 15 minutes after administration.<br /><br />Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.<br /><br />These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.<br /><br />You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.<br /><br />While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings.

The likely cause is the line https://github.com/scrapinghub/extruct/blob/de219cb809676ecdf59dfe8f127a198764be5d4a/extruct/w3cmicrodata.py#L185, where html-text should be used instead of ad-hoc text extraction.
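
To illustrate the suspected cause, here is a minimal stdlib-only sketch. It is not extruct's or html-text's actual code: `TextCollector`, `BREAKING_TAGS`, and `extract` are hypothetical names made up for this example, and the tag list is a small illustrative subset. It contrasts plain concatenation of text nodes with an extraction that inserts whitespace at tags such as `<p>` and `<br>`, using a shortened version of the description above.

```python
from html.parser import HTMLParser

# Tags that imply a visual break (illustrative subset, an assumption of this sketch).
BREAKING_TAGS = {"p", "br", "div", "li"}


class TextCollector(HTMLParser):
    """Collects text nodes, optionally adding a space at breaking tags."""

    def __init__(self, layout_aware):
        super().__init__(convert_charrefs=True)
        self.layout_aware = layout_aware
        self.chunks = []

    def _maybe_break(self, tag):
        if self.layout_aware and tag in BREAKING_TAGS:
            self.chunks.append(" ")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.chunks.append(data)


def extract(html, layout_aware):
    parser = TextCollector(layout_aware)
    parser.feed(html)
    parser.close()
    # Join the collected chunks and collapse runs of whitespace.
    return " ".join("".join(parser.chunks).split())


snippet = (
    "<p><strong>6 Treatment Pack</strong></p>"
    "For use with Cats and Kittens between 1 and 11kg.<br /><br />"
    "Johnson's 4fleas tablets are easy to use."
)

naive = extract(snippet, layout_aware=False)  # text nodes glued together
aware = extract(snippet, layout_aware=True)   # whitespace added at <p> and <br>
print(naive)
print(aware)
```

The naive variant reproduces the "PackFor" and "11kg.Johnson's" symptoms, while the layout-aware variant keeps the words apart. The html-text library exposes `html_text.extract_text(html)` for this kind of layout-aware extraction, which is why it is suggested here.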

lopuhin commented 5 years ago

Attached the HTML so that the issue can be reproduced later:

gh-113.html.zip

jakubwasikowski commented 5 years ago

The issue has been fixed in PR https://github.com/scrapinghub/extruct/pull/119. I'm closing it :tada: