scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

Some websites put meta tags outside the head. #192

Open paul-rchds opened 2 years ago

paul-rchds commented 2 years ago

On some pages, meta tags are placed outside the head tag, for example on this YouTube channel page: https://www.youtube.com/c/Freecodecamp

Because the opengraph extractor only looks in the head tag, all the og:* meta properties are missed. In my fork, I changed the extractor to look in the body instead.

If that's okay with the maintainers, I can open a PR.

Here is a link to where I made the change: https://github.com/scrapinghub/extruct/blob/c2cffbed26ae4ab8dd35d1860bfda00c3bac5783/extruct/opengraph.py#L28
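
To illustrate the behaviour described above, here is a minimal stand-alone sketch. It is not extruct's code: it uses only Python's standard-library `html.parser`, and the names `OgMetaCollector`, `collect_og`, and `SAMPLE` are hypothetical. The sample markup is made up for the example and is not the actual YouTube page. It compares what a head-only search finds with what a document-wide search finds when the og:* meta tags sit in the body.

```python
from html.parser import HTMLParser


class OgMetaCollector(HTMLParser):
    """Collect <meta property="og:*" content="..."> tags and note where they appear.

    Hypothetical helper for illustration only; not part of extruct.
    """

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_only = []  # what a search restricted to <head> would find
        self.anywhere = []   # what a document-wide search would find

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta":
            attr = dict(attrs)
            prop = attr.get("property") or ""
            content = attr.get("content")
            # Only keep Open Graph properties that carry a content value.
            if prop.startswith("og:") and content is not None:
                pair = (prop, content)
                self.anywhere.append(pair)
                if self.in_head:
                    self.head_only.append(pair)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def collect_og(html):
    """Return (head_only, anywhere) lists of (property, content) pairs."""
    parser = OgMetaCollector()
    parser.feed(html)
    parser.close()
    return parser.head_only, parser.anywhere


# Made-up markup: the og:* tags are in <body>, not <head>.
SAMPLE = """<html><head><title>Channel</title></head>
<body><meta property="og:title" content="freeCodeCamp.org">
<meta property="og:type" content="profile"></body></html>"""

head_only, anywhere = collect_og(SAMPLE)
print(head_only)  # empty: nothing is found when only <head> is searched
print(anywhere)   # both og:* pairs are found when the whole document is searched
```

A search that covers the whole document still picks up tags placed correctly in the head, so widening the search should not lose anything on well-formed pages.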

lopuhin commented 2 years ago

Hi @paul-rchds, yes, that would be great. I noticed the same issue myself but didn't get around to implementing everything required. Here is a link to my earlier PR: https://github.com/scrapinghub/extruct/pull/129/. Feel free to start a new one.

frostrot commented 2 years ago

I have changed the extract_item function in the OpengraphExtractor class to include meta tags outside of the head, and tested it on the link shared by @paul-rchds. Please review my PR and let me know whether it works for you. Thanks.