scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

Empty return on webpages #157

Closed ichitaka closed 4 years ago

ichitaka commented 4 years ago

I'm investigating why this tool returns no data for many websites. As an example, consider this URL: url = "https://elavegan.com/de/nudeln-mit-knoblauchsosse/" If I run the request and extract the structured data with the following code

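import extruct
import requests
from w3lib.html import get_base_url

response = requests.get(url, headers=headers)  # headers: custom request headers, not shown in this post
base_url = get_base_url(response.text, response.url)
schema_items = extruct.extract(response.text, base_url=base_url, uniform=True)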

I receive this empty response:

{'microdata': [],
 'json-ld': [],
 'opengraph': [],
 'microformat': [],
 'rdfa': []}

Google's structured data testing tool does find structured data on this page. response.text is not empty, and I can find the expected fields in it (for example 'recipeYield'). I have a number of URLs that behave this way and could not figure out why. No robots.txt is blocking me either.
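One way to make the "fields are present in response.text" check more precise is to look for JSON-LD script blocks directly, without going through extruct. The following is a minimal sketch using only the Python standard library; the names JsonLdCollector, find_jsonld and sample_html are hypothetical helpers introduced for illustration and are not part of extruct or of the original post. In practice, response.text would be passed in place of sample_html.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None, hence the `or ""`.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_jsonld(html):
    """Return the parsed content of each JSON-LD block found in the HTML string."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    parsed = []
    for raw in collector.blocks:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip malformed blocks so one bad block does not hide the others.
            continue
    return parsed


# Hypothetical stand-in for response.text.
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "Recipe", "recipeYield": "4"}</script>
</head><body></body></html>
"""
found = find_jsonld(sample_html)
print(found)
```

If this finds JSON-LD blocks in the same HTML for which extruct returns an empty 'json-ld' list, the problem is more likely in the local extraction setup than in the page or the request.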

lopuhin commented 4 years ago

@ichitaka could you please check the contents of response.text? I tried the URL you posted and it returns lots of semantic markup for me, including recipe info:

>>> import extruct
>>> import requests
>>> response = requests.get('https://elavegan.com/de/nudeln-mit-knoblauchsosse/')
>>> extruct.extract(response.text, uniform=True)
{'microdata': [],
 'json-ld': [{'@context': 'https://schema.org',
   '@graph': [{'@type': 'Organization',
     '@id': 'https://elavegan.com/de/#organization',
     'name': 'ElaVegan',
     'url': 'https://elavegan.com/de/',
     'sameAs': [],
     'logo': {'@type': 'ImageObject',
      '@id': 'https://elavegan.com/de/#logo',
      'inLanguage': 'de-DE',
      'url': 'https://elavegan.com/de/wp-content/uploads/sites/5/2019/09/new-logo-elavegan.png',
      'width': 550,
      'height': 236,
      'caption': 'ElaVegan'},
     'image': {'@id': 'https://elavegan.com/de/#logo'}},
....

ichitaka commented 4 years ago

Yes, response.text does contain the expected information. I have now narrowed the issue down: it does not occur in a fresh environment, which suggests some kind of dependency issue.
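To narrow down a dependency issue like this, it can help to record the installed package versions in both the fresh and the affected environment and compare them. The sketch below uses importlib.metadata from the standard library; the package names in SUSPECT_PACKAGES are an assumption about which dependencies might matter and should be adjusted to whatever is actually installed, and the version numbers in the usage example are made up for illustration.

```python
from importlib import metadata

# Assumed list of packages worth comparing; adjust to your own environments.
SUSPECT_PACKAGES = ["extruct", "lxml", "rdflib", "mf2py", "w3lib", "html5lib"]


def installed_versions(names):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(working, broken):
    """Return {name: (working_version, broken_version)} for every mismatch."""
    names = sorted(set(working) | set(broken))
    return {
        name: (working.get(name), broken.get(name))
        for name in names
        if working.get(name) != broken.get(name)
    }


# Run this in each environment and save the output for comparison.
print(installed_versions(SUSPECT_PACKAGES))

# Illustrative comparison with made-up version numbers.
example_diff = diff_versions(
    {"lxml": "4.5.0", "w3lib": "1.21.0"},
    {"lxml": "4.6.1", "w3lib": "1.21.0"},
)
print(example_diff)
```

Any package that shows up in the diff is a candidate to pin or downgrade in the affected environment while trying to reproduce the empty result.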

lopuhin commented 4 years ago

@ichitaka I see, thanks for double-checking. I'll close the issue for now; please reply if you find a way to reproduce it.