scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration. #185

Closed div927 closed 2 years ago

lopuhin commented 2 years ago

hi @div927 is this the same as https://github.com/scrapinghub/extruct/issues/142? If not, would you mind providing more details?

div927 commented 2 years ago

@lopuhin I'm just trying to run this:

from requests.models import encode_multipart_formdata
import extruct
import requests
from w3lib.html import get_base_url

r = requests.get("https://www.msn.com/en-in/entertainment/entertainmenttopstories/movie-review-shershaah-vikram-ka-parakram/ar-AANegUL")
base_url = get_base_url(r.text, r.url)
data = extruct.extract(r.text, base_url=base_url)
print(data)

div927 commented 2 years ago

Also, some URLs return [], for example https://www.hindustantimes.com/entertainment/bollywood/shershaah-movie-review-sincere-sidharth-malhotra-plays-vikram-batra-with-saintly-swagger-in-amazon-s-simplistic-war-drama-101628692350461.html

lopuhin commented 2 years ago

@div927 aha then it's the same issue, let me close this.

Btw it works if you pass response text as bytes instead of a string, like this:

>>> extruct.extract(r.text.encode('utf8'), base_url=base_url)
{'microdata': [],
 'json-ld': [],
 'opengraph': [],
 'microformat': [],
 'rdfa': [{'@id': '_:N906dc123522045f9bf1250eaf4a6e030',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
  {'@id': '_:N429280cc64924648997720e86e6b9db6',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menu'}]},
  {'@id': '_:Nff4090c4f212493fb348f3eadb31c3b7',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
  {'@id': '_:N0fce1316b0f7468ba2c8eeb68428064d',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menu'}]},
  {'@id': '_:N508f4a3b9b67426cb11cefa751f0805d',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
  {'@id': '_:N474461aefb6d48998809c25f5a6bddc9',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
  {'@id': '_:N26bf5407720f443b989ed0d84a436d31',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
  {'@id': '_:N05c8d4ba56994e95a4999feb8f384d2a',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
  {'@id': '_:N6c269f2fe9df4a6cb98fb08dff9097dd',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
  {'@id': '_:N0ebe6af113d44466bf8c4a6e158e71a5',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
  {'@id': '_:N71530e656afa4301a58a314a1a856044',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
  {'@id': '_:Nf23b9551e7954ad488e7ca5e38e17c7d',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
  {'@id': '_:N280b2d436e4f4ddfb6fb3b42784fe66e',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#slider'}]},
  {'@id': '_:N469d8d6d63a24411ab9e10d26a6a85e1',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
  {'@id': '_:N2a321dcc1abe458299b5cd44c40954de',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
  {'@id': '_:N8c9fd428b70645be9d64a8a46231c965',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
  {'@id': '_:N71676975f62c43af86c3110f9d2b1520',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
  {'@id': '_:N79fd07b83ed948b49a4565b5c914e8d9',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]}]}

But this is something we need to fix. Let me close this ticket though, and let's continue in #142.
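A minimal sketch of the workaround above, wrapped in a helper so that str input is always converted to bytes before it reaches extruct. The names `ensure_bytes` and `extract_from_response` are hypothetical and not part of extruct; UTF-8 is assumed as the encoding, matching the snippet in this comment.

```python
def ensure_bytes(markup, encoding="utf-8"):
    """Return markup as bytes; str input is encoded, bytes pass through."""
    if isinstance(markup, bytes):
        return markup
    return markup.encode(encoding)


def extract_from_response(response):
    """Run extruct on a requests.Response, passing bytes instead of str.

    Hypothetical helper: it applies the same workaround as
    extruct.extract(r.text.encode('utf8'), base_url=base_url).
    """
    # Imported lazily so ensure_bytes can be used without these packages.
    import extruct
    from w3lib.html import get_base_url

    base_url = get_base_url(response.text, response.url)
    return extruct.extract(ensure_bytes(response.text), base_url=base_url)
```

With the code from the report, this would be called as `extract_from_response(r)` in place of `extruct.extract(r.text, base_url=base_url)`.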

lopuhin commented 2 years ago

And also some url return []

That means we didn't detect any semantic markup: not all pages have markup, and we do not support all kinds of it. If you think there is some specific markup which we missed, please open a separate ticket.
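To see quickly which syntaxes actually yielded something, the result can be filtered for non-empty lists. This is only a sketch: `non_empty_syntaxes` is a hypothetical helper, and the example dict merely mimics the shape of the output shown earlier in this thread, with a made-up placeholder item.

```python
def non_empty_syntaxes(data):
    """Return the names of the syntaxes for which extraction found items."""
    return sorted(name for name, items in data.items() if items)


# Shaped like the extruct output earlier in this thread; the single
# rdfa item is a placeholder, not real extracted data.
example = {
    "microdata": [],
    "json-ld": [],
    "opengraph": [],
    "microformat": [],
    "rdfa": [{"@id": "_:example"}],
}
print(non_empty_syntaxes(example))  # prints ['rdfa']
```

An empty list from this helper corresponds to the "no semantic markup detected" case described above.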

lopuhin commented 2 years ago

Closing as a duplicate of #142