Closed div927 closed 2 years ago
@lopuhin I'm just trying to run this:
from requests.models import encode_multipart_formdata
import extruct
import requests
from w3lib.html import get_base_url
r = requests.get("https://www.msn.com/en-in/entertainment/entertainmenttopstories/movie-review-shershaah-vikram-ka-parakram/ar-AANegUL")
base_url = get_base_url(r.text, r.url)
data = extruct.extract(r.text, base_url=base_url)
print(data)
And also some URLs return [],
for example when I use this URL: https://www.hindustantimes.com/entertainment/bollywood/shershaah-movie-review-sincere-sidharth-malhotra-plays-vikram-batra-with-saintly-swagger-in-amazon-s-simplistic-war-drama-101628692350461.html
@div927 aha then it's the same issue, let me close this.
Btw it works if you pass response text as bytes instead of a string, like this:
>>> extruct.extract(r.text.encode('utf8'), base_url=base_url)
{'microdata': [],
'json-ld': [],
'opengraph': [],
'microformat': [],
'rdfa': [{'@id': '_:N906dc123522045f9bf1250eaf4a6e030',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
{'@id': '_:N429280cc64924648997720e86e6b9db6',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menu'}]},
{'@id': '_:Nff4090c4f212493fb348f3eadb31c3b7',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
{'@id': '_:N0fce1316b0f7468ba2c8eeb68428064d',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menu'}]},
{'@id': '_:N508f4a3b9b67426cb11cefa751f0805d',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
{'@id': '_:N474461aefb6d48998809c25f5a6bddc9',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
{'@id': '_:N26bf5407720f443b989ed0d84a436d31',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
{'@id': '_:N05c8d4ba56994e95a4999feb8f384d2a',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
{'@id': '_:N6c269f2fe9df4a6cb98fb08dff9097dd',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
{'@id': '_:N0ebe6af113d44466bf8c4a6e158e71a5',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
{'@id': '_:N71530e656afa4301a58a314a1a856044',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
{'@id': '_:Nf23b9551e7954ad488e7ca5e38e17c7d',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
{'@id': '_:N280b2d436e4f4ddfb6fb3b42784fe66e',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#slider'}]},
{'@id': '_:N469d8d6d63a24411ab9e10d26a6a85e1',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
{'@id': '_:N2a321dcc1abe458299b5cd44c40954de',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
{'@id': '_:N8c9fd428b70645be9d64a8a46231c965',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#menuitem'}]},
{'@id': '_:N71676975f62c43af86c3110f9d2b1520',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]},
{'@id': '_:N79fd07b83ed948b49a4565b5c914e8d9',
'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]}]}
But this is something we need to fix. Let me close this ticket, though, and let's continue in #142.
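For anyone landing here before #142 is resolved, the workaround above can be wrapped in a small helper. This is only a sketch: `ensure_bytes` and `extract_metadata` are hypothetical names that are not part of extruct, and the `extractor` parameter exists purely so the helper can be exercised without extruct installed.

```python
def ensure_bytes(html, encoding="utf-8"):
    """Return html as bytes, encoding it first if it is a str."""
    if isinstance(html, str):
        return html.encode(encoding)
    return html


def extract_metadata(html, base_url, extractor=None):
    """Call the extractor with bytes input, as in the workaround above.

    `extractor` defaults to extruct.extract; it is a parameter only so
    that a stand-in can be passed for testing (an assumption of this sketch).
    """
    if extractor is None:
        import extruct  # imported lazily, only when actually needed
        extractor = extruct.extract
    return extractor(ensure_bytes(html), base_url=base_url)
```

With requests, `r.content` is already bytes, so `extract_metadata(r.content, base_url)` should behave like the `r.text.encode('utf8')` call above without the extra decode and encode round trip.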
> And also some URLs return []
That means we didn't detect any semantic markup: not all pages have markup, and we don't support every kind of it. If you think there is some specific markup that we missed, please open a separate ticket.
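To see quickly which syntaxes produced anything for a given page, the result dict can be filtered. A minimal sketch, assuming the result has the shape shown earlier in this thread (syntax name mapped to a list of items); `non_empty_syntaxes` is a hypothetical helper, not an extruct function.

```python
def non_empty_syntaxes(data):
    """Return the names of syntaxes whose extracted item list is not empty."""
    return [name for name, items in data.items() if items]


# Shape taken from the output above, with the rdfa items abbreviated.
sample = {
    "microdata": [],
    "json-ld": [],
    "opengraph": [],
    "microformat": [],
    "rdfa": [{"@id": "_:N906dc123522045f9bf1250eaf4a6e030"}],
}
print(non_empty_syntaxes(sample))
```

An empty list here means no supported markup was detected on the page, which is the situation described above.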
Closing as a duplicate of #142
Hi @div927, is this the same as https://github.com/scrapinghub/extruct/issues/142? If not, would you mind providing more details?