opendatalab / magic-html

Apache License 2.0

WeChat article extraction raises an error #6

Closed: naah69 closed this issue 3 months ago

naah69 commented 4 months ago
image
from magic_html import GeneralExtractor
import requests

# Initialize the extractor
extractor = GeneralExtractor()

url = 'https://mp.weixin.qq.com/s?__biz=MzI2MzEwNTY3OQ==&mid=2648988210&idx=1&sn=cb5dd15280d63143d0ce706a713026c2&chksm=f34fab2e3a5a3a6d92f8dd43afbdbc0c29737f973967b5f059f10f2a129da293fd3b4350f7ed&scene=27'
resp = requests.get(url)
html = resp.text

# Extract data from article-type HTML
# data = extractor.extract(html, base_url='https://www.baidu.com')

# Extract data from forum-type HTML
# data = extractor.extract(html, base_url=url, html_type="forum")
# Extract data from WeChat-article HTML
data = extractor.extract(html, base_url="https://mp.weixin.qq.com", html_type="weixin")

print(data)
sixgad commented 4 months ago

@naah69 Requesting an official-account article directly with requests.get(url) does not return the article data; the response is a verification page.
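
One way to confirm this is to inspect the response before handing it to the extractor. The sketch below is illustrative only: `looks_like_wechat_article` is a hypothetical helper, and the `js_content` marker is an assumption about how WeChat article pages are usually structured, not something magic-html is known to check.

```python
def looks_like_wechat_article(html: str, marker: str = "js_content") -> bool:
    """Return True if the HTML contains the marker expected in an article body.

    Assumption: real WeChat article pages include an element whose id is
    "js_content"; a verification page would not.
    """
    return marker in html


# Two tiny stand-in documents (made up for illustration).
article_html = '<html><body><div id="js_content"><p>Hello</p></div></body></html>'
verify_html = "<html><body><p>Please complete verification</p></body></html>"

print(looks_like_wechat_article(article_html))  # True
print(looks_like_wechat_article(verify_html))   # False
```

Running the same check on `resp.text` before calling `extractor.extract(...)` shows whether the request returned the article or something else.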

naah69 commented 4 months ago

I tried a Juejin article as well and it doesn't work either; no content is extracted.

from magic_html import GeneralExtractor
import requests

# Initialize the extractor
extractor = GeneralExtractor()

url = 'https://juejin.cn/post/7304867278566899764?utm_source=gold_browser_extension'
resp = requests.get(url)
html = resp.text

# Extract data from article-type HTML
data = extractor.extract(html, base_url='https://juejin.cn')

print(data)
{'xp_num': 'others', 'drop_list': False, 'html': '<html></html>', 'title': None, 'base_url': 'https://juejin.cn'}
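
A result like this can be detected in code before it is used further. A minimal sketch, assuming only the dictionary shape printed above; `is_empty_extraction` is a hypothetical helper, not part of magic-html.

```python
import re


def is_empty_extraction(data: dict) -> bool:
    """Return True if the extracted 'html' field holds no visible text."""
    html = data.get("html") or ""
    # Drop tags, then see whether any non-whitespace text is left.
    text = re.sub(r"<[^>]+>", "", html)
    return text.strip() == ""


result = {
    "xp_num": "others",
    "drop_list": False,
    "html": "<html></html>",
    "title": None,
    "base_url": "https://juejin.cn",
}
print(is_empty_extraction(result))  # True
```
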
naah69 commented 4 months ago

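> @naah69 Requesting an official-account article directly with requests.get(url) does not return the article data; the response is a verification page.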

I debugged it and found that the content can in fact be retrieved.

image

image

sixgad commented 4 months ago

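@naah69 As you said, "I debugged it and found that the page response content can be retrieved". In that case, is the extraction result correct? It works fine in my tests. image image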

naah69 commented 4 months ago

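> @naah69 As you said, "I debugged it and found that the page response content can be retrieved". In that case, is the extraction result correct? It works fine in my tests. image image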

On my side, WeChat articles raise an error, and nothing can be extracted from the other platforms.

My environment is as follows:
os: macOS 12.5 (arm)
python: 3.8.18
magic_html: 0.1.2

naah69 commented 3 months ago

@sixgad Could you tell me what your environment is and how you installed the package?

naah69 commented 3 months ago

I debugged it and found the problem: my local lxml package was too old. I was on 4.7.1; after upgrading to 5.1.1 it works.
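
For anyone hitting the same symptom, a version check along these lines can rule lxml in or out quickly. This is a sketch under stated assumptions: 5.1.1 is only the version reported to work in this thread, not a documented minimum for magic-html, and `version_tuple`, `installed_version`, and `is_older` are hypothetical helpers.

```python
import re
from importlib import metadata  # standard library since Python 3.8

# Assumption: taken from this thread, not from any documented requirement.
KNOWN_GOOD_LXML = "5.1.1"


def version_tuple(version: str) -> tuple:
    """Turn '5.1.1' into (5, 1, 1).

    Keeps the leading digits of each dot-separated component and stops at
    the first component that has none.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_older(current: str, reference: str) -> bool:
    """Return True if `current` sorts before `reference` numerically."""
    return version_tuple(current) < version_tuple(reference)


if __name__ == "__main__":
    current = installed_version("lxml")
    if current is None:
        print("lxml is not installed")
    elif is_older(current, KNOWN_GOOD_LXML):
        print(f"lxml {current} is older than {KNOWN_GOOD_LXML}; try upgrading")
    else:
        print(f"lxml {current} is at least {KNOWN_GOOD_LXML}")
```

Comparing integer tuples rather than raw strings avoids mistakes such as "4.10.0" sorting before "4.7.1".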