obaby / mht-image-extractor

Image extraction tool for MHT files

Cannot convert MHTML files generated by QQ Browser and Chrome #1

Closed masx200 closed 4 years ago

masx200 commented 4 years ago

你知道发起一次DDOS攻击需要多少费用吗? - 知乎.mht.zip

汉服-彼岸花开.mhtml.zip

With the file saved by QQ Browser, nothing is extracted at all:


 python /storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/baby_mht_image_extractor.py -f "/storage/emulated/0/我的文档/你知道发起一次DDOS攻击需要多少费用吗? - 知乎.mht"  -o /storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/out
****************************************************************************************************
[S] 开始任务......
[C] 输入文件:/storage/emulated/0/我的文档/你知道发起一次DDOS攻击需要多少费用吗? - 知乎.mht
[C] 输入目录:
[C] 输出目录:/storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/out
[D] 导出全部完成。
****************************************************************************************************

With the file saved by Chrome, the script fails immediately with an error:

python /storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/baby_mht_image_extractor.py -f "/storage/emulated/0/我的网页保存/公众号/汉服-彼岸花开.mhtml"  -o /storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/out
****************************************************************************************************
[S] 开始任务......
[C] 输入文件:/storage/emulated/0/我的网页保存/公众号/汉服-彼岸花开.mhtml
[C] 输入目录:
[C] 输出目录:/storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/out
Traceback (most recent call last):
  File "/storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/baby_mht_image_extractor.py", line 190, in <module>
    main(sys.argv[1:])
  File "/storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/baby_mht_image_extractor.py", line 173, in main
    save_mht_all_images(input_file)
  File "/storage/emulated/0/Download/masx200-mht-image-extractor-master/mht-image-extractor/baby_mht_image_extractor.py", line 129, in save_mht_all_images
    body_content = f.read()
  File "/data/data/com.termux/files/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 863095: invalid start byte

Both files open and display correctly in a browser.
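
The traceback above is raised while the whole file is read in text mode and decoded as UTF-8. That presumably breaks as soon as a part contains raw image bytes. A minimal sketch of one way around it, assuming the file is opened in binary mode and handed to the standard-library `email` parser; `load_mhtml` is a hypothetical helper name, not a function from this repository:

```python
import email
from email.message import Message


def load_mhtml(path: str) -> Message:
    """Parse an MHT/MHTML file without decoding it as text first."""
    # Open in binary mode: a part saved with
    # "Content-Transfer-Encoding: binary" holds raw image bytes,
    # which are generally not valid UTF-8.
    with open(path, "rb") as f:
        raw = f.read()
    # message_from_bytes tolerates non-ASCII bytes in the body.
    return email.message_from_bytes(raw)
```

Called as `load_mhtml("page.mhtml")`, this returns a multipart message whose parts can then be inspected one by one.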

masx200 commented 4 years ago

The page saved by QQ Browser looks like this.

An image part looks like this:

Content-Type: image/jpeg
Content-Transfer-Encoding: base64
Content-Location: https://pic3.zhimg.com/v2-f93341625ac2b5147b60e57f6999660d_s.jpg
------MultipartBoundary--VjK26H6J1hen3mSUiigyebg9rwgfVt3ww0WPr7Q2V5------
masx200 commented 4 years ago

The page saved by Chrome looks like this.

An image part looks like this:


------MultipartBoundary--Bx5ubV1DnfL8hvvsySfZL6MQeLa58tWkfwrQGpothO----
Content-Type: image/bmp
Content-Transfer-Encoding: binary
Content-Location: https://mp.weixin.qq.com/mp/qrcode?scene=10000004&size=102&__biz=MzU1NzQ3MTg5OQ==&mid=2247483652&idx=1&sn=a16979f8b088cb60fb63f210536d5288&send_time=
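
The two samples differ in `Content-Transfer-Encoding`: `base64` in the QQ Browser file, `binary` in the Chrome file. As a rough sketch of handling both in one pass, assuming the standard-library `email` package is used rather than the project's own parser, the image parts could be collected like this (`iter_images` is a hypothetical name):

```python
import email
from typing import Iterator, Tuple


def iter_images(raw: bytes) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (content_type, content_location, data) for each image part."""
    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # decode=True undoes base64 and quoted-printable; for "binary"
        # the payload bytes are returned unchanged.
        data = part.get_payload(decode=True)
        location = str(part.get("Content-Location", ""))
        yield part.get_content_type(), location, data
```

Each yielded `data` value can be written straight to a file opened with mode `"wb"`.
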
masx200 commented 4 years ago

The pages saved by both browsers start like this:

From: <Saved by Blink>
Snapshot-Content-Location: https://zhuanlan.zhihu.com/p/83130377
Subject: =?utf-8?Q?=E4=BD=A0=E7=9F=A5=E9=81=93=E5=8F=91=E8=B5=B7=E4=B8=80=E6=AC=A1?=
 =?utf-8?Q?DDOS=E6=94=BB=E5=87=BB=E9=9C=80=E8=A6=81=E5=A4=9A=E5=B0=91=E8?=
 =?utf-8?Q?=B4=B9=E7=94=A8=E5=90=97=EF=BC=9F=20-=20=E7=9F=A5=E4=B9=8E?=
Date: Sun, 19 Sep 2020 23:57:55 -0000
MIME-Version: 1.0
Content-Type: multipart/related;
    type="text/html";
    boundary="----MultipartBoundary--VjK26H6J1hen3mSUiigyebg9rwgfVt3ww0WPr7Q2V5----"

From: <Saved by Blink>
Snapshot-Content-Location: https://mp.weixin.qq.com/s?__biz=MzU1NzQ3MTg5OQ==&mid=2247483652&idx=1&sn=a16979f8b088cb60fb63f210536d5288&chksm=fc3400f0cb4389e698a5a3ce1bf6a6ab3ff6f547bb4db409893850b0c502053d1fea40f70fda&sessionid=0&scene=126&subscene=0&clicktime=1599463540&enterid=1599463540&ascene=3&devicetype=android-28&version=27001237&nettype=ctnet&abtest_cookie=AAACAA%3D%3D&lang=zh_CN&exportkey=AUPVIV8Yt1hvPJ2dYKFWhvM%3D&pass_ticket=eTzcuEu%2BGavsf30E3HDErOhtb18ThPDhge008pIBzY7AFq0IuG1LUgojTpufwqUZ&wx_header=1
Subject: =?utf-8?Q?=E6=B1=89=E6=9C=8D=E4=B8=A8=E5=BD=BC=E5=B2=B8=E8=8A=B1=E5=BC=80?=
Date: Sun, 20 Sep 2020 00:50:44 -0000
MIME-Version: 1.0
Content-Type: multipart/related;
    type="text/html";
    boundary="----MultipartBoundary--Bx5ubV1DnfL8hvvsySfZL6MQeLa58tWkfwrQGpothO----"
masx200 commented 4 years ago

@obaby

obaby commented 4 years ago

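It actually has little to do with the file header. If you open the Chrome-saved file in IE, you will find that the images do not display there either. The two file formats differ. I have already started working on it, and support should land soon.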

masx200 commented 4 years ago

I pointed at the file header because the "boundary" has to be obtained differently here: the "Content-Type" header is split across several lines.
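
For illustration, a header folded over several lines like the ones quoted earlier can be unfolded by the standard-library `email` parser, so the boundary need not be matched on a single line. This is only a sketch under that assumption; `find_boundary` is a hypothetical helper, not the project's code:

```python
import email
from typing import Optional


def find_boundary(raw: bytes) -> Optional[str]:
    """Return the multipart boundary, even if Content-Type is folded."""
    msg = email.message_from_bytes(raw)
    # get_boundary() reads the "boundary" parameter from the unfolded
    # Content-Type header and strips the surrounding quotes.
    return msg.get_boundary()


header = (
    b"From: <Saved by Blink>\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: multipart/related;\r\n"
    b'\ttype="text/html";\r\n'
    b'\tboundary="----MultipartBoundary--VjK26H6J1hen3mSUiigyebg9rwgfVt3ww0WPr7Q2V5----"\r\n'
    b"\r\n"
)

boundary = find_boundary(header)
# Inside the body, each part starts with "--" followed by the boundary.
delimiter = "--" + boundary
```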

obaby commented 4 years ago

The new version now handles these files; pull the latest code and test it. For the reason parsing failed, see my blog post: https://wp.me/pbtmY7-1Yo