jshemas / openGraphScraper

Node.js scraper service for Open Graph Info and More!
MIT License

Some url should not be parsed #135

Closed: YCKang closed this issue 2 years ago

YCKang commented 2 years ago

Hi, @jshemas.

Currently, this module checks the URL with utils.isThisANonHTMLUrl(options.url) before requesting it. But some URLs are file links whose extension is not in the invalidImageTypes array, or that have no extension at all. In those cases the module can cause high CPU usage because it parses a non-HTML link (which is actually a file).
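To illustrate why an extension-only pre-check can miss file links, here is a minimal sketch. It is not the library's implementation: `looksLikeNonHtmlUrl` and the `blockedExtensions` list are hypothetical names and values chosen for the example.

```javascript
// Hypothetical extension-based pre-check (illustrative, not the library's code).
// The extension list is an assumption made for this example.
const blockedExtensions = ['png', 'jpg', 'jpeg', 'gif', 'pdf', 'zip', 'mp4'];

function looksLikeNonHtmlUrl(url) {
  // Look only at the last path segment, ignoring query string and fragment.
  const { pathname } = new URL(url);
  const lastSegment = pathname.split('/').pop();
  const dotIndex = lastSegment.lastIndexOf('.');
  // No extension at all: the check has nothing to go on and lets the URL through.
  if (dotIndex === -1) return false;
  const extension = lastSegment.slice(dotIndex + 1).toLowerCase();
  return blockedExtensions.includes(extension);
}
```

With this sketch, a link such as `https://example.com/download/12345` passes the check even if the server responds with a large binary file, and so does any extension that is not on the list.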

In some rare cases the content-type may be missing from the response header (see #45; 'https://www.namecheap.com/' now includes content-type in its response header). Still, I think misjudging a non-HTML link happens more often than an HTML link arriving without a content-type. Maybe you should add the check back?

P.S. I found another tool that only accepts 'text/html' and 'application/xhtml+xml': https://github.com/niallkennedy/open-graph-protocol-tools/blob/ac1f238f52088be9fb220df0dd9ef3b2fb452b82/open-graph-protocol.php#L548
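A content-type allowlist along the lines of the postscript above could be sketched roughly as follows. This is an assumption-laden illustration, not the code of either project: `isHtmlContentType` and `allowedContentTypes` are hypothetical names, and rejecting a missing header is just one possible policy.

```javascript
// Hypothetical helper; the allowlist holds the two media types named above.
const allowedContentTypes = ['text/html', 'application/xhtml+xml'];

function isHtmlContentType(headerValue) {
  // Assumption: a missing header counts as "not HTML". A client could
  // instead let such responses through to cover the rare case in #45.
  if (typeof headerValue !== 'string') return false;
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mediaType = headerValue.split(';')[0].trim().toLowerCase();
  return allowedContentTypes.includes(mediaType);
}
```

A caller would pass in the response's `content-type` header value and skip parsing when the function returns `false`. Whether to reject or accept responses with no content-type is the trade-off raised in this issue.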

thx.

jshemas commented 2 years ago

Hello. I added a content-type check in open-graph-scraper@4.10.0. Let me know if that fixes your problem.

Code -> https://github.com/jshemas/openGraphScraper/blob/master/lib/request.js#L21-L23

YCKang commented 2 years ago

Thank you, it fixes the problem. And the "downloadLimit" feature is also awesome!