jbsparrow / CyberDropDownloader

Bulk Gallery Downloader for Cyberdrop.me and Other Sites
GNU General Public License v3.0

[BUG] scrape mistakes xxxbunker for xbunker #174

Closed. baccccccc closed this issue 1 day ago

baccccccc commented 5 days ago

this is probably the dumbest bug title ever, but nevertheless it's a thing.

here's an example URL that I encountered when scraping a forum (NSFW):

https://xxxbunker.com/4033557

this site is probably unsupported by CDL and hence this URL should be logged to Unsupported_URLs.txt.

however, it looks like CDL mistakes this domain for xbunker.nu and tries to apply xbunker_crawler.py there. Which, understandably, fails.

INFO     : 2024-10-18 23:43:38,942 : utilities.py:114 : Scrape Starting: https://xxxbunker.com/4033557
ERROR    : 2024-10-18 23:43:41,890 : utilities.py:114 : Scrape Failed: https://xxxbunker.com/4033557 ('NoneType' object has no attribute 'find_all')
ERROR    : 2024-10-18 23:43:41,891 : utilities.py:114 : Traceback (most recent call last):
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\utils\utilities.py", line 72, in wrapper
    return await func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\cyberdrop_dl\scraper\crawlers\xbunker_crawler.py", line 95, in forum
    for elem in title_block.find_all(self.title_trash_selector):
                ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'find_all'

INFO     : 2024-10-18 23:43:41,893 : utilities.py:114 : Scrape Finished: https://xxxbunker.com/4033557

jbsparrow commented 1 day ago

This is a potential issue with pretty much all domains because of how we match URLs :/

We just match a portion of the hostname (e.g. coomer, simpcity, bunkr), so that's why it detected xxxbunker as xbunker. I do plan on fixing it at some point because I see it as a potential security vulnerability. Somebody could mimic another site and could upload malicious zips or other files that get downloaded because CDL mistook the website for another.
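
For illustration, here is a minimal sketch of the failure mode, assuming the matching boils down to a plain substring test on the hostname. `SUPPORTED` and `match_by_substring` are hypothetical names for this example and are not taken from the CDL source.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical list of site keys for illustration; not CDL's actual mapping.
SUPPORTED = ["xbunker", "bunkr", "coomer", "simpcity"]


def match_by_substring(url: str) -> Optional[str]:
    """Return the first site key that appears anywhere in the hostname."""
    host = urlparse(url).hostname or ""
    for key in SUPPORTED:
        # A substring test cannot tell "xbunker" apart from "xxxbunker".
        if key in host:
            return key
    return None
```

With this kind of check, `https://xxxbunker.com/4033557` is routed to the `xbunker` key because `"xbunker"` is a substring of `"xxxbunker.com"`.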

baccccccc commented 1 day ago

> Somebody could mimic another site and could upload malicious zips or other files that get downloaded because CDL mistook the website for another.

why would they not upload the same malicious zip to a legit supported site such as bunkr?

I think the only scenario where this becomes a security issue is if someone manages to trick CDL into executing malicious code. Say there's a way to craft the web page source so that it exploits some wicked vulnerability in the CDL parser, so that parsing the page triggers code execution. You probably cannot do that on the "real" bunkr, but it might work if you set up a "fake" bunkr. That would be very bad indeed, but the likelihood is extremely low IMO.

jbsparrow commented 1 day ago

Yeah, I don't think there are any big issues like code execution. Someone could definitely upload a malicious file to the real website, but the real websites have moderation and people who report bad content.

Someone could also upload illegal content or just spam content.

The likelihood of someone doing any of this is extremely low but it is also just a flaw with how we check website hosts. The issue with fixing it is that a lot of these websites need new domains fairly often, so we would need to stay up to date on all the active hostnames and TLDs. I do have some ideas that I want to look into at some point but it's not a very high priority because there's not a large vulnerability here.
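
One possible direction, sketched here only as an assumption and not as the planned fix: compare the label immediately before the TLD for exact equality instead of searching the whole hostname. That tolerates TLD changes without keeping a list of active domains. `SUPPORTED` and `match_by_label` are hypothetical names.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical set of site keys for illustration; not CDL's actual mapping.
SUPPORTED = {"xbunker", "bunkr", "coomer", "simpcity"}


def match_by_label(url: str) -> Optional[str]:
    """Match the label just before the TLD exactly, ignoring the TLD itself."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2:
        return None  # e.g. "localhost" or an empty hostname
    name = labels[-2]  # "xbunker" in "xbunker.nu", "xxxbunker" in "xxxbunker.com"
    return name if name in SUPPORTED else None
```

This rejects `xxxbunker.com` while still accepting `xbunker.nu` under any TLD. It does not handle multi-part suffixes such as `.co.uk`, and it does not stop someone from registering the same name under a different TLD, so it narrows the problem without closing it.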