mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Pixiv announces new countermeasures against crawlers #4072

Closed: pink-red closed this issue 10 months ago

pink-red commented 1 year ago

https://www.pixiv.net/info.php?id=9541

② Software being used for unauthorized aggregation of a creator's work

https://inside.pixiv.blog/2023/05/17/102629

> In addition to the countermeasures introduced above, pixiv Inc. is currently working on new measures for preventing large-scale data acquisition for malicious purposes and other forms of malicious activity on our platforms.

Just wanted to share this, since this could eventually result in accounts being banned when using gallery-dl.

Always use throwaway accounts!

9696neko commented 1 year ago

From the two posts I could not discern whether non-malicious scraping is allowed. Maybe robots.txt can explain more...
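For anyone who wants to check the robots.txt angle, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration; this is not pixiv's actual file. Keep in mind that robots.txt only expresses crawler preferences and says nothing about what the terms of service permit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT pixiv's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rules are matched in file order, so /private/ is checked before the catch-all Allow.
allowed_public = is_allowed(SAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/artworks/1")
allowed_private = is_allowed(SAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/private/1")
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the file over the network.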

Use of Cloudflare is kind of scary because it runs on so many properties on the internet that it could easily fingerprint users.

Pixiv says it already sees high false-positive rates in some areas, which forces it to hold back on some measures, but machine learning may yield better-targeted results.

I would argue you are more likely to have trouble with low reputation accounts, like throwaways!

pink-red commented 1 year ago

> whether non-malicious scraping is allowed

https://policies.pixiv.net/en.html

  1. Other prohibited acts
    1. Collection of information using crawlers and other such programs;

And I would be really surprised if it wasn't prohibited. Crawling is prohibited on most websites, and the remaining ones usually just don't care instead of explicitly allowing it. If a website wants you to interact with it programmatically, it will provide an official API and documentation.

> Use of Cloudflare is kind of scary

Pixiv says that it's already used, so not that scary.

> I would argue you are more likely to have trouble with low reputation accounts, like throwaways!

Having trouble with a throwaway account is better than unexpectedly getting your main account banned.

In any case, as with any announcement, take it with a grain of salt. We don't know yet how well these measures will actually work or what could be done about them. I would say: be cautious and don't use your main account; we'll see what happens next.

ClosedPort22 commented 1 year ago

Both the web interface and the reverse-engineered mobile API (which is what gallery-dl uses) are using Cloudflare's bot management solutions. Interestingly, I've never seen users report that they get Cloudflare challenges when using gallery-dl, but this seems to happen from time to time for projects that are using headless browsers to access the endpoints (e.g. https://github.com/upbit/pixivpy/issues/259).
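If you want to tell a Cloudflare challenge apart from an ordinary error in your own logs, a rough sketch could look like the following. It relies on the `cf-mitigated: challenge` response header that Cloudflare documents for challenge pages; the fallback on status code plus `Server: cloudflare` is only my heuristic and can misfire on plain blocks. The function name is hypothetical and nothing here is taken from gallery-dl or pixivpy.

```python
from typing import Mapping


def looks_like_cloudflare_challenge(status: int, headers: Mapping[str, str]) -> bool:
    """Guess whether an HTTP response is a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalize keys and values first.
    normalized = {key.lower(): value.lower() for key, value in headers.items()}

    # Documented signal: challenge responses carry "cf-mitigated: challenge".
    if normalized.get("cf-mitigated") == "challenge":
        return True

    # Fallback heuristic (assumption): a 403/503 served by Cloudflare.
    # This can also match ordinary blocks or outages, so treat it as a hint.
    return status in (403, 503) and normalized.get("server") == "cloudflare"


challenge = looks_like_cloudflare_challenge(403, {"CF-Mitigated": "challenge"})
normal = looks_like_cloudflare_challenge(200, {"Server": "cloudflare"})
```

The function takes a plain status code and header mapping, so it works with whatever HTTP client produced the response.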