kiwix / operations

Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

Add new workers to youtube operations #204

Closed benoit74 closed 5 days ago

benoit74 commented 2 weeks ago

We have a bunch of workers (blade, razor, badger1) which are quite stable in the pool but not yet able to run youtube recipes.

We should probably whitelist their IPs on Youtube and permit them to run youtube recipes, especially since we have a huge re-encoding pipeline ahead of us with the VP8 v2 / VP9 rollout.

We should check whether other workers are interested in participating (michaelblob? SpreaderOfKnowledge?)

benoit74 commented 5 days ago

I've whitelisted razor, blade, badger1 (existing new workers) and badger2 and badger3 (workers coming in the next days) on all API keys.

I've also whitelisted pixelmemory on a few keys where its IP address was missing. Looks like we fucked something up here; I don't remember having added pixelmemory.

I've added an IP addresses spreadsheet at https://docs.google.com/spreadsheets/d/1Xa1a54k7po0wtAJQw78YJ06xOY4nHF6z6ekFZP0YXW8/edit?usp=sharing to list all API keys and whitelisted IP addresses.

Many IP addresses whitelisted are unknown to me AND not present in the zimfarm database of workers.

@rgaudin do these IP addresses ring a bell? Otherwise I suggest removing all unknown IPs: there is no reason to keep things which are not used, they are only going to cause us trouble.
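
The audit described above (whitelisted IPs that match no known worker, and workers missing from some keys) boils down to two set differences. A minimal sketch, assuming the spreadsheet and the zimfarm worker list have been exported by hand into plain Python dicts; the function names, key names, worker names and addresses are hypothetical placeholders (documentation ranges), not real data:

```python
import ipaddress


def unknown_ips(whitelist_by_key, worker_ips):
    """Per API key, list whitelisted IPs that match no known worker."""
    known = {ipaddress.ip_address(ip) for ip in worker_ips.values()}
    report = {}
    for key, ips in whitelist_by_key.items():
        parsed = {ipaddress.ip_address(ip) for ip in ips}
        # Sort by (version, value): IPv4 and IPv6 objects cannot be compared directly.
        extra = sorted(parsed - known, key=lambda a: (a.version, int(a)))
        if extra:
            report[key] = [str(a) for a in extra]
    return report


def missing_workers(whitelist_by_key, worker_ips):
    """Per API key, list known workers whose IP is not whitelisted."""
    report = {}
    for key, ips in whitelist_by_key.items():
        allowed = {ipaddress.ip_address(ip) for ip in ips}
        missing = sorted(
            name
            for name, ip in worker_ips.items()
            if ipaddress.ip_address(ip) not in allowed
        )
        if missing:
            report[key] = missing
    return report


# Placeholder data (assumption): two workers, two API keys.
worker_ips = {"worker-a": "192.0.2.10", "worker-b": "2001:db8::1"}
whitelist_by_key = {
    "key-1": ["192.0.2.10", "2001:0db8::1", "198.51.100.7"],
    "key-2": ["192.0.2.10"],
}

stale = unknown_ips(whitelist_by_key, worker_ips)
gaps = missing_workers(whitelist_by_key, worker_ips)
```

Parsing through `ipaddress` rather than comparing strings means that two spellings of the same IPv6 address (`2001:0db8::1` and `2001:db8::1`) are treated as equal.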

rgaudin commented 5 days ago

Updated the sheet; there is only one I don't know. Are all the workers' owners aware of the risk and OK with it? I suppose you'll follow up so they enable the offliner in their config anyway.

benoit74 commented 5 days ago

aware of the risk and OK?

Could you elaborate? I'm not aware (or I've forgotten, sorry) of any risk with this scraper. The only risk I have in mind is the API key being blacklisted, but I don't remember any risk to the worker / IP.

rgaudin commented 5 days ago

The current scraper queries the YT API to retrieve the list of videos to download and some metadata. This API usage is subject to its own quota and can result in the API key being blocked. This only concerns us (we manage the API keys).

Once the list of videos is retrieved, the scraper downloads the videos via yt-dlp. At this stage, there is no longer any link to the API or anything else. The worker is just a random computer downloading videos via yt-dlp.

YT may slow down or temporarily block the IPs of workers downloading videos. This is the risk, especially if the worker shares its IP with a real user: either the user's legitimate usage gets blocked, or the user's legitimate usage, combined with the worker load, triggers the YT cop.

We don't know what triggers those countermeasures (speed, frequency of downloads, parallel downloads, tool, etc.), so it's difficult to adjust.

One thing we do know is that a block is mostly annoying on the zimfarm side. For the worker owner, it usually lasts a few hours or a couple of days, so it's recoverable. On the ZF side, though, it means tasks fail, then the worker takes a new YT task and fails again, consuming the whole pipe and producing nothing.
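
The failure loop described above (a blocked worker keeps picking up YT tasks and failing them) is the kind of situation a per-worker cool-down could contain. This is only a hypothetical sketch, not how zimfarm behaves today: the class name, the failure threshold and the cool-down duration are all assumptions chosen for illustration.

```python
import time


class OfflinerCooldown:
    """Pause one offliner on one worker after repeated consecutive failures."""

    def __init__(self, max_failures=3, cooldown_seconds=6 * 3600, clock=time.monotonic):
        self.max_failures = max_failures          # assumed threshold
        self.cooldown_seconds = cooldown_seconds  # assumed pause length
        self.clock = clock                        # injectable for testing
        self.failures = 0
        self.paused_until = None

    def record(self, success):
        """Register the outcome of a finished task."""
        if success:
            self.failures = 0
            self.paused_until = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.paused_until = self.clock() + self.cooldown_seconds

    def can_accept(self):
        """Tell whether the worker may take a new task for this offliner."""
        if self.paused_until is None:
            return True
        if self.clock() >= self.paused_until:
            # Cool-down elapsed: allow a single probe task. One more failure
            # re-triggers the pause immediately.
            self.paused_until = None
            self.failures = self.max_failures - 1
            return True
        return False
```

With such a guard, a blocked worker would fail a handful of YT tasks instead of draining the whole pipe, and would resume on its own once a probe task succeeds.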

benoit74 commented 5 days ago

OK, then yes, they are aware, and each IP is dedicated to the Zimfarm worker.

May I delete the obsolete IPs to clean this up?

Should I add pixelmemory host IP in every API key?

Should I add athena18 IPv6 in every API key as well?

rgaudin commented 5 days ago

May I delete the obsolete IPs to clean this up?

Yes, just keep my personal one. It's handy for tests

Should I add pixelmemory host IP in every API key?

I don't think that's necessary.

Should I add athena18 IPv6 in every API key as well?

I think it was left out on purpose but I have no additional information.

benoit74 commented 5 days ago

blade and razor are now running the Youtube scraper as well. The badger(s) owner has been informed and will soon reconfigure the existing worker and the new ones.