openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0

Remove ability for youtube scraper to create multiple ZIM (one per playlist) #878

Closed kelson42 closed 5 months ago

kelson42 commented 11 months ago

See https://github.com/openzim/youtube/issues/147.

Recipes using this configuration should be listed and a migration scenario should be decided first.

benoit74 commented 11 months ago

@Popolechien @RavanJAltaie

What do you want to do with these recipes which have the playlist mode enabled?

aimhi_playlists
dse_ladakh_lbj_playlists
keylearning_en
khan-videos_ar_playlists
khan-videos_bn_playlists
khan-videos_en_playlists
khan-videos_es_playlists
khan-videos_fr_playlists
khan-videos_tr_playlists
madrasa_ar_playlists
project-fuel
ruangguru_id_playlists
scienceinthebath_playlist
slam-out-loud_hi
tutorial-wikipedia
ubongo_sw
voa_learning_english_all_playlists
zenius_id_playlists

Popolechien commented 11 months ago

@benoit74 Let us go over them later today and get back to you.

Popolechien commented 11 months ago

Add Canadian prepper to the list of zim files that use multiple playlists and need updating, but basically we'll recreate them all with one playlist per recipe / zim file (except Khan, zenius and ruangguru).

benoit74 commented 11 months ago

Canadian prepper has no problem, it is creating one ZIM from a bunch of playlists. What we want to deactivate/remove is when you want to automatically create one ZIM per playlist in the channel / user. Except if it is indeed badly configured and what you wanted is multiple ZIMs, but that's another story.
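To make the distinction concrete, here is a minimal sketch that separates the two cases. It assumes recipe configurations are available as plain dictionaries shaped like the scraper configuration, and that the "indiv-playlists" flag is what turns on the one-ZIM-per-playlist behaviour. The recipe names, playlist IDs and helper functions below are hypothetical and are not part of the Zimfarm API.

```python
from typing import Any, Dict, List

Config = Dict[str, Any]


def creates_one_zim_per_playlist(config: Config) -> bool:
    """True when the recipe is in "playlist mode" (one ZIM per playlist).

    Assumption: the mode is switched on by a truthy "indiv-playlists" key.
    """
    return bool(config.get("indiv-playlists"))


def recipes_in_playlist_mode(recipes: Dict[str, Config]) -> List[str]:
    """Return the sorted names of recipes that would have to be migrated."""
    return sorted(
        name
        for name, config in recipes.items()
        if creates_one_zim_per_playlist(config)
    )


# Hypothetical data: the first recipe merges several playlists into ONE ZIM
# (fine), the second asks for ONE ZIM PER playlist (the mode to be removed).
recipes = {
    "many_playlists_one_zim": {"type": "playlist", "id": "PL_a,PL_b,PL_c"},
    "one_zim_per_playlist": {
        "type": "playlist",
        "id": "PL_a,PL_b",
        "indiv-playlists": True,
    },
}

affected = recipes_in_playlist_mode(recipes)
print(affected)
```

Only the second recipe is reported, even though both list several playlist IDs: the number of playlists is not the problem, the flag is.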

How should we move forward on this? Should we wait for you to recreate all recipes, or can we simply delete them and you will recreate them on the fly (deleting the recipe will not remove the ZIMs anyway)? Do you need a configuration export so that it will be easier to recreate?

Do you plan to create all recipes manually?

What do you mean by "except Khan, zenius and ruangguru"? Do you plan to simply delete these recipes and not create any more ZIMs for these ones?

Popolechien commented 11 months ago

What we want to deactivate/remove is when you want to automatically create one ZIM per playlist in the channel / user.

Ok this I had misunderstood. If there is a way to easily export/duplicate the recipes then by all means let's do it, but otherwise we'll need to recreate them manually (and then only delete the original ones).

Khan, zenius and ruangguru will be deleted entirely (recipes AND zim files).

benoit74 commented 11 months ago

OK, thank you.

Export is easy, it will just be a "raw" copy of the configuration, just so that you have a reference of the old configuration. E.g. for aimhi_playlists (I redacted the secrets), I will export the configuration below and then delete the recipe:

{
    "api-key": "**********",
    "concurrency": 1,
    "debug": true,
    "format": "webm",
    "id": "PLr5n3ojAJWjSVnG_EK1xF3rW1Lo0N2qwA,PLr5n3ojAJWjQEiRIuHlRoN7rBDKG6GbvH,PLr5n3ojAJWjRGQ1DnnIqDrIXKuuNydGid,PLr5n3ojAJWjTkqDW49ew1u7vsbVzM5Uub,PLr5n3ojAJWjSVh9mgusLb6npGNnFif_Sw,PLr5n3ojAJWjRSuu4s5Vu1CEN0rkgXZ-VA,PLr5n3ojAJWjRDmRmIVAsD4MSEt7wMSOGr,PLr5n3ojAJWjSNVp6jrlwXPz5MyFArv6oO,PLr5n3ojAJWjRiuPrUAAveWrrqnNPPNzYK",
    "indiv-playlists": true,
    "language": "eng",
    "low-quality": true,
    "main-color": "#FFFFFF",
    "optimization-cache": "https://s3.us-west-1.wasabisys.com/?keyId=*****&secretAccessKey=*****&bucketName=org-kiwix-youtube",
    "output": "/output",
    "playlists-description": "The nature-first, curiosity-powered online school for ages 8-18",
    "playlists-name": "aimhi_en_-{title}",
    "playlists-title": "AimHi",
    "playlists-zim-file": "aimhi_en_{slug}_{period}",
    "tags": "aimhi",
    "tmp-dir": "/output",
    "type": "playlist"
}

Is this useful?
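If someone wanted to use such an export as a starting point, the rough idea could be sketched as below. This is only an illustration under assumptions: it treats the export as a plain dictionary, splits the comma-separated "id" value, and drops "indiv-playlists" and the "playlists-*" keys. The function name and the sample IDs are hypothetical. The per-ZIM name, title and filename would still have to be filled in by hand, since the "playlists-*" templates rely on placeholders such as {title} and {slug}.

```python
from typing import Any, Dict, List

Config = Dict[str, Any]


def split_per_playlist(exported: Config) -> List[Config]:
    """Turn one exported playlist-mode config into one config per playlist.

    Keys specific to playlist mode are dropped; every other key is copied.
    """
    playlist_ids = [p.strip() for p in exported["id"].split(",") if p.strip()]
    base = {
        key: value
        for key, value in exported.items()
        if key != "indiv-playlists" and not key.startswith("playlists-")
    }
    configs = []
    for playlist_id in playlist_ids:
        config = dict(base)  # shallow copy is enough for flat configs
        config["id"] = playlist_id
        configs.append(config)
    return configs


# Shortened, hypothetical export (IDs are placeholders, not real playlists).
exported = {
    "format": "webm",
    "id": "PL_first,PL_second",
    "indiv-playlists": True,
    "language": "eng",
    "playlists-name": "aimhi_en_-{title}",
    "type": "playlist",
}

new_configs = split_per_playlist(exported)
for config in new_configs:
    print(config)
```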

Duplicate is something you already have with the "Clone" button. But you still have to input everything else.

From my PoV, this last remark emphasizes that:

I don't know if we should live with it, try some quick and dirty wins on some of these topics, or implement a real solution.

Popolechien commented 11 months ago

It is probably for @RavanJAltaie to decide how she wants to proceed, but I'm not sure the export is really useful, as neither she nor I have the skills to create the new recipes via script. Our last discussion was to clone existing recipes (in which case I'd delete them after the deed is done).

As for next steps, the quick and dirty tends to be somewhat permanent in this house and not exactly convenient for the non-dev end user either: I suggest we park it until this becomes a real project.

benoit74 commented 11 months ago

OK, so next steps before I can start to work on this issue are:

Correct? Note that I'm not speaking about the deletion of unwanted ZIMs, since there is no dependency AFAIK and we can do it at any time, at your own convenience

I'm waiting for your GO to perform the last step, which consists of removing the ability to use the "playlists mode" in Zimfarm.

Popolechien commented 11 months ago

I realize that @RavanJAltaie was not on the thread and missed that part. I've assigned her now so she can confirm to you when all new recipes have been created

RavanJAltaie commented 11 months ago

Now I'm confused. Recipes with multiple playlists are OK? The only thing to do is deactivate the playlist mode? @benoit74

benoit74 commented 11 months ago

@RavanJAltaie Yes, you are right. Recipes with multiple playlists in one ZIM are OK.

benoit74 commented 11 months ago

and yes we just want to get rid of the playlist mode

RavanJAltaie commented 9 months ago

All fixed successfully!

benoit74 commented 9 months ago

Great, thank you!

I'm reopening the issue because I still have my part of the job to do (remove the ability to create youtube recipes which will create multiple ZIMs at once).

benoit74 commented 9 months ago

@RavanJAltaie I'm sorry but madrasa_ar_playlists is still using the Playlists mode, please fix it before I can proceed.

rgaudin commented 9 months ago

It's not clear from this ticket what happened exactly and what will happen:

RavanJAltaie commented 8 months ago

All fixed.

benoit74 commented 8 months ago

Again, I still have my part of the job to do

benoit74 commented 8 months ago

@RavanJAltaie could you please detail recipe per recipe of https://github.com/openzim/zimfarm/issues/878#issuecomment-1851636439 what has been done?

I had a quick look and it seems that in many cases, you simply removed the playlist mode and created one big ZIM instead of many small ones, is this correct? The only exception is madrasa?

When you used this "create only one ZIM instead of many small ones" approach, it looks like you kept the old small ZIMs in the library. Is this intentional? Is the content evergreen, so that we do not mind keeping them in the library without updating them anymore?

I'm not convinced by this strategy: usually there were only 5 or 6 playlists, and it did not look like the number of playlists was frequently updated. Small ZIMs are usually more practical for our users. For https://farm.openzim.org/recipes/voa_learning_en_all for instance, we moved from ZIMs ranging from 59.48 MB to 12.72 GB to one enormous (from my perspective at least) 24.93 GB ZIM. But maybe users are always downloading all ZIMs, so the extra work to create individual ZIMs is not worth it. It is just that this decision is very opaque and has not been explained, so it feels a bit weird.

For madrasa I'm not convinced about the ZIM name / filename. For instance you chose madrasa_astronomy_ar_all while I consider it should be madrasa_ar_astronomy (project is madrasa, selection is astronomy, just like we have wikipedia_en_football, ...)

And for madrasa is there any reason to keep the two disabled recipes? Especially madrasa_ar_playlists which still uses the playlist mode?

RavanJAltaie commented 8 months ago

@benoit74

I had a quick look and it seems that in many cases, you simply removed the playlist mode and created one big ZIM instead of many small ones, is this correct? The only exception is madrasa?

Yes that's correct, this is the decision made by @Popolechien & me after discussing the #878 issue.

I'm not convinced by this strategy: usually there were only 5 or 6 playlists, and it did not look like the number of playlists was frequently updated. Small ZIMs are usually more practical for our users. For https://farm.openzim.org/recipes/voa_learning_en_all for instance, we moved from ZIMs ranging from 59.48 MB to 12.72 GB to one enormous (from my perspective at least) 24.93 GB ZIM. But maybe users are always downloading all ZIMs, so the extra work to create individual ZIMs is not worth it. It is just that this decision is very opaque and has not been explained, so it feels a bit weird.

That's the strategy followed in creating the madrasa playlists. For the few corrected playlists, we've decided to keep them in one file, but I can re-discuss this with @Popolechien today and change it if agreed upon. Personally I don't think it is worth splitting the playlists.

For madrasa I'm not convinced about the ZIM name / filename. For instance you chose madrasa_astronomy_ar_all while I consider it should be madrasa_ar_astronomy (project is madrasa, selection is astronomy, just like we have wikipedia_en_football, ...)

I agree with you, I'll change the naming for all the files and apply this on new creations as well.

And for madrasa is there any reason to keep the two disabled recipes? Especially madrasa_ar_playlists which still uses the playlist mode?

No, no reason, I'll open an issue to delete them.

rgaudin commented 8 months ago

Also, as the convention clearly expresses, using a project name instead of a domain name should be exceptional. I have the feeling this rule is frequently abused. @Popolechien @RavanJAltaie please clarify this.

RavanJAltaie commented 8 months ago

Also, as the convention clearly expresses, using a project name instead of a domain name should be exceptional. I have the feeling this rule is frequently abused. @Popolechien @RavanJAltaie please clarify this.

in this case the naming for madrasa should be: Youtube_ar_madrasa_astronomy?

rgaudin commented 8 months ago

madrasa.org_ar_astronomy for this one, but for all the youtube-only recipes, a convention needs to be decided, documented and followed.

benoit74 commented 8 months ago

in this case the naming for madrasa should be: Youtube_ar_madrasa_astronomy?

We must reserve _ as separator in ZIM name and ZIM filename, i.e. project and selection must use only alphanums + . + -, I will update the convention to make it clearer (speak up if I forgot a needed character).
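As an illustration of that rule, a name check could look like the sketch below. It assumes the simplified three-part form {project}_{lang}_{selection} seen in the examples of this thread (wikipedia_en_football, madrasa_ar_astronomy), and that the language part is purely alphabetic. The real convention may allow more parts, so both the function names and the three-part assumption are hypothetical.

```python
import re

# Allowed characters for project and selection: alphanumerics, "." and "-".
# "_" is excluded on purpose because it is reserved as the separator.
PART_RE = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_part(part: str) -> bool:
    """True if the part is non-empty and uses only the allowed characters."""
    return PART_RE.fullmatch(part) is not None


def is_valid_zim_name(name: str) -> bool:
    """Check a name against the assumed {project}_{lang}_{selection} form."""
    parts = name.split("_")
    if len(parts) != 3:
        # An extra "_" inside project or selection yields extra parts.
        return False
    project, lang, selection = parts
    return is_valid_part(project) and lang.isalpha() and is_valid_part(selection)


for candidate in (
    "madrasa.org_ar_astronomy",
    "wikipedia_en_football",
    "Youtube_ar_madrasa_astronomy",
):
    print(candidate, is_valid_zim_name(candidate))
```

Under these assumptions the last candidate is rejected because "madrasa_astronomy" as a selection introduces a fourth "_"-separated part; something like "madrasa-astronomy" would pass.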

That's the strategy followed in creating the madrasa playlists. For the few corrected playlists, we've decided to keep them in one file, but I can re-discuss this with @Popolechien today and change it if agreed upon. Personally I don't think it is worth splitting the playlists.

No need to discuss it again if it has been agreed upon. It would just have been better to put these conclusions here beforehand, so that everyone involved is aware of them and we keep a track record. I'm pretty sure we will have a question about it in a few months.

What about older ZIMs (per playlist), do we keep them in the library? For madrasa, since you are changing the name, you will probably also have to delete older ZIMs.

benoit74 commented 5 months ago

Let's close this issue regarding the Zimfarm ability to create multiple ZIMs per recipe for the youtube scraper.

Everything else can be discussed separately if needed (and is at least partially already an ongoing effort)