openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: hra-news.org #832

Closed benoit74 closed 2 months ago

benoit74 commented 8 months ago

This is a subtask of https://github.com/openzim/zim-requests/issues/826 for tracking recipe progress one by one and avoid confusion.

Recipe already created here: https://farm.openzim.org/recipes/hra-news.org_persian

benoit74 commented 8 months ago

@RavanJAltaie why did you cancel the task https://farm.openzim.org/pipeline/d181a07b-62ae-4c20-b903-2b9bd92b4b07 and start it again?

This issue is assigned to me, so please at least keep me informed when you do maintenance work. I have no idea what happened, and we have been waiting for this task to complete for many days now!

RavanJAltaie commented 8 months ago

@Popolechien asked me to keep monitoring the recipes pipeline: whenever a recipe takes longer than 1 week, I need to cancel it and request it again. We can discuss stopping this practice, as I notice it's causing a lot of confusion here.

benoit74 commented 8 months ago

Yes, we need to discuss it and adapt this. The world is not black and white; your actions need to be way "smoother".

What we probably need (and I think almost everyone in the dev team is aligned, we've already discussed it and Stephane confirmed he understood it the same way) is that whenever a recipe is taking longer than 7 days, you start to pay attention to it. Whenever it gets close to 30 days, the whole content team is warned and the task is probably cancelled.

Proposed workflow

From 7 days up to 30 days, you only have to check two things: is the task taking way longer than usual / last runs? is the task still progressing?

First, check if the task is taking way longer than usual / last runs. If yes, then you should probably cancel it and request it again, but only once. It could be a temporary glitch not worth investigating. If the problem shows up again after that, you have to open a ticket to ask for help.

Then you check if the task is still progressing, i.e. do we regularly have new logs appearing? Do we have the stats (progress percentage) regularly updated? If there is clearly no progress for more than 24 hours, then you have to check whether the task has been manually submitted by someone. If it has been manually requested by someone, you have to open a ticket and ping this person, who will probably take care of the next steps. If it has been automatically requested by the periodic-scheduler, you have to cancel it and request it again, only once. If it happens two times in a row, open a ticket and ask for help.

One important thing to note is the "no progress during 24 hours". We are not in a hurry to cancel a task which has already run for 7 days; take some time to ensure we do not waste resources by canceling something which is still progressing.

If there is some progress on the task, you simply have to continue to monitor if everything is fine (twice a week is typically enough).

Once the task gets close to the 30-day limit (and if it is still progressing, otherwise we are in the situation above), you should open a ticket and ask for help. Depending on the progress already made and the expected duration of the task, we might decide to let the task finish or decide to cancel it. This has to be a collective decision of the content team.
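To make the proposed workflow easier to discuss, here is a minimal Python sketch of the decision logic described above. It is only an illustration, not something that exists in the Zimfarm: the names TaskStatus, Action and next_action are hypothetical, and the near_limit_days value of 28 is an assumption, since "close to 30 days" is not quantified in the proposal.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NOTHING = "7 days or less: nothing to do"
    KEEP_MONITORING = "still progressing: keep monitoring (about twice a week)"
    CANCEL_AND_REREQUEST = "cancel and request again, once only"
    OPEN_TICKET = "open a ticket and ask for help"
    OPEN_TICKET_AND_PING_REQUESTER = "open a ticket and ping the requester"
    ESCALATE_TO_CONTENT_TEAM = "open a ticket, the content team decides collectively"


@dataclass
class TaskStatus:
    # All fields are hypothetical inputs a person would fill in by hand.
    days_running: float
    much_longer_than_usual: bool   # compared to usual / last runs
    hours_without_progress: float  # no new logs, no stats update
    manually_requested: bool       # False means periodic-scheduler
    already_rerequested: bool      # cancel + re-request already tried once


def next_action(task: TaskStatus, near_limit_days: float = 28) -> Action:
    # near_limit_days is an assumed reading of "close to 30 days".
    if task.days_running <= 7:
        return Action.NOTHING

    # Check 1: is the task taking way longer than usual?
    if task.much_longer_than_usual:
        if task.already_rerequested:
            return Action.OPEN_TICKET
        return Action.CANCEL_AND_REREQUEST

    # Check 2: is the task still progressing (24 hours without progress)?
    if task.hours_without_progress > 24:
        if task.manually_requested:
            return Action.OPEN_TICKET_AND_PING_REQUESTER
        if task.already_rerequested:
            return Action.OPEN_TICKET
        return Action.CANCEL_AND_REREQUEST

    # The task is progressing: escalate only when close to the 30-day limit.
    if task.days_running >= near_limit_days:
        return Action.ESCALATE_TO_CONTENT_TEAM
    return Action.KEEP_MONITORING
```

For example, next_action(TaskStatus(10, False, 2, False, False)) returns Action.KEEP_MONITORING, while the same task at 29 days returns Action.ESCALATE_TO_CONTENT_TEAM. The order of the two checks follows the order in the text; whether that order is the right one is part of what needs to be discussed.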

Additional remarks

In any case, for every task which has been running for more than 7 days, I think you should open a zim-request ticket indicating what is going on. This will help everyone know that something is happening, that you are monitoring the situation, and what actions you've already taken. It will shed light on how much maintenance work you are doing in addition to creating new recipes.

Maybe the 7 days limit is a bit too short and we should target something like 10 days or even more, I don't know. We will have to experiment on this, but the only way to know is to start with a low threshold and see whether, in too many cases, we have spent time monitoring something which was all good in the end.

Of course, we need to discuss this live, you need to speak up about everything which is not working for you, and we need to document all this in a nice workflow. And we will probably continue to adapt this in the coming months and years; this is only a first proposition / draft version of a Zimfarm monitoring workflow.

RavanJAltaie commented 8 months ago

@benoit74 Noted

benoit74 commented 5 months ago

A test ZIM is ready and mostly functional (I would say 90%) at https://dev.library.kiwix.org/viewer#hra-news-org_far_all_2024-05/

Known issues:

image

image

E.g. https://dev.library.kiwix.org/viewer#hra-news-org_far_all_2024-05/www.hra-news.org/periodical/a-164/

image

benoit74 commented 5 months ago

Just created the custom CSS: https://drive.farm.openzim.org/zimit_custom_css/www.hra-news.org.css

benoit74 commented 5 months ago

Regarding "Some images on main page carousel are not displaying at all, they are even absent from the HTML code (to be investigated)" issue, the statement was not totally correct.

HTML code is OK. Problem is only that some images of the carousel have not been fetched. Passing --waitUntil load,networkidle0 does not help. Developing a custom behavior that would click on all carousel buttons would solve the issue (and this behavior should trigger only on home page of www.hra-news.org). To do later, not an urgent topic / big blocker for ZIM usage AFAIK.

benoit74 commented 5 months ago

So to summarize, zones in green below have been hidden (search box, links to "external" websites / donation / ..., form to submit comments); the zone in purple is the carousel where some images are missing.

image

image

benoit74 commented 5 months ago

In fact, looking at the logs more closely, it is clear that the website is not complete: crawling has been interrupted by a full disk (again ...).

Popolechien commented 5 months ago

crawling has been interrupted by disk full

Like, how big can this site possibly be (or were our settings too low)?

benoit74 commented 5 months ago

Worker settings were low: only 30G of disk space was available and other tasks were running at the same time.

But it is still a good question.

The crawler had fetched only 93238 pages and had already found 167833 pages to process. The WARC size was already 23G, so even if no new pages were going to be discovered (which we know is not the case), the WARC would be at least 41G.

23G for 93238 pages gives us about 246k per page on average. For only photos and HTML, this is not negligible, but I've found quite a lot of "big" pictures (hundreds of KB) while browsing randomly (e.g. https://www.hra-news.org/periodical/a-166/ has a 373k picture in the upper left corner), there are PDF reports (e.g. https://www.hra-news.org/periodical/a-157/) and videos on a few pages (e.g. https://www-hra--news-org.translate.goog/periodical/a-144/), so this could easily explain the storage consumption.

The even better question would be: are we sure our client is going to be able to cope with such big ZIMs? Should we only fetch the most recent news? Looking at the logs, hra-news.org has articles going back to 2010 online; not sure this is still valuable information. But maybe it is going to be hard to remove them from the crawl, and it might not help much with archive size anyway since we might have to keep PDFs and videos, for instance.
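The two estimates above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1G = 10^9 bytes, 1k = 10^3 bytes) and that the pages still to be crawled are, on average, as heavy as the ones already fetched. The function names are made up for the illustration.

```python
def average_page_size_kb(warc_size_gb: float, fetched_pages: int) -> float:
    """Average WARC payload per fetched page, in kB (decimal units assumed)."""
    return warc_size_gb * 1e9 / fetched_pages / 1e3


def projected_warc_size_gb(
    warc_size_gb: float, fetched_pages: int, found_pages: int
) -> float:
    """Linear extrapolation: remaining pages assumed as heavy as fetched ones."""
    return warc_size_gb * found_pages / fetched_pages


# Figures taken from the crawl logs quoted in this comment.
fetched = 93_238
found = 167_833
warc_gb = 23

print(f"average per page: {average_page_size_kb(warc_gb, fetched):.1f} kB")
print(f"projected WARC:   {projected_warc_size_gb(warc_gb, fetched, found):.1f} G")
```

The projection is a lower bound only, since the crawler was still discovering new pages when the disk filled up.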

benoit74 commented 4 months ago

The task has been running for more than 21 days and stats are 392574 / 465866. I've closely monitored the task for a few days and it looks like the crawler is now only processing tag pages (e.g. https://www.hra-news.org/tag/%d8%a7%d8%b9%d8%aa%d8%b1%d8%a7%d8%b6%d8%a7%d8%aa-%d8%b1%d9%88%d8%b2%d8%a7%d9%86%d9%87/), which themselves have many pages. And it looks like there are many, many tags in the system, see e.g. https://www.hra-news.org/2024/hranews/a-49000/. I've modified the settings to ignore these tag pages; hopefully it will make the run complete way faster. I've also modified the custom CSS to hide these tags on article pages. I did not find them anywhere else, but I might have missed something.
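For illustration, an exclusion rule for tag pages could be based on a regular expression like the one in the Python sketch below. The exact setting applied to the recipe is not shown in this thread, so the pattern and the is_tag_page helper are assumptions, checked only against the two URLs quoted above.

```python
import re

# Assumed pattern: any URL whose path starts with /tag/ on www.hra-news.org.
TAG_PAGE_RX = re.compile(r"^https?://www\.hra-news\.org/tag/")


def is_tag_page(url: str) -> bool:
    """Return True when the URL looks like a tag listing page."""
    return TAG_PAGE_RX.match(url) is not None


tag_url = (
    "https://www.hra-news.org/tag/"
    "%d8%a7%d8%b9%d8%aa%d8%b1%d8%a7%d8%b6%d8%a7%d8%aa-"
    "%d8%b1%d9%88%d8%b2%d8%a7%d9%86%d9%87/"
)
article_url = "https://www.hra-news.org/2024/hranews/a-49000/"

print(is_tag_page(tag_url))      # tag listing: would be skipped
print(is_tag_page(article_url))  # regular article: would still be crawled
```

Anchoring the pattern on the /tag/ path prefix keeps regular articles in scope even when their own URL or title happens to contain the word "tag".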

benoit74 commented 4 months ago

The last task failed on the Zimfarm due to known issues which have been solved since the task started.

Since the WARC files had been produced, I ran the warc2zim conversion locally on my machine and it succeeded in producing a ZIM which looks quite promising from my PoV: https://dev.library.kiwix.org/viewer#hra-news.org_far_all_2024-06/ (unfortunately the dev library is currently very slow; don't be alarmed by this slowness, which I did not expect after testing the ZIM locally)

@Popolechien could you please have a look at this ZIM, and if happy you can probably advertise this one as well to our client

Popolechien commented 4 months ago

LGTM, thanks!

benoit74 commented 3 months ago

A new ZIM has been produced fully automatically this time, with updated content. I have therefore enabled the recipe, which is currently scheduled to update quarterly. I will close this issue; work is definitely completed for this recipe.

benoit74 commented 2 months ago

Reopening, file has not yet been moved to prod

benoit74 commented 2 months ago

ZIM is ready at https://library.kiwix.org/viewer#hra-news.org_far_all or https://download.kiwix.org/zim/zimit/hra-news.org_far_all_2024-09.zim