openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org
New request: BBC Persian #769

Open benoit74 opened 8 months ago

benoit74 commented 8 months ago

This is a special request for the Zimit 2.0 project. Devs will handle it first to test the new scraper, and only once it is working will it be transferred to the content team.

benoit74 commented 8 months ago

@Popolechien could you please fill in the missing info (title, description, icon, ...)? (no urgency)

benoit74 commented 8 months ago

I've created some test Browsertrix crawler outputs that I've uploaded to https://tmp.kiwix.org/ci/bbc_persian_2023_12/

I've limited the depth with the --depth parameter. There are two files for now: one with --depth 1 (the home page and all pages referenced on it) and one with --depth 2. I'm trying to run a --depth 3 crawl, but I'm afraid I might run out of disk space on my server.
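To make the depth semantics described above concrete, here is a minimal sketch of a depth-limited, breadth-first traversal. It is not Browsertrix code: the function name `pages_within_depth` and the toy link graph `TOY_LINKS` are hypothetical, and the sketch assumes that depth counts link hops from the seed page.

```python
from collections import deque

def pages_within_depth(links, seed, max_depth):
    """Return {page: depth} for every page reachable from `seed`
    in at most `max_depth` link hops (breadth-first)."""
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # pages at the limit are kept, but their links are not followed
        for target in links.get(page, []):
            if target not in depth_of:  # first visit is the shortest path in BFS
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of

# Hypothetical link graph, for illustration only (not real BBC Persian pages).
TOY_LINKS = {
    "home": ["article-a", "article-b"],
    "article-a": ["article-c"],
    "article-b": ["home"],
    "article-c": ["article-d"],
}
```

Under these assumptions, depth 0 keeps only the seed, depth 1 adds the pages linked from it, and each extra level adds one more hop, which is why the output size can grow quickly between depth 2 and depth 3.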

As discussed, we should probably use these ones as a reference for testing zimit2 since most changes are done in the transformation of Browsertrix WARC files into a ZIM.

I have not tested the quality of the WARC files so far. I hope the content is appropriate; let me know if you run into any issues.

benoit74 commented 8 months ago

@rgaudin @kelson42 @RavanJAltaie FYI, this issue might interest you as well

Popolechien commented 8 months ago

@benoit74 I've updated the request parameters

mgautierfr commented 8 months ago

Here is a ZIM version (depth 2) of the WARC: https://tmp.kiwix.org/ci/bbc_persian_2023_12/bbc_depth2.zim

It is made with the latest version of https://github.com/openzim/warc2zim/pull/142. It seems to work. I have found two issues:

RavanJAltaie commented 8 months ago

@benoit74 do we need to create a recipe for this, or will it be taken care of entirely by you as part of the Zimit 2.0 project?

kelson42 commented 8 months ago

@RavanJAltaie The ticket is assigned to them; they will take care of it and will indeed create a recipe.

benoit74 commented 8 months ago

@mgautierfr did you try to open any article? I tested it with kiwix-serve 3.6.0, and every time I open an article (i.e. not the main page, but any thumbnail on the main page) it starts by opening an "in-ZIM" URL, and then it looks like something is triggered which finally opens the real article on the real online website instead of staying on the local one. Did you experience this as well?

benoit74 commented 8 months ago

Also note that I uploaded a --depth=3 file as well now: https://tmp.kiwix.org/ci/bbc_persian_2023_12/bbc_persian_crawl_depth_3.tar.gz

mgautierfr commented 8 months ago

it starts by opening an "in-ZIM" URL, and then it looks like something is triggered which finally opens the real article on the real online website instead of staying on the local one. Did you experience this as well?

@benoit74 you have to test with this PR: https://github.com/kiwix/libkiwix/pull/1036

benoit74 commented 8 months ago

Arg, ok ... Sorry about that.

benoit74 commented 7 months ago

Recipe is ready: https://farm.openzim.org/recipes/bbc.com_persian

@Popolechien or @RavanJAltaie (at least; others are welcome), could you please review it to confirm it is OK before I enable it? It will probably run for many days, so it is better to confirm that the recipe configuration (name, description, ...) matches our expectations before starting a run.

kelson42 commented 7 months ago

Last task failed: https://farm.openzim.org/pipeline/d242f22c-2794-453c-a4a0-8dab259275fd/debug

benoit74 commented 7 months ago

Yup, the browser crashed. Let's start it again.

benoit74 commented 6 months ago

Blocked by the website provider: all requests end with the classic 90-second timeout.

No idea how to move this request forward for now.

benoit74 commented 4 months ago

Zimit2 now succeeds in crawling the website thanks to Browsertrix crawler 1.x.

However, there is an issue regarding disk usage. The exact problem is unclear: the crawler stops with the message "Disk utilization projected to reach threshold 90% > 90%, stopping", yet the projected size is not expected to differ from the actual size, since we do not run the crawler with the combineWARC or generateWACZ args (see https://github.com/webrecorder/browsertrix-crawler/blob/15d2b0975785dfaeda165252f53b834f1af2b455/src/util/storage.ts#L231).

I've created a custom Zimit image that adds more logs in the crawler when the problem arises and disables the projected disk size check (only the actual size will be used for this run): https://farm.openzim.org/pipeline/a1702f94-7da3-476e-9d78-3d9c2090efcd
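The difference between an actual and a projected disk check, as described in this comment, can be sketched as follows. This is only an illustration and not the crawler's code (the linked storage.ts is the authoritative source). The function `check_disk`, its parameters, and the "one extra copy of the archive per post-processing step" model are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class DiskCheck:
    stop: bool
    reason: str

def check_disk(used_kb, total_kb, archive_kb, threshold_pct,
               combine_warc=False, generate_wacz=False, use_projection=True):
    """Decide whether a crawl should stop because of disk usage.

    ASSUMPTION (illustration only): each enabled post-processing step
    writes roughly one more copy of the archive to disk.
    """
    used_pct = used_kb * 100 // total_kb  # integer percentage, rounded down
    if used_pct >= threshold_pct:
        return DiskCheck(True, f"actual usage {used_pct}% >= {threshold_pct}%")
    if not use_projection:
        # Mirrors the custom run above: only the actual size is checked.
        return DiskCheck(False, f"actual usage {used_pct}% < {threshold_pct}%")
    extra_copies = int(combine_warc) + int(generate_wacz)
    projected_pct = (used_kb + extra_copies * archive_kb) * 100 // total_kb
    if projected_pct >= threshold_pct:
        return DiskCheck(True, f"projected usage {projected_pct}% >= {threshold_pct}%")
    return DiskCheck(False, f"projected usage {projected_pct}% < {threshold_pct}%")
```

In this model, a run without either post-processing flag has a projection equal to the actual usage, which is the expectation stated above. If the real crawler stops on the projected check anyway, its projection is presumably computed differently from this sketch.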

benoit74 commented 3 months ago

Just uploaded a new WARC which is supposed to be complete this time: https://tmp.kiwix.org/ci/test-warc/bbc.com_persian_2024_05_07/627e21ac-23e0-450b-be25-1aac8601a9a6_zimit.tar

benoit74 commented 3 months ago

Videos are not working at all. It looks like a crawling issue, because the videos do not play in replayweb.page either when the WARC is loaded there.

Test page: https://www.bbc.com/persian/articles/c3g8wrnd08no

Command used: docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.1.3 crawl --failOnFailedSeed --behaviors "autoplay,autofetch,autoscroll" --url https://www.bbc.com/persian/articles/c3g8wrnd08no --mobileDevice "Pixel 2" --cwd /output --combineWARC --depth 0

Test WARC: https://tmp.kiwix.org/ci/test-warc/bbc-video-20240530.warc.gz
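For anyone who wants to repeat this test on other article URLs, the command above can be assembled programmatically. The sketch below only rebuilds the argument list quoted in this comment; the helper name `build_crawl_command` and its parameters are hypothetical, and nothing is executed until the list is passed to something like `subprocess.run`.

```python
import shlex

def build_crawl_command(url, output_dir,
                        image="webrecorder/browsertrix-crawler:1.1.3",
                        depth=0):
    """Return the docker argv for a single-seed test crawl.

    All flags are copied from the command quoted above; only the URL,
    output directory, image tag, and depth are parameterized.
    """
    return [
        "docker", "run",
        "-v", f"{output_dir}:/output",
        "--name", "crawlme",
        "--rm", image,
        "crawl",
        "--failOnFailedSeed",
        "--behaviors", "autoplay,autofetch,autoscroll",
        "--url", url,
        "--mobileDevice", "Pixel 2",
        "--cwd", "/output",
        "--combineWARC",
        "--depth", str(depth),
    ]

cmd = build_crawl_command(
    "https://www.bbc.com/persian/articles/c3g8wrnd08no", "/tmp/output"
)
# shlex.join quotes arguments such as "Pixel 2" so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell quoting problems with the "Pixel 2" argument.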

benoit74 commented 2 months ago

So I confirm that, for this ZIM, the only problems are with videos and with live streams (live coverage with text and other dynamic content).

The problem is that video is obviously quite significant for this ZIM, and I don't know how feasible it is to fix this on the crawler side.