Open benoit74 opened 8 months ago
@Popolechien could you please fill in the missing info (title, description, icon, ...)? (no urgency)
I've created some test Browsertrix crawler outputs that I've uploaded to https://tmp.kiwix.org/ci/bbc_persian_2023_12/
I've limited the depth with the --depth parameter. There are two files for now, one with --depth 1 (the home page and all pages referenced on this home page) and one with --depth 2. I'm trying to run a --depth 3 crawl but I'm afraid I might run out of disk space on my server.
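To make the depth values concrete, here is a minimal sketch of a depth-limited, breadth-first traversal. It is not how Browsertrix crawler is implemented; the `links` graph and the `pages_within_depth` helper are hypothetical and only illustrate that depth 0 is the seed page alone, depth 1 adds every page it links to, and so on.

```python
from collections import deque


def pages_within_depth(links, seed, max_depth):
    """Return {page: depth} for pages reachable from `seed` in at most `max_depth` link hops."""
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depths[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in depths:  # keep the shortest distance only
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link graph, for illustration only.
links = {
    "home": ["article-1", "article-2"],
    "article-1": ["article-3"],
    "article-2": ["home", "article-3"],
    "article-3": ["article-4"],
}

for depth in (0, 1, 2):
    print(depth, sorted(pages_within_depth(links, "home", depth)))
```

The number of pages usually grows quickly with each extra level, which is why a depth 3 crawl can need far more disk space than a depth 2 one.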
As discussed, we should probably use these ones as a reference for testing zimit2 since most changes are done in the transformation of Browsertrix WARC files into a ZIM.
I've not tested the quality of the WARC files so far, I hope content is appropriate, let me know if you face any issue.
@rgaudin @kelson42 @RavanJAltaie FYI, this issue might interest you as well
@benoit74 I've updated the request parameters
Here is a ZIM version (depth 2) of the WARC: https://tmp.kiwix.org/ci/bbc_persian_2023_12/bbc_depth2.zim
It is made with the latest version of https://github.com/openzim/warc2zim/pull/142. It seems to work. I have found two issues:
(www.bbc.com/persian/live/67778802) accessible from the main page by the link with a "live" icon (a dot in a circle)
@benoit74 do we need to create a recipe for this? Or will it be totally taken care of by you as part of the Zim 2.0 project?
@RavanJAltaie Ticket is assigned to them, they will take care and indeed create a recipe.
@mgautierfr did you try to open any article? I tested it with kiwix-serve 3.6.0 and every time I open an article (i.e. not the main page but any thumbnail on this main page) it starts by opening an "in-ZIM" URL and then it looks like something is triggered which finally opens the real article on the real online website instead of staying on the local one. Did you experience this as well?
Also note that I uploaded a --depth=3 file as well now: https://tmp.kiwix.org/ci/bbc_persian_2023_12/bbc_persian_crawl_depth_3.tar.gz
it starts by opening an "in-ZIM" URL and then it looks like something is triggered which finally opens the real article on the real online website instead of staying on the local one. Did you experience this as well?
@benoit74 you have to test with this PR: https://github.com/kiwix/libkiwix/pull/1036
Arg, ok ... Sorry about that.
Recipe is ready: https://farm.openzim.org/recipes/bbc.com_persian
@Popolechien or @RavanJAltaie (at least, others are welcome), could you please review it to confirm it is ok before I enable it? It will probably run for many days, so it is better to confirm that the recipe configuration (name, description, ...) matches our expectations before starting a run.
Yup, the browser crashed. Let's start it again.
Blocked by website provider, all requests are finishing with the classic 90 secs timeout.
No idea how to move this request forward for now.
Zimit2 now succeeds in crawling the website thanks to Browsertrix crawler 1.x.
However, there is an issue regarding disk usage. The exact problem is unclear, because the crawler stops saying "Disk utilization projected to reach threshold 90% > 90%, stopping" while the projected size is not expected to differ from the actual size, since we do not run the crawler with the combineWARC or generateWACZ args (see https://github.com/webrecorder/browsertrix-crawler/blob/15d2b0975785dfaeda165252f53b834f1af2b455/src/util/storage.ts#L231).
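For readers unfamiliar with this kind of check, here is a minimal sketch of a projected disk utilization test. It is not the Browsertrix crawler code (the linked storage.ts is the reference); the function names, the `extra_copies` parameter and the `>=` comparison are assumptions made for illustration. `extra_copies` stands for the additional copies of the archive that post-processing steps such as combining WARCs or generating a WACZ would write.

```python
def projected_utilization(used_bytes, total_bytes, archive_bytes, extra_copies=0):
    """Percentage of the disk in use once `extra_copies` more copies of the archive are written."""
    projected_bytes = used_bytes + archive_bytes * extra_copies
    return 100 * projected_bytes / total_bytes


def should_stop(used_bytes, total_bytes, archive_bytes, threshold_pct=90, extra_copies=0):
    """True when the projected utilization reaches the threshold (assumed inclusive)."""
    projected = projected_utilization(used_bytes, total_bytes, archive_bytes, extra_copies)
    return projected >= threshold_pct


# Hypothetical numbers: 800 units used out of 1000, 100 of them being the archive.
print(projected_utilization(800, 1000, 100))          # no post-processing: same as actual usage
print(projected_utilization(800, 1000, 100, 1))       # one extra copy of the archive
print(should_stop(800, 1000, 100, extra_copies=1))
```

In this sketch, with `extra_copies=0` the projected value equals the actual usage, which is the behaviour expected in the comment above when neither option is enabled.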
I've created a custom Zimit image with more logs in the crawler when the problem arises and disabled detection of projected disk size (only actual size will be used for this run): https://farm.openzim.org/pipeline/a1702f94-7da3-476e-9d78-3d9c2090efcd
Just uploaded a new WARC which is supposed to be complete this time: https://tmp.kiwix.org/ci/test-warc/bbc.com_persian_2024_05_07/627e21ac-23e0-450b-be25-1aac8601a9a6_zimit.tar
Videos are not working at all. It looks like a crawling issue, because the WARC does not play in replayweb.page either.
Test page : https://www.bbc.com/persian/articles/c3g8wrnd08no
Command used: docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.1.3 crawl --failOnFailedSeed --behaviors "autoplay,autofetch,autoscroll" --url https://www.bbc.com/persian/articles/c3g8wrnd08no --mobileDevice "Pixel 2" --cwd /output --combineWARC --depth 0
Test WARC : https://tmp.kiwix.org/ci/test-warc/bbc-video-20240530.warc.gz
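To repeat the same test on other pages, the command above could be assembled programmatically. This is only a sketch: `build_crawl_command` is a hypothetical helper, the `/tmp/output` directory is a placeholder, and every flag and the image tag are copied from the command above rather than checked against other crawler versions.

```python
import shlex


def build_crawl_command(url, output_dir, image="webrecorder/browsertrix-crawler:1.1.3", depth=0):
    """Return the docker invocation as an argument list (nothing is executed here)."""
    return [
        "docker", "run",
        "-v", f"{output_dir}:/output",
        "--name", "crawlme", "--rm",
        image,
        "crawl",
        "--failOnFailedSeed",
        "--behaviors", "autoplay,autofetch,autoscroll",
        "--url", url,
        "--mobileDevice", "Pixel 2",
        "--cwd", "/output",
        "--combineWARC",
        "--depth", str(depth),
    ]


command = build_crawl_command(
    "https://www.bbc.com/persian/articles/c3g8wrnd08no",
    "/tmp/output",  # placeholder output directory
)
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to `subprocess.run(command, check=True)` would start the crawl; keeping it as a list avoids quoting problems with values such as "Pixel 2".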
So I confirm that for this ZIM there is only a problem with videos and with live streams (live coverages with text and other dynamic stuff).
The problem is that video is obviously quite significant in this ZIM, and I don't know how feasible it is to fix this on the crawler side.
This is a special request for the Zimit 2.0 project. Devs will handle this first to test the new scraper, and only once it's working will it be transferred to the content team.