openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: ir.voanews.com #833

Open benoit74 opened 6 months ago

benoit74 commented 6 months ago

This is a subtask of https://github.com/openzim/zim-requests/issues/826, used to track recipe progress one by one and avoid confusion.

Recipe already created here: https://farm.openzim.org/recipes/ir.voanews.com_persian

benoit74 commented 6 months ago

Task failed: https://farm.openzim.org/pipeline/4889a582-f24d-4364-acad-507c5d94ced6/debug

Cause is https://github.com/openzim/zimit/issues/266

I'm not restarting the recipe: it is clear that something needs to be changed upstream first.

benoit74 commented 3 months ago

Last WARC seems to be mostly OK at https://tmp.kiwix.org/ci/test-warc/ir.voanews.com_persian_2024-04-19/d4cbebe4-c8d3-4729-a083-bf5801beab92_zimit.tar

Conversion to ZIM is completing, but:

benoit74 commented 2 months ago

Sample WARC content (built by the crawler with --mobileDevice Pixel2) for https://ir.voanews.com/a/un-afghanistan-taliban-doha-meeting-women-rights/7681308.html (only content from gdb.voanews.com, which seems to be the image CDN, is listed; URLs are sorted alphabetically for convenience):

https://gdb.voanews.com/01000000-0aff-0242-5a3d-08dc93d604bf_tv_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-5a3d-08dc93d604bf_tv_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w144_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w33_r1.jpg
https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png
https://gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w408_r1_s.png
https://gdb.voanews.com/26a00de1-2f39-4630-a996-147d0df3e447_w144_r1.jpg
https://gdb.voanews.com/26a00de1-2f39-4630-a996-147d0df3e447_w33_r1.jpg
https://gdb.voanews.com/4315c626-5239-4be0-956b-62af07c4aea1_w144_r1.jpg
https://gdb.voanews.com/4315c626-5239-4be0-956b-62af07c4aea1_w33_r1.jpg
https://gdb.voanews.com/59b57984-6b9c-4fc1-af95-4dff47d441df_w144_r1.jpg
https://gdb.voanews.com/59b57984-6b9c-4fc1-af95-4dff47d441df_w33_r1.jpg
https://gdb.voanews.com/5add67da-1e19-48f6-98ca-3cd8d0bbbe01_w144_r1.jpg
https://gdb.voanews.com/5add67da-1e19-48f6-98ca-3cd8d0bbbe01_w33_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w144_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w33_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w144_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w33_r1.jpg
https://gdb.voanews.com/d651d75e-bae4-4045-905f-c32fc9bff412_w144_r1.jpg
https://gdb.voanews.com/d651d75e-bae4-4045-905f-c32fc9bff412_w33_r1.jpg

And for https://ir.voanews.com/a/iran-elections-opposition-dissidents-figures-boycott-call/7681344.html:

https://gdb.voanews.com/01000000-0a00-0242-bf4f-08dc99a71c65_tv_w144_r1.jpg
https://gdb.voanews.com/01000000-0a00-0242-bf4f-08dc99a71c65_tv_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-7360-08db49c8d311_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-7360-08db49c8d311_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-9619-08dc585bd454_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-a286-08dc991b6a34_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-a286-08dc991b6a34_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg
https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w408_r1_s.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w144_r1.jpg
https://gdb.voanews.com/01000000-c0a8-0242-2cac-08dc1e96319d_w33_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_w144_r1.jpg
https://gdb.voanews.com/6d70dc8b-a197-4b88-be27-9a5379886d04_w33_r1.jpg
https://gdb.voanews.com/89e0a08f-6e2b-407f-8b24-44c9d6a6f4e1_w144_r1.jpg
https://gdb.voanews.com/89e0a08f-6e2b-407f-8b24-44c9d6a6f4e1_w33_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w144_r1.jpg
https://gdb.voanews.com/AC9B16E7-73CC-4C5B-BACB-CB0AC37D2A41_cx0_cy7_cw0_w33_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w144_r1.jpg
https://gdb.voanews.com/b0095b01-57d9-46b4-9b00-0c6d01fa9277_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_cx0_cy6_cw0_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_cx0_cy6_cw0_w33_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w144_r1.jpg
https://gdb.voanews.com/b9f301d7-884c-4194-83ff-fd95cb1e5b96_w33_r1.jpg

What we see is that:

When opening the same pages on desktop, the site tries to load a different resolution (and adds a _s suffix to the URL):

When clicking the "high res" button available on the page, it loads a different resolution and the name pattern is slightly different:

I did not find cases where multiple "big" images were present in a single article; there was always exactly one "big" image.

The upstream server is in fact resizing images on demand: you can request any resolution, and nothing is pre-computed. The _n, _st and _r1 suffixes are flags used to enable or disable some watermarks and overlays (e.g. _st activates a subtitle in the upper right corner). The _cx0_cy6_cw0 suffix is used to change the center of the image (probably cropping a few pixels).
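To make the naming scheme concrete, here is a minimal sketch that splits an image name into its parts. `parseImageName` is a hypothetical helper, not part of zimit or warc2zim, and it assumes the image id never contains an underscore (which holds for every sample listed above).

```javascript
// Hypothetical helper: split a gdb.voanews.com image name into the image id,
// the requested width and the remaining flags (r1, s, st, n, cx0, ...).
// Assumption: the id itself never contains an underscore.
function parseImageName(url) {
  const name = url.split('/').pop(); // e.g. "<id>_w144_r1.jpg"
  const dot = name.lastIndexOf('.');
  const ext = name.slice(dot + 1);
  const [id, ...tokens] = name.slice(0, dot).split('_');
  // The width token is "w" followed only by digits; "cw0" does not qualify.
  const widthToken = tokens.find((t) => /^w\d+$/.test(t));
  return {
    id,
    ext,
    width: widthToken ? Number(widthToken.slice(1)) : null,
    flags: tokens.filter((t) => t !== widthToken),
  };
}

console.log(
  parseImageName(
    'https://gdb.voanews.com/01000000-0aff-0242-a286-08dc991b6a34_cx0_cy6_cw0_w144_r1.jpg',
  ),
);
```

Two URLs with the same id but different width or flags therefore point at the same source image, which is what any fuzzy rule has to exploit.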

It is also important to note that creating fuzzy rules for this is made more complicated by the fact that the lowest resolutions are fetched first: a fuzzy rule covering all resolutions would technically work, but it would store only crappy 33-pixel images in the ZIM, and the website would be significantly degraded.
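The problem can be illustrated with a small simulation. This is only a sketch: it assumes entries are keyed by their fuzzy-rewritten URL and that the first record stored under a key is kept. The `catchAll` rule and the `store` function are made up for the illustration and are not warc2zim code.

```javascript
// Hypothetical catch-all rule: collapse every width/flag combination of an
// image onto a single key.
const catchAll = (url) => url.replace(/_w\d+.*(\.[^.]+)$/, '_any$1');

// Assumption: the first record stored under a given key wins.
function store(urlsInFetchOrder) {
  const entries = new Map();
  for (const url of urlsInFetchOrder) {
    const key = catchAll(url);
    if (!entries.has(key)) entries.set(key, url);
  }
  return entries;
}

// Lowest resolution is fetched first, as observed in the WARC.
const stored = store([
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w33_r1.jpg',
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w144_r1.jpg',
  'gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w408_r1_s.jpg',
]);

console.log([...stored.values()]);
```

Under these assumptions only the 33-pixel variant survives, and every later request for the 144 or 408 pixel version would be served that tiny image.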

So the conclusion is that:

Do we need to wait for https://github.com/openzim/warc2zim/issues/271? (I'm not even sure exactly how this would solve the issue.)

benoit74 commented 2 months ago

I've identified fuzzy rules which might work indeed:

  - pattern: gdb.voanews.com/(.*_w33_.*)
    replace: gdb.voanews.com.fuzzy.replayweb.page/\1
  - pattern: gdb.voanews.com/(.*_w144_.*)
    replace: gdb.voanews.com.fuzzy.replayweb.page/\1
  - pattern: gdb.voanews.com/(.*_w250_.*)
    replace: gdb.voanews.com.fuzzy.replayweb.page/\1
  - pattern: gdb.voanews.com/(.*)_w.*(\..*?)
    replace: gdb.voanews.com.fuzzy.replayweb.page/\1_high\2
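How these rules are meant to behave can be sketched in plain JavaScript. This is not the warc2zim implementation: `applyRulesSketch` is a hypothetical stand-in, and it assumes rules are tried in order, each pattern must match the whole path, the first match wins, and unmatched paths are returned unchanged.

```javascript
// The four rules above, transcribed as data.
const target = 'gdb.voanews.com.fuzzy.replayweb.page/';
const rules = [
  { pattern: 'gdb.voanews.com/(.*_w33_.*)', replace: `${target}\\1` },
  { pattern: 'gdb.voanews.com/(.*_w144_.*)', replace: `${target}\\1` },
  { pattern: 'gdb.voanews.com/(.*_w250_.*)', replace: `${target}\\1` },
  { pattern: 'gdb.voanews.com/(.*)_w.*(\\..*?)', replace: `${target}\\1_high\\2` },
];

// Hypothetical stand-in for the rule engine (assumptions: ordered rules,
// whole-path match, first match wins).
function applyRulesSketch(path) {
  for (const { pattern, replace } of rules) {
    const regex = new RegExp(`^${pattern}$`);
    if (regex.test(path)) {
      // Turn the \1 back-references used in the rules into JavaScript's $1.
      return path.replace(regex, replace.replace(/\\(\d)/g, '$$$1'));
    }
  }
  return path;
}

console.log(
  applyRulesSketch('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1023_r1_s.png'),
);
```

The three small widths keep their own entries, while every other width collapses onto a single `_high` key, so the single "big" image of an article is found whatever resolution the page asks for.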

Associated JS tests:


test('gdb.voanews.com_1', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w1023_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_high.png',
  );
});

test('gdb.voanews.com_2', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1597_n_r1_st_s.jpg'),
    'gdb.voanews.com.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_high.jpg',
  );
});

test('gdb.voanews.com_3', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w33_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w33_r1_s.png',
  );
});

test('gdb.voanews.com_4', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w144_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w144_r1_s.png',
  );
});

test('gdb.voanews.com_5', (t) => {
  t.is(
    applyFuzzyRules('gdb.voanews.com/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png'),
    'gdb.voanews.com.fuzzy.replayweb.page/10050000-0aff-0242-f2d3-08da53b2fd74_w250_r1_s.png',
  );
});

Unfortunately, they do not work due to another limitation in warc2zim (I'll open a ticket right now)

benoit74 commented 1 day ago

Marking as done in the zimit2 project, since we are not going to complete this task as part of the project.