Closed by benoit74 2 months ago
Impacted by upstream issue for now: https://github.com/openzim/warc2zim/issues/188
Now impacted by https://github.com/openzim/warc2zim/issues/261
Issues mentioned above have been solved / are not occurring anymore.
The problem now is that we get blocked by Cloudflare after some time: at some point, all requests end with 403 errors. We are getting in touch with people at iranwire.com to find a solution (IP whitelisting, ...).
Some of our worker IPs (ondemand IPv4 and IPv6, athena18 IPv4 and IPv6 and pixelmemory IPv4) have been whitelisted by iranwire.com.
Crawl completed successfully and produced the WARC:
Conversion to ZIM failed due to a known bug in 2.0.1, which has since been fixed in 2.0.2.
What we now see is that:
^.*ZimPath\((?:t\.me|www\.facebook|www\.reddit|twitter\.com|api\.whatsapp|iranwire\.com\/login|iranwire\.com\/register|www\.addtoany).*$
, only 10% of the log remains; and if we focus on iranwire.com, we have about 10k unique resources missing, most of them images, which is linked to the next item.
All of this seems fixable with some engineering effort.
For the record, see https://kiwix.freshdesk.com/a/tickets/71198 for some details around IP whitelisting
After some investigation, it looks like I was wrong in my previous analysis of why images are missing. The autofetch behavior is supposed to grab them all, so I don't understand why the WARC is incomplete. I will restart the recipe with a low limit on the number of pages to fetch, just to confirm how it is working (or not).
I've also investigated the video issue. I managed to write a custom behavior that triggers playback of the YouTube video; however, it does not wait for the player to really start, and in the end there is no video in the WARC.
For reference, this is the custom behavior I used:
// custom behavior for iranwire.com website: automatically start the videos since Youtube player
// is not inside the DOM until play button is clicked.
class IranWireCom {
  static get id() {
    return "IranWireCom";
  }

  static isMatch() {
    const pathRegex = /https:\/\/iranwire\.com\//;
    return !!window.location.href.match(pathRegex);
  }

  static init() {
    return {
      state: { playbuttons: 0 },
    };
  }

  async* run(ctx) {
    const { xpathNodes, scrollAndClick, getState } = ctx.Lib;
    const playButtons = xpathNodes("//*[contains(@class,'video-component-play')]");
    for await (const playButton of playButtons) {
      scrollAndClick(playButton);
      yield getState(ctx, "Video play button", "playbuttons");
    }
    yield "IranWireCom Behavior Complete";
  }
}
When placed inside a custom-behaviors subfolder, it is simply activated by passing -v $PWD/custom-behaviors:/custom-behaviors to the docker command and --customBehaviors /custom-behaviors/ to the crawler, not forgetting to activate the siteSpecific behavior with --behaviors "siteSpecific,...".
I tried to change the order of behaviors to check if it might have an impact but without success.
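One possible direction for the "does not wait for the player to really start" problem is to poll for a condition after each click. The helper below is only a sketch and has not been tried against iranwire.com: `waitForCondition` is a hypothetical name, not part of the behaviors library, and the selector in the usage comment is a guess at what the YouTube embed looks like once it has been injected.

```javascript
// Hypothetical helper (not part of the behaviors library): poll `predicate`
// until it returns a truthy value or `timeoutMs` has elapsed.
// Resolves to true if the condition was met, false on timeout.
async function waitForCondition(predicate, timeoutMs = 10000, intervalMs = 250) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await predicate()) {
      return true;
    }
    if (Date.now() >= deadline) {
      return false;
    }
    // Sleep a little before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Possible use inside run(), right after the click (selector is an assumption):
//   await waitForCondition(
//     () => document.querySelector("iframe[src*='youtube']") !== null
//   );
```

Whether waiting for the iframe is enough for the video itself to end up in the WARC is an open question; the player may need to actually stream data before the crawler records anything.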
@Popolechien @kelson42 would it make any sense to create a ZIM without videos, at least until the issue around videos is solved?
Yes, as a temporary solution.
OK, so I finally found the problem with the images: for some reason, JS code is adding an inline visibility: hidden style to the first image of every article. I am struggling to find the JS responsible for this, so for now we will live with a CSS trick/hack to restore the original visibility.
I've created the custom CSS to work around this bug and to hide ads, social links and search boxes.
I've also reconfigured the recipe to include a few useful pages which are not under the /fa path but are still in Farsi as far as I can tell (authors, petitions and questions).
The last task execution, limited to a few pages (100), produced a ZIM which seems OK. I will relaunch the recipe on all pages and we will see what comes out in about a week.
The last task did not produce a full ZIM at all, because we now need to add the base URL to the include regexp.
Petitions pages are not displaying any image due to https://github.com/openzim/zimit/issues/316
So I've modified the include regex to not include them for now: iranwire.com(?:$|\/$|\/author\/|\/petition\/|\/questions\/|\/fa\/). I've relaunched the task, first with depth set to 1, to confirm that the include regex and the custom CSS are OK.
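To get a feel for what such an include regex lets through before launching a crawl, it can be tried against a handful of URLs. This is a rough sketch only: the regex is copied from the comment above, the sample paths are made up for illustration, and it assumes the crawler applies the pattern as an unanchored JavaScript regular expression against the full URL.

```javascript
// Include regex copied from the comment above.
const include = /iranwire.com(?:$|\/$|\/author\/|\/petition\/|\/questions\/|\/fa\/)/;

// Sample URLs; the paths are invented for illustration.
const samples = [
  "https://iranwire.com",
  "https://iranwire.com/",
  "https://iranwire.com/fa/features/1/",
  "https://iranwire.com/author/someone/",
  "https://iranwire.com/en/features/1/",
  "https://iranwire.com/login",
];

// The regex has no "g" flag, so test() is stateless and safe to call repeatedly.
function inScope(url) {
  return include.test(url);
}

for (const url of samples) {
  console.log(inScope(url) ? "include" : "skip   ", url);
}
```

With this pattern the base URL (with or without a trailing slash) and the listed path prefixes are in scope, while other top-level paths are skipped.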
The new recipe configuration is now scraping tons of "stupid" pages like https://info@iranwire.com/fa/features/38626/ with an info user.
I've canceled the recipe, modified the include setting to not include these pages and re-requested the recipe.
The link you gave looks like a regular entry - what is "stupid" about it?
Thank you for asking! The "stupid" thing is that the URL is info@iranwire.com instead of iranwire.com. This looks like a bug on their side on a random page, which suddenly duplicates all entries to fetch + store in the ZIM (once for info@iranwire.com and once for iranwire.com).
What I said is not totally correct. This is stupid because the info@ part is dropped by warc2zim anyway, so in the end we will store only one entry and all links with info@ will be rewritten without it. So we are "only" losing time fetching too many pages (but that is already a lot of time lost).
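The duplication can be illustrated with the standard WHATWG URL API: info@ is parsed as the userinfo part of the URL, so both variants point at the same host and path but differ as strings. This is only an illustration, not how warc2zim implements it; `stripUserInfo` is a hypothetical helper.

```javascript
// Hypothetical helper: drop the userinfo ("info@") part of a URL.
function stripUserInfo(rawUrl) {
  const url = new URL(rawUrl);
  url.username = "";
  url.password = "";
  return url.href;
}

const withUser = "https://info@iranwire.com/fa/features/38626/";
const plain = "https://iranwire.com/fa/features/38626/";

console.log(new URL(withUser).username); // "info"
console.log(new URL(withUser).hostname); // "iranwire.com"

// As plain strings the two URLs differ, so a crawler queues both...
console.log(withUser === plain); // false

// ...but once the userinfo is dropped they collapse into a single entry.
console.log(stripUserInfo(withUser) === stripUserInfo(plain)); // true
```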
It looks like last ZIM is ready for review: https://dev.library.kiwix.org/content/iranwire-com_far_all_2024-07/
What is known to not work:
@Popolechien please review the ZIM to identify whatever needs to be fixed before communicating the ZIM to the client
Moved to prod: we have not received any further feedback for weeks, so it is assumed to be OK.
This is a subtask of https://github.com/openzim/zim-requests/issues/826 for tracking recipe progress one by one and avoid confusion.
Recipe already created here: https://farm.openzim.org/recipes/iranwire.com_persian