New request: iranwire.com

benoit74 commented 8 months ago

This is a subtask of https://github.com/openzim/zim-requests/issues/826, opened to track recipe progress one by one and avoid confusion.

Website URL: https://iranwire.com/fa/

Recipe already created here: https://farm.openzim.org/recipes/iranwire.com_persian

benoit74 commented 8 months ago

Impacted by upstream issue for now: https://github.com/openzim/warc2zim/issues/188

kelson42 commented 6 months ago

Now impacted by https://github.com/openzim/warc2zim/issues/261

benoit74 commented 5 months ago

Issues mentioned above have been solved / are not occurring anymore.

The problem now is that we get blocked by Cloudflare after some time: it looks like all requests end with 403 errors at some point. We are getting in touch with the iranwire.com team to find a solution (IP whitelisting, ...).

benoit74 commented 4 months ago

Some of our worker IPs (ondemand IPv4 and IPv6, athena18 IPv4 and IPv6 and pixelmemory IPv4) have been whitelisted by iranwire.com.

Crawl completed successfully and produced the WARC:

Conversion to ZIM failed due to a known bug in 2.0.1, which has since been fixed in 2.0.2.

What we now see is that:

the crawl seems to be mostly complete, we do not see many resources missing: once we remove noise from twitter/facebook/addtoany with something like ^.*ZimPath\((?:t\.me|www\.facebook|www\.reddit|twitter\.com|api\.whatsapp|iranwire\.com\/login|iranwire\.com\/register|www\.addtoany).*$, only 10% of the log remains; and if we focus on iranwire.com, we have about 10k unique resources missing, most of them images, which is linked to the next item
we miss a significant number of images; implementing https://github.com/openzim/zimit/issues/316 would probably solve the problem
the videos are missing from the WARC, because the YouTube player is "hidden" behind a picture click event, i.e. it is dynamically added to the page when the user clicks the video; the autoplay behavior hence fails to find the video and does not trigger

All of this seems fixable with some engineering effort.
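
To illustrate the log filtering mentioned above, here is a minimal sketch that applies the quoted regex to a list of log lines and reports what remains. The sample log lines and the filterMissingLog helper are hypothetical; the real log format may differ.

```javascript
// Regex quoted in the comment above: matches log lines whose ZimPath points to
// social / login hosts that are not relevant for the review.
const NOISE = /^.*ZimPath\((?:t\.me|www\.facebook|www\.reddit|twitter\.com|api\.whatsapp|iranwire\.com\/login|iranwire\.com\/register|www\.addtoany).*$/;

// Drop the noise lines and report which share of the log remains.
function filterMissingLog(lines) {
  const remaining = lines.filter((line) => !NOISE.test(line));
  return {
    remaining,
    ratio: lines.length === 0 ? 0 : remaining.length / lines.length,
  };
}

// Hypothetical log lines, for illustration only.
const sampleLog = [
  "Missing resource ZimPath(twitter.com/intent/tweet?url=x)",
  "Missing resource ZimPath(www.facebook.com/sharer.php?u=x)",
  "Missing resource ZimPath(iranwire.com/login/)",
  "Missing resource ZimPath(iranwire.com/media/images/example.jpg)",
];

const result = filterMissingLog(sampleLog);
console.log(result.remaining.length, result.ratio);
```

Only the last sample line survives the filter here, since it is the only one pointing at an iranwire.com resource that is not a login or register page.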

benoit74 commented 4 months ago

For the record, see https://kiwix.freshdesk.com/a/tickets/71198 for some details around IP whitelisting

benoit74 commented 4 months ago

After some investigation, it looks like I was wrong in my previous analysis of why images are missing. The autofetch behavior is supposed to grab them all, so I don't understand why the WARC is incomplete. I will restart the recipe with a low limit on the number of pages to fetch, just to confirm how it is working (or not).

benoit74 commented 4 months ago

I've also investigated the video issue. I managed to write a custom behavior that triggers playback of the YouTube video; however, it does not wait for the player to actually start, and in the end there is no video in the WARC.

For reference, this is the custom behavior I used:

// custom behavior for iranwire.com website: automatically start the videos since Youtube player
// is not inside the DOM until play button is clicked.

class IranWireCom {

    static get id() {
        return "IranWireCom";
    }

    static isMatch() {
        const pathRegex = /https:\/\/iranwire\.com\//;
        return !!window.location.href.match(pathRegex);
    }

    static init() {
        return {
            state: { playbuttons: 0 },
        };
    }

    async* run(ctx) {
        const { xpathNodes, scrollAndClick, getState } = ctx.Lib;
        const playButtons = xpathNodes("//*[contains(@class,'video-component-play')]");
        for await (const playButton of playButtons) {
            scrollAndClick(playButton);
            yield getState(ctx, "Video play button", "playbuttons");
        }
        yield "IranWireCom Behavior Complete";
    }
}

When placed inside a custom-behaviors subfolder, it is activated simply by passing -v $PWD/custom-behaviors:/custom-behaviors to the docker command and --customBehaviors /custom-behaviors/ to the crawler, not forgetting to enable the siteSpecific behavior with --behaviors "siteSpecific,...".

I tried changing the order of the behaviors to see whether it had an impact, but without success.
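
One possible direction for the "does not wait for the player" problem is to poll the DOM after each click until the player shows up. The sketch below is only a generic polling helper; the waitFor name, its options and the iframe selector in the comment are assumptions, and nothing here has been confirmed to get the video into the WARC.

```javascript
// Poll `predicate` until it returns a truthy value or `timeoutMs` elapses.
// Resolves with the truthy value, or with null on timeout.
async function waitFor(predicate, { timeoutMs = 10000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = predicate();
    if (value) return value;
    if (Date.now() >= deadline) return null;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Hypothetical use inside run(), right after scrollAndClick(playButton):
//   const player = await waitFor(() =>
//     document.querySelector("iframe[src*='youtube']"));
//   // player is null if no matching iframe appeared before the timeout
```

Even if the iframe is found, the crawler may still move on before the media is actually requested, so this would only be a first step.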

@Popolechien @kelson42 would it make any sense to create a ZIM without videos, at least until the issue around videos is solved?

kelson42 commented 4 months ago

> @Popolechien @kelson42 would it make any sense to create a ZIM without videos, at least until the issue around videos is solved?

Yes, as a temporary solution.

benoit74 commented 4 months ago

OK, so I finally found the cause of the image problem: for some reason, JS code adds an inline visibility: hidden style to the first image of every article. I have not managed to find the JS responsible for this, so for now we will live with a CSS trick/hack to restore the original visibility.

I've created custom CSS to work around this bug and to hide ads, social links and search boxes.

I've also reconfigured the recipe to include a few useful pages which are not under the /fa path but are still in Farsi as far as I can tell (authors, petitions and questions).

The last task execution, limited to a few pages (100), produced a ZIM which seems OK. I will relaunch the recipe on all pages and we will see what comes out in about a week.

benoit74 commented 4 months ago

The last task did not produce a full ZIM at all, because we now need to add the base URL to the include regexp.

Petition pages do not display any images due to https://github.com/openzim/zimit/issues/316

benoit74 commented 4 months ago

The new recipe configuration is now scraping tons of "stupid" pages like https://info@iranwire.com/fa/features/38626/ with an info user. I've canceled the recipe, modified the include setting to exclude these pages and re-requested the recipe.
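
As a rough illustration of such an include setting, the sketch below anchors the pattern on the base URL so that a userinfo part like info@ no longer matches. The exact pattern used in the recipe is not shown in this thread; the path list here is an assumption based on the sections named above, and isInScope is a hypothetical helper.

```javascript
// Hypothetical include pattern, anchored on the base URL. Because the host
// must directly follow "https://", URLs with a userinfo part such as "info@"
// do not match. The path list (fa, authors, petitions, questions) is assumed.
const INCLUDE = /^https:\/\/iranwire\.com\/(?:fa|authors|petitions|questions)\//;

function isInScope(url) {
  return INCLUDE.test(url);
}

console.log(isInScope("https://iranwire.com/fa/features/38626/"));      // true
console.log(isInScope("https://info@iranwire.com/fa/features/38626/")); // false
```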

Popolechien commented 4 months ago

The link you gave looks like a regular entry - what is "stupid" about it?

benoit74 commented 4 months ago

Thank you for asking! The "stupid" thing is that the URL points to info@iranwire.com instead of iranwire.com. This looks like a bug on their side, on some random page, which suddenly duplicates all entries to fetch and store in the ZIM (once for info@iranwire.com and once for iranwire.com).

benoit74 commented 4 months ago

What I said is not totally correct. This is stupid because the info@ part is dropped by warc2zim anyway, so in the end we will store only one entry and all links with info@ will be rewritten without it. So we are "only" losing time fetching too many pages (but that is already a lot of time lost).
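
To make the duplication concrete, here is a small sketch using the standard URL API. It is not warc2zim's actual code, only an illustration of why two crawled URLs that differ by their userinfo part collapse into a single entry; stripUserInfo is a hypothetical helper.

```javascript
// Return the URL with any "user:password@" part removed.
function stripUserInfo(rawUrl) {
  const url = new URL(rawUrl);
  url.username = "";
  url.password = "";
  return url.href;
}

// Both variants get fetched by the crawler...
const crawled = [
  "https://iranwire.com/fa/features/38626/",
  "https://info@iranwire.com/fa/features/38626/",
];

// ...but they map to the same entry once the userinfo part is dropped.
const uniqueEntries = new Set(crawled.map(stripUserInfo));
console.log(uniqueEntries.size);
```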

benoit74 commented 3 months ago

It looks like last ZIM is ready for review: https://dev.library.kiwix.org/content/iranwire-com_far_all_2024-07/

What is known to not work:

videos and audio: most have been made inaccessible, but a few remain; I consider that it is not feasible to make them work within the current timeframe / budget
questions like the ones on iranwire.com/questions/legal/ are not working => upstream bug is https://github.com/openzim/warc2zim/issues/363 and will be solved quickly

@Popolechien please review the ZIM to identify whatever needs to be fixed before communicating the ZIM to the client

benoit74 commented 2 months ago

Moved to prod: we have received no further feedback for weeks, so it is assumed to be OK.

benoit74 commented 2 months ago

ZIM is ready at https://library.kiwix.org/viewer#iranwire-com_far_all or https://download.kiwix.org/zim/zimit/iranwire-com_far_all_2024-09.zim

openzim / zim-requests

New request: iranwire.com #831