openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

External content in iframes within article page is not scraped in MDWiki ZIMs #1984

Closed: Jaifroid closed this issue 8 months ago

Jaifroid commented 9 months ago

The latest mdwiki_en_all_maxi_2024-02.zim contains content in iframes taken from the MDWiki cloud site https://owidm.wmcloud.org. This is new AFAIK. The content isn't scraped and of course it is blocked by the sandbox (only content from the ZIM is permitted to be displayed for security reasons and CORS). See for example the "Water Pollution" article in the screenshot below. The iframe code is like this:

<iframe src="https://owidm.wmcloud.org/grapher/number-of-deaths-from-diarrheal-diseases-in-people-aged-70-and-older-by-attributable-to-risk-factor" loading="lazy" class="owid-frame"></iframe>

I've seen a number of pages with this blocked content box that stands out like a sore thumb:

[Screenshot: blocked iframe content box in the "Water pollution" article]

You can also see it on Kiwix Serve here: https://library.kiwix.org/content/mdwiki_en_all_maxi_2024-02/A/Water_pollution, where it's also blocked.

@WikiDocJames and @tim-moody, I'm not sure how you want to deal with these so that the content shows up in the ZIM and in offline apps, rather than as a blocked iframe. It would probably be best if your backend could proxy the graphs (or simply fetch them via AJAX / Fetch) and display the fetched content in normal divs, tables or images served locally. Iframes with external content are almost impossible to deal with due to CORS.

kelson42 commented 9 months ago

Hiding this seems the most straightforward approach. Other approaches implying a fix in MWoffliner seem highly hypothetical to me at this stage.

Jaifroid commented 9 months ago

I agree that there's nothing really we can do in mwOffliner other than hiding / removing the iframes (if there is no fix forthcoming in the MDWiki source / server).

tim-moody commented 9 months ago

The owidm feature is not new, but this is not how things are supposed to work.

To quote from the page https://mdwiki.org/wiki/WikiProjectMed:OWID

The "ourworldindata" extension adds the class "mw-kartographer-container" to the iframe tag so that mwoffliner will remove the entire element making it visible online but not offline. The "onlyoffline" class, which has "display:none", is applied to the static version of the OWID map which hides it online, but mwoffliner removes that class which causes it to display offline.
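The mechanism quoted above can be illustrated with a minimal sketch. This is not mwoffliner's actual code: the function name `prepareForOffline`, the two class lists, and the sample markup (including the wrapper element carrying `onlyoffline`) are assumptions made for illustration, and the string-based matching is chosen only so that the example runs without a DOM library.

```javascript
// Hypothetical sketch of the behaviour described in the quote; NOT mwoffliner code.

// Elements carrying one of these classes are dropped from the offline copy (assumed list).
const REMOVE_CLASSES = ['mw-kartographer-container'];
// These classes hide content online; stripping them makes the content visible offline.
const UNHIDE_CLASSES = ['onlyoffline'];

// Return the class names found in the class attribute of a single tag.
function classListOf(tag) {
  const m = /\bclass\s*=\s*(["'])(.*?)\1/i.exec(tag);
  return m ? m[2].split(/\s+/).filter(Boolean) : [];
}

function prepareForOffline(html) {
  // Step 1: remove whole iframes whose opening tag carries a class from REMOVE_CLASSES.
  let out = html.replace(/<iframe\b[^>]*>[\s\S]*?<\/iframe>\s*/gi, (el) => {
    const openTag = el.slice(0, el.indexOf('>') + 1);
    return classListOf(openTag).some((c) => REMOVE_CLASSES.includes(c)) ? '' : el;
  });
  // Step 2: strip the "onlyoffline" class wherever a class attribute mentions it.
  out = out.replace(/\bclass\s*=\s*(["'])(.*?)\1/gi, (attr, quote, value) => {
    const kept = value.split(/\s+/).filter((c) => c && !UNHIDE_CLASSES.includes(c));
    return `class=${quote}${kept.join(' ')}${quote}`;
  });
  return out;
}

// Assumed sample markup: an interactive iframe followed by its static fallback.
const page =
  '<iframe src="https://owidm.wmcloud.org/grapher/asthma-prevalence" class="mw-kartographer-container"></iframe>' +
  '<div class="onlyoffline thumb"><img src="asthma-prevalence.png"></div>';

console.log(prepareForOffline(page));
```

With this sketch, an iframe whose class is `owid-frame` instead of `mw-kartographer-container` passes through step 1 untouched, which matches the symptom reported in this issue.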

So the question is what has gone wrong.

On https://mdwiki.org/w/index.php?title=Diphtheria&action=edit&section=10 I see the onlyoffline template

<!-- Shows online and is interactive -->
{{ourworldindatamirror|share-of-children-immunized-dtp3}}
<!-- Shows offline and is still -->
{{onlyoffline|[[File:Share-of-children-immunized-dtp3 (1).png|thumb|400px]]}}

https://mdwiki.org/w/index.php?title=Water_pollution&action=edit has the same code

<!-- Shows online and is interactive -->
{{ourworldindatamirror|number-of-deaths-from-diarrheal-diseases-in-people-aged-70-and-older-by-attributable-to-risk-factor}}
<!-- Shows offline and is still -->
{{onlyoffline|[[File:Number-of-deaths-from-diarrheal-diseases-in-people-aged-70-and-older-by-attributable-to-risk-factor.png|thumb|400px]]}}

Jaifroid commented 9 months ago

Ah, that explains it. I had never seen it before in the MDWiki ZIMs despite extensive testing for the WikiMed app, but I saw them almost immediately with this month's scrape. However, I can't be certain of the exact month in which these began showing up.

WikiDocJames commented 9 months ago

So we had built templates such that the interactive version would show live on MDWiki and still images would show offline.

This is the template that was achieving that https://mdwiki.org/w/index.php?title=Template:Onlyoffline&action=edit

And it was working well for the longest time.

In the example you show https://library.kiwix.org/content/mdwiki_en_all_maxi_2024-02/A/Water_pollution

The static version is below the blocked content. Can we just suppress the blocked content? It used to just not show at all.


-- James Heilman MD, CCFP-EM, Wikipedian

Jaifroid commented 9 months ago

@WikiDocJames It sounds like suppressing the blocked content is how it's meant to work. We just need to determine whether it should be fixed here (MWOffliner) or in the API.

tim-moody commented 9 months ago

> We just need to determine whether it should be fixed here (MWOffliner) or in the API.

It was definitely working in mwoffliner previously. I don't see how it can be the API, because the API is not active when offline. My understanding is that this was actually old mwoffliner functionality that was added for the maps (kartographer) extension (hence the 'mw' in the class name), and we took advantage of it. I believe we asked for the onlyoffline tag functionality to support owid.

tim-moody commented 9 months ago

please note https://github.com/search?q=repo%3Aopenzim%2Fmwoffliner%20onlyoffline&type=code

Jaifroid commented 9 months ago

> I don't see how It can be the API because it is not active when offline.

I just meant the API that MWOffliner accesses to get the content at scrape time. I believe there was a recent switch away from the deprecated mobile endpoint, but I'm not familiar enough with all this to help pinpoint the cause (I work mainly on the JS readers and their backend). I'm sure @kelson42 can clarify or else ping the right dev!

tim-moody commented 9 months ago

OK. That's possible. Another thing that occurred to me is that WMF has obsoleted graphics on Wikipedia pages, and that includes the old map extension, I believe. But then I would expect it to be broken on mdwiki.org as well.

Jaifroid commented 9 months ago

Another example (for testing) is the article "Physical dependence". And here is a regular expression that removes the offending iframe(s) from the html (though I know mwOffliner has its own methods, and I'm not sure why they're not working):

html = html.replace(/<iframe\b[^>]+class=["'][^"']*?owid-frame(?:[^<]|<(?!\/iframe>))+<\/iframe>\s*/ig, '');

FYI I'm adding this as a temporary workaround to the WikiMed desktop app code, so that I can release this month. I think the Android app may still be blocked by https://github.com/kiwix/kiwix-android/issues/3511.
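The regular expression above can be tried against markup like the iframe quoted earlier in this thread. In the sketch below, the wrapper function `stripOwidFrames` and the sample HTML (including the `example.org` iframe and the image file name) are made up for illustration; only the expression itself comes from the comment.

```javascript
// Regular expression from the comment above, unchanged.
const OWID_IFRAME = /<iframe\b[^>]+class=["'][^"']*?owid-frame(?:[^<]|<(?!\/iframe>))+<\/iframe>\s*/ig;

// Hypothetical wrapper: remove every OWID iframe from an HTML string.
function stripOwidFrames(html) {
  return html.replace(OWID_IFRAME, '');
}

// Assumed sample: one OWID iframe, its static image, and an unrelated iframe.
const sample =
  '<p>Before</p>' +
  '<iframe src="https://owidm.wmcloud.org/grapher/asthma-prevalence" loading="lazy" class="owid-frame"></iframe>\n' +
  '<img src="asthma-prevalence.png">' +
  '<iframe src="https://example.org/other" class="unrelated"></iframe>';

console.log(stripOwidFrames(sample));
```

Only the iframe whose class attribute starts with or contains `owid-frame` is removed, together with trailing whitespace; the static image and the unrelated iframe are left in place.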

tim-moody commented 9 months ago

OK. I did a little digging, and the problem may be at our end.

https://iiab.me/kiwix/mdwiki_en_all_maxi_2022-11/A/Asthma was working

but https://iiab.me/kiwix/mdwiki_en_all_2023-10/A/Asthma was not

The HTML of the current https://mdwiki.org/wiki/Asthma is

<iframe src="https://owidm.wmcloud.org/grapher/asthma-prevalence" loading="lazy" class="owid-frame"></iframe>

I think it should be class="mw-kartographer-container"

It will be less complicated to add the class owid-frame to the mwoffliner blacklist than to change the class to mw-kartographer-container, and it will better reflect reality.

I have created a PR #1990
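The kind of check described in this comment can be sketched as a small diagnostic that lists external iframes whose classes are not covered by a class blacklist. The helper `findUncoveredIframes` and the blacklist arrays below are hypothetical and are not taken from mwoffliner or from the PR; they only illustrate why an iframe with class `owid-frame` slips through until that class is listed.

```javascript
// Hypothetical diagnostic; not part of mwoffliner.
// Returns external iframes that none of the blacklisted classes would catch.
function findUncoveredIframes(html, blacklist) {
  const found = [];
  for (const [tag] of html.matchAll(/<iframe\b[^>]*>/gi)) {
    const src = (/\bsrc\s*=\s*(["'])(.*?)\1/i.exec(tag) || [])[2] || '';
    const classes = ((/\bclass\s*=\s*(["'])(.*?)\1/i.exec(tag) || [])[2] || '')
      .split(/\s+/)
      .filter(Boolean);
    // Treat absolute http(s) and protocol-relative URLs as external.
    const external = /^(?:https?:)?\/\//i.test(src);
    if (external && !classes.some((c) => blacklist.includes(c))) {
      found.push({ src, classes });
    }
  }
  return found;
}

// Markup as quoted in this comment.
const asthma =
  '<iframe src="https://owidm.wmcloud.org/grapher/asthma-prevalence" loading="lazy" class="owid-frame"></iframe>';

// Assumed blacklist before and after adding owid-frame.
console.log(findUncoveredIframes(asthma, ['mw-kartographer-container']).length);
console.log(findUncoveredIframes(asthma, ['mw-kartographer-container', 'owid-frame']).length);
```

The first call reports the OWID iframe as uncovered and the second reports nothing, which is the difference that adding `owid-frame` to a blacklist is meant to make.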

Jaifroid commented 9 months ago

Thanks, @tim-moody. FYI, the new WikiMed release contains the workaround so that the included February MDWiki archive will display correctly.

kelson42 commented 8 months ago

I guess this has been fixed by https://github.com/openzim/mwoffliner/pull/1990. Thx @tim-moody