openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: Minecraft Wiki (zh) #755

Open TripleCamera opened 9 months ago

TripleCamera commented 9 months ago

Please use the following format for a ZIM creation request (and delete unnecessary information)

TripleCamera commented 8 months ago

Hi. May I ask how long it usually takes to fulfill a request? Some of our readers suffer from poor Internet connection, and the offline version might be the only solution.

RavanJAltaie commented 8 months ago

Hi, the recipe is created https://farm.openzim.org/recipes/minecraftwiki_zh_all I'll update the library link here once ready

TripleCamera commented 8 months ago

@RavanJAltaie Thank you! That's fast as lightning!

Unfortunately, anything that can go wrong will go wrong. The latest log shows a 404 when accessing https://zh.minecraft.wiki/w/api.php?action=query&meta=siteinfo&format=json&siprop=general|namespaces|statistics|variables|category|wikidesc.

This is because the API path is /api.php, not the default /w/api.php. For more information, please check out Special:Version.
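
A minimal sketch of why the path matters, assuming the request URL is simply the wiki's base URL joined with the configured API path. buildApiUrl is a hypothetical helper for illustration, not MWoffliner's actual code, and the query string is shortened:

```typescript
// Hypothetical helper (not MWoffliner's code): join a wiki base URL and an
// API path, then append the query string.
function buildApiUrl(mwUrl: string, mwApiPath: string, query: string): string {
  const root = mwUrl.replace(/\/+$/, '') // drop trailing slashes
  const path = mwApiPath.replace(/^\/+/, '') // drop leading slashes
  return `${root}/${path}?${query}`
}

const siteInfoQuery = 'action=query&meta=siteinfo&format=json'

// Default path: the kind of URL that returned 404 for this wiki.
console.log(buildApiUrl('https://zh.minecraft.wiki/', '/w/api.php', siteInfoQuery))
// Path that this wiki actually serves, according to the comment above.
console.log(buildApiUrl('https://zh.minecraft.wiki/', '/api.php', siteInfoQuery))
```

Only the second URL points at the endpoint this wiki exposes, which is why the recipe needs mwApiPath set explicitly.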

xtexChooser commented 8 months ago

the language should be zh instead of nan

RavanJAltaie commented 8 months ago

@xtexChooser @TripleCamera Thanks for your notes. Everything is fixed in the recipe; I re-ran it and will follow up.

TripleCamera commented 8 months ago

@RavanJAltaie The language is correct now. However, the value of mwApiPath is not correct. Please change it to /api.php, thanks.

TripleCamera commented 7 months ago

@RavanJAltaie Good news: I just set up the docker environment used by openZIM scrapers. I am importing the config used by the scraper. Then I will try to fix the errors on my machine. I will post a list of corrected arguments once I finish.

Update: Here is the script:

#!/bin/bash

# Usage: sudo ./run.sh

# For docker:
#     Added: --rm
#     Modified: -v
#     Removed: --detach, --cpu-shares, --memory-swappiness, --memory
# For mwoffliner:
#     Modified: --adminEmail, --customZimDescription
#     Removed: --optimisationCacheUrl, --osTmpDir
docker run \
    -v /home/co-eda/mwoffliner-docker/output:/output:rw \
    --name mwoffliner_minecraftwiki_zh_all \
    --rm \
    ghcr.io/openzim/mwoffliner:1.13.0 \
    mwoffliner \
    --adminEmail="TripleCamera@outlook.com" \
    --customZimDescription="Docker test" \
    --customZimFavicon="https://zh.minecraft.wiki/images/Wiki2x.png" \
    --customZimLanguage="zho" \
    --customZimTitle="Minecraft Wiki (zh)" \
    --format="novid:maxi" \
    --mwApiPath="/api.php" \
    --mwUrl="https://zh.minecraft.wiki/" \
    --outputDirectory="/output" \
    --publisher="openZIM" \
    --webp

TripleCamera commented 7 months ago

@RavanJAltaie TL;DR Please set --customZimFavicon to https://zh.minecraft.wiki/images/Wiki%402x.png, thanks.


I saw that the value of --mwApiPath had been changed to /api.php. However, at the same time, the %40 sequence (the URL-encoded @) in --customZimFavicon had been removed by someone. Please add it back.
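
A short sketch of why dropping %40 breaks the URL, assuming the file on the wiki is named Wiki@2x.png (which is what the requested URL decodes to). faviconUrl is a hypothetical helper used only for illustration:

```typescript
// Hypothetical helper: percent-encode a file name and append it to a base URL.
// encodeURIComponent escapes "@" as "%40".
function faviconUrl(imagesBase: string, fileName: string): string {
  return imagesBase + encodeURIComponent(fileName)
}

const imagesBase = 'https://zh.minecraft.wiki/images/'

// Correctly encoded URL, as requested in this comment.
console.log(faviconUrl(imagesBase, 'Wiki@2x.png'))
// Decoding the encoded name gives back the real file name.
console.log(decodeURIComponent('Wiki%402x.png'))
// Deleting "%40" does not decode it; it yields a different file name.
console.log('Wiki%402x.png'.replace('%40', ''))
```

The last line produces Wiki2x.png, the value seen in the broken configuration, which names a different file.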

The next issue I encountered after fixing this was:

Unable to find appropriate API end-point to retrieve article HTML

I am still investigating this.

TripleCamera commented 7 months ago

I found the cause of the "Unable to find appropriate API end-point to retrieve article HTML" error. Here is a code analysis of MWoffliner v1.13.0 (since all the scrapers are using it).

Before the scrape starts, MWoffliner checks mobile REST API, desktop REST API, and VE REST API capabilities for a specific page (parameter testArticleId) in Downloader.checkCapabilities:

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/Downloader.ts#L243-L263

  public async checkCapabilities(testArticleId = 'MediaWiki:Sidebar'): Promise<void> {
    // By default check all API's responses and set the capabilities
    // accordingly. We need to set a default page (always there because
    // installed per default) to request the REST API, otherwise it would
    // fail the check.
    this.mwCapabilities.mobileRestApiAvailable = await this.checkApiAvailabilty(this.mw.getMobileRestApiArticleUrl(testArticleId))
    this.mwCapabilities.desktopRestApiAvailable = await this.checkApiAvailabilty(this.mw.getDesktopRestApiArticleUrl(testArticleId))
    this.mwCapabilities.veApiAvailable = await this.checkApiAvailabilty(this.mw.getVeApiArticleUrl(testArticleId))
    this.mwCapabilities.apiAvailable = await this.checkApiAvailabilty(this.mw.apiUrl.href)

    // Coordinate fetching
    // [...]
  }

The default value MediaWiki:Sidebar is never used because the value of mwMetaData.mainPage is passed:

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/mwoffliner.lib.ts#L206

  await downloader.checkCapabilities(mwMetaData.mainPage)

The value of mwMetaData.mainPage comes from the API. It is derived from the base field of the siteinfo response: the URL is split on / and only its last segment is kept. (This is a bad idea because different wikis have different URL rewrites.)

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L290-L325

  public async getMwMetaData(downloader: Downloader): Promise<MWMetaData> {
    if (this.metaData) {
      return this.metaData
    }

    const creator = this.getCreatorName() || 'Kiwix'

    const [textDir, { langIso2, langIso3, mainPage, siteName }, subTitle] = await Promise.all([
      this.getTextDirection(downloader),
      this.getSiteInfo(downloader),
      this.getSubTitle(downloader),
    ])

    const mwMetaData: MWMetaData = {
      // [...]
      mainPage,
    }

    this.metaData = mwMetaData

    return mwMetaData
  }

https://github.com/openzim/mwoffliner/blob/e9d4113536f0eebdaabe8cc26e25ccdeeca20e32/src/MediaWiki.ts#L235-L279

  public async getSiteInfo(downloader: Downloader) {
    logger.log('Getting site info...')
    const query = 'action=query&meta=siteinfo&format=json&siprop=general|namespaces|statistics|variables|category|wikidesc'
    const body = await downloader.query(query)
    const entries = body.query.general

    // Checking mediawiki version
    const mwVersion = semver.coerce(entries.generator).raw
    const mwMinimalVersion = 1.27
    if (!entries.generator || !semver.satisfies(mwVersion, `>=${mwMinimalVersion}`)) {
      throw new Error(`Mediawiki version ${mwVersion} not supported should be >=${mwMinimalVersion}`)
    }

    // Base will contain the default encoded article id for the wiki.
    const mainPage = decodeURIComponent(entries.base.split('/').pop())
    const siteName = entries.sitename

    // [...]

    return {
      mainPage,
      siteName,
      langIso2,
      langIso3,
    }
  }

This works for many wikis like English Wikipedia, but not for Chinese Minecraft Wiki. The reason is that MCW-zh uses URL rewriting, so its base URL has no page-title segment:

// Wikipedia-en
"base": "https://en.wikipedia.org/wiki/Main_Page",
// MCW-zh
"base": "https://zh.minecraft.wiki/",

Currently I don't know how to fix this. Do you have any ideas?
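
For reference, the difference can be reproduced in isolation. The sketch below applies the same split-and-decode expression quoted above from getSiteInfo to both base values; mainPageFromBase is a hypothetical wrapper added here for illustration, not a function in MWoffliner:

```typescript
// Same logic as the quoted line in getSiteInfo: take the last "/"-separated
// segment of siteinfo's "base" URL and decode it.
function mainPageFromBase(base: string): string {
  const lastSegment = base.split('/').pop() ?? ''
  return decodeURIComponent(lastSegment)
}

const fromWikipediaEn = mainPageFromBase('https://en.wikipedia.org/wiki/Main_Page')
const fromMcwZh = mainPageFromBase('https://zh.minecraft.wiki/')

console.log(JSON.stringify(fromWikipediaEn)) // "Main_Page"
console.log(JSON.stringify(fromMcwZh)) // ""
```

Because the MCW-zh base ends with a slash, the last segment is an empty string, so checkCapabilities is called with an empty article title.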

rgaudin commented 7 months ago

> Currently I don't know how to fix this. Do you have any ideas?

I think you should open a ticket at mwoffliner referencing your comment.

kelson42 commented 7 months ago

I have fixed the recipe - which was wrongly configured - earlier today. We have to document how to configure mwoffliner properly! But no (visual editor) API is available. I have tried with version 1.14 (still in dev), which has more API end-point support, but I'm not done with this yet.

TripleCamera commented 7 months ago

> I think you should open a ticket at mwoffliner referencing your comment.

Okay, I just opened openzim/mwoffliner#1995.

Both the code and the config differ a lot between v1.13.0 and git main, so I need to adapt the config and test this on git main.

I don't know if this issue can be fixed without modifying code. The worst case would be switching to git main. :frowning_face:

TripleCamera commented 7 months ago

> I have fixed the recipe - which was wrongly configured - earlier today. We have to document how to configure mwoffliner properly! But no (visual editor) API is available. I have tried with version 1.14 (still in dev), which has more API end-point support, but I'm not done with this yet.

Thank you! However, the config format differs between v1.13.0 and git main, so the config needs to be rewritten to make it work.

In v1.13.0 (I will test git main later), MWoffliner accepts three different APIs:


Update: @xtexChooser inspired me to try the Parsoid API, whose URL is /rest.php/{domain}/v3/page/html/{title}. So I set --mwRestApiPath="/rest.php/zh.minecraft.wiki/v3/page/html". However, this request is redirected to /rest.php/{domain}/v3/page/html/{title}/{latest_revision}. Since the response code is 302, not 200, the endpoint is regarded as inaccessible.
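
A rough sketch of the situation described above, assuming the availability check accepts only a 200 response, as the comment reports. parsoidPageUrl and countsAsAvailable are hypothetical names, the page title is a placeholder, and none of this is MWoffliner's actual code:

```typescript
// Build a Parsoid page URL of the form
// /rest.php/{domain}/v3/page/html/{title} (pattern taken from the comment above).
function parsoidPageUrl(mwUrl: string, domain: string, title: string): string {
  const root = mwUrl.replace(/\/+$/, '') // drop trailing slashes
  return `${root}/rest.php/${domain}/v3/page/html/${encodeURIComponent(title)}`
}

// Hypothetical strict check: only a plain 200 counts as "available",
// so a 302 redirect to the latest revision is treated as a failure.
function countsAsAvailable(status: number): boolean {
  return status === 200
}

// 'Example_Page' is a placeholder title, not a page known to exist.
console.log(parsoidPageUrl('https://zh.minecraft.wiki/', 'zh.minecraft.wiki', 'Example_Page'))
console.log(countsAsAvailable(200)) // true
console.log(countsAsAvailable(302)) // false
```

Under this assumption, the endpoint would pass the check only if redirects were followed or the revision ID were already part of the tested URL.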

TripleCamera commented 6 months ago

Upstream? All right, I will post my progress in the upstream issue.

TripleCamera commented 1 month ago

I'm back. openzim/mwoffliner#1995 has been fixed, which enables MWoffliner to scrape MCW-zh. However, the recipe still fails due to incorrect arguments.

@RavanJAltaie Hi. Could you please fix the recipe? The steps are:

TripleCamera commented 1 month ago

Can someone remove the "Upstream" label and reassign @RavanJAltaie? Thanks.