openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0
289 stars 73 forks source link

[REGRESSION] Language are not written properly in the ZIM file names #624

Closed kelson42 closed 5 years ago

kelson42 commented 5 years ago

For example, scraping https://pdc.wikipedia.org should generate file like wikipedia_pdc_all_nopic_2019-02.zim but now it generates a file like wikipedia_de_all_nopic_2019-02.zim

This needs to be fixed in a 1.8 maintenance release

ISNIT0 commented 5 years ago

This bug is weird. The behaviour is to use mwMetadata.langISO2 in the file name. When we retrieve the wiki language, PDC is ISO3 not 2, which means we don't have an ISO2 language. The language conversion library we use doesn't recognise PDC as a language, so we switch to use the fallback of de.

I've proposed a change #625 which reverts back to the old logic: Set mwMetadata.langISO2 to be whatever language is returned, even if it's not an ISO2 code.

This now makes mwoffliner behave the same way it did, but this is not correct. At some point we should either switch to using ISO3 codes in the filenames, or find another solution to this.

Thoughts @kelson42? I don't know what could break by changing the filename

kelson42 commented 5 years ago

@ISNIT0 metadata is iso3 and filename is iso2. So change #625 should not be implemented. You should come back to the 1.7 or even 1.6 logic.

ISNIT0 commented 5 years ago

But you said that you want pdc in the file name. This is not iso2

ISNIT0 commented 5 years ago

This change brings it back to 1.7 logic

kelson42 commented 5 years ago

@ISNIT0 I forgot to say: if there is no iso2, then write iso3. pdc is in that case, de is not the iso2 version of pdc. But I think you should not try to reinvent something, just rollback to the previous logic. It was correct.

kelson42 commented 5 years ago

Here is the logic in 1.5, this does not seem to work the same manner in 1.8

  /* Language */
    env.zim.langIso2 = entries.lang;
    countryLanguage.getLanguage(env.zim.langIso2, function (error, language) {
      if (error || !language.iso639_3) {
        env.zim.langIso3 = env.zim.langIso2;
      } else {
        env.zim.langIso3 = language.iso639_3;
      }
      cb();
    });
ISNIT0 commented 5 years ago

I've implemented and merged this change