spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

Runtime errors with `ja`, `zh`, and `id`. #550

Closed (cyanic-selkie closed 1 year ago)

cyanic-selkie commented 1 year ago

Hi,

Thank you for the awesome library!

I am currently using dumpster-dip to generate a dataset from all Wikipedia languages. It ran fine for every language except ja, zh, and id.

Specifically, for ja and zh I got the following error:

TypeError [Error]: Cannot read properties of undefined (reading '0')
    at Object.max (file:///.../node_modules/wtf_wikipedia/src/template/custom/text-only/functions.js:544:25)
    at parseTemplate (file:///.../node_modules/wtf_wikipedia/src/template/parse/index.js:60:32)
    at parseNested (file:///.../node_modules/wtf_wikipedia/src/template/index.js:18:24)
    at file:///.../node_modules/wtf_wikipedia/src/template/index.js:40:28
    at Array.forEach (<anonymous>)
    at allTemplates (file:///.../node_modules/wtf_wikipedia/src/template/index.js:40:10)
    at process (file:///.../node_modules/wtf_wikipedia/src/template/index.js:51:24)
    at new Section (file:///.../node_modules/wtf_wikipedia/src/02-section/Section.js:59:5)
    at parseSections (file:///.../node_modules/wtf_wikipedia/src/02-section/index.js:69:19)
    at new Document (file:///.../node_modules/wtf_wikipedia/src/01-document/Document.js:81:22)

For id, I got:

TypeError [Error]: Cannot read properties of undefined (reading 'substr')
    at str mid (file:///.../node_modules/wtf_wikipedia/src/template/custom/text-only/functions.js:68:20)
    at parseTemplate (file:///.../node_modules/wtf_wikipedia/src/template/parse/index.js:60:32)
    at parseNested (file:///.../node_modules/wtf_wikipedia/src/template/index.js:18:24)
    at file:///.../node_modules/wtf_wikipedia/src/template/index.js:15:36
    at Array.forEach (<anonymous>)
    at parseNested (file:///.../node_modules/wtf_wikipedia/src/template/index.js:15:20)
    at file:///.../node_modules/wtf_wikipedia/src/template/index.js:40:28
    at Array.forEach (<anonymous>)
    at allTemplates (file:///.../node_modules/wtf_wikipedia/src/template/index.js:40:10)
    at process (file:///.../node_modules/wtf_wikipedia/src/template/index.js:51:24)

It is also worth noting that for es to complete successfully, I had to set --max-old-space-size to 20000, which seems excessive, especially since no other language requires changing the default. If I left it at default (or even set to 10000), I got the following error:

Error [ERR_WORKER_OUT_OF_MEMORY]: Worker terminated due to reaching memory limit: JS heap out of memory
    at new NodeError (node:internal/errors:405:5)
    at [kOnExit] (node:internal/worker:313:26)
    at Worker.<computed>.onexit (node:internal/worker:229:20) {
  code: 'ERR_WORKER_OUT_OF_MEMORY'
}

spencermountain commented 1 year ago

Thank you! I'll take a look at fixing these runtime errors tomorrow, and will release a fix for them asap. Cheers

spencermountain commented 1 year ago

hey @cyanic-selkie - both errors should be fixed now in 10.1.6. Let me know if you see any others.

Yeah - the es memory issue looks like a memory leak in dumpster-dip. Can you help me reproduce it? I haven't seen it before. Cheers

cyanic-selkie commented 1 year ago

I just reran it for id, zh, and ja and it works without any errors.

The es issue remains. I am using node==20.5.1 on a server with 64 threads and 128 GB of RAM. The code is here. I tried it with 64 workers and with 8 workers; in both cases the error happens after a few minutes of parsing. Do you need any additional information to help you reproduce it?

On a side note, I'd like to suggest using ^10 or similar for the wtf_wikipedia dependency version if you're using SemVer, since I had to clone the repository in order to update to the new version.

spencermountain commented 1 year ago

thanks - that's a real doozy. Wonder why it's only Spanish?? I looked at the script, and a few of the variables in it are never declared, which may be what causes it.

I just ran es on my Mac and it ran smoothly:

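import path from 'path'
import dip from 'dumpster-dip'

// placeholders, not part of the original comment: adjust to your setup
const lang = 'es' // wiki language code
const dir = process.cwd() // directory holding the unpacked dump

const opts = {
  input: path.join(dir, `/${lang}wiki-latest-pages-articles.xml`),
  outputMode: "ndjson",
  outputDir: path.join(dir, lang),
  // return the JSON form of each parsed page
  parse: function (doc) {
    return doc.json()
  }
}
dip(opts).then(() => {
  console.log('done!')
})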

Will you try that on your machine? Cheers

spencermountain commented 1 year ago

good idea using ^10. Will add that to the next release.
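
For readers unfamiliar with caret ranges, here is a rough sketch of what ^10 would accept. It is a simplified check written for illustration, not the npm resolver: the helper name satisfiesCaret is hypothetical, it only handles ranges with a major version of 1 or higher, and it ignores pre-release tags. The version 10.1.6 is the fixed release mentioned above.

```javascript
// Simplified caret check, valid only for ranges whose major version is >= 1.
// Real SemVer treats ^0.x specially and also handles pre-release tags.
function satisfiesCaret(version, range) {
  const base = range.replace(/^\^/, '').split('.').map(Number)
  while (base.length < 3) base.push(0) // "^10" is shorthand for "^10.0.0"
  if (base[0] < 1) throw new Error('this sketch only handles major >= 1')

  const v = version.split('.').map(Number)
  if (v[0] !== base[0]) return false // the major version must match exactly
  // minor.patch must be at or above the lower bound of the range
  if (v[1] !== base[1]) return v[1] > base[1]
  return v[2] >= base[2]
}

console.log(satisfiesCaret('10.1.6', '^10')) // true
console.log(satisfiesCaret('11.0.0', '^10')) // false
```

Under that reading, a dependency declared as ^10 would pick up a patch release such as 10.1.6 on a fresh install, while still excluding a breaking 11.x release.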

cyanic-selkie commented 1 year ago

@spencermountain I just fixed the variable declarations and it works perfectly. I'm not used to JS, so thanks!
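
The thread does not say which variables were undeclared or exactly how they led to the memory growth, so the following is only a sketch of the general JavaScript behavior that may be relevant. In non-strict code, assigning to an undeclared name quietly creates a global that outlives the function call; in strict mode the same assignment throws. All function and variable names below are hypothetical.

```javascript
// Sloppy-mode function (the Function constructor never inherits strict mode):
// assigning to the undeclared name `seenPages` silently creates a global.
const sloppyParse = new Function('title', `
  if (typeof seenPages === 'undefined') seenPages = []
  seenPages.push(title)
  return seenPages.length
`)

// Same logic in strict mode: the undeclared assignment throws instead.
const strictParse = new Function('title', `
  'use strict'
  if (typeof strictPages === 'undefined') strictPages = []
  strictPages.push(title)
  return strictPages.length
`)

// Declared version: the array is local, so it can be released after each call.
function declaredParse(title) {
  const pages = []
  pages.push(title)
  return pages.length
}

sloppyParse('Madrid')
sloppyParse('Sevilla')
console.log(globalThis.seenPages.length) // 2, the array outlives both calls

let strictError = null
try {
  strictParse('Madrid')
} catch (err) {
  strictError = err
}
console.log(strictError instanceof ReferenceError) // true
console.log(declaredParse('Madrid')) // 1
```

Declaring every variable with const or let, or running the script in strict mode so that a missed declaration fails loudly, avoids this kind of accidental global.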