spencermountain / dumpster-dive

roll a wikipedia dump into mongo

wp2mongo hangs during article insertion. #21

Closed MTKnife closed 6 years ago

MTKnife commented 6 years ago

Every time I run wp2mongo (without the --worker flag), it hangs after extracting 777 articles from the (English) Wikipedia bz2 dump. When I re-run it without dropping the database, I get a bunch of duplicate insertion errors, and, on examining them, it appears that it's trying to insert every article in the last batch except for the last one, on Ibn al-Haytham, which means the problem must be occurring either during that insertion or between the Algiers article and the Ibn al-Haytham article.

There's nothing suspicious happening--no excessive CPU or RAM usage, and I'm not getting errors of any kind (other than the duplicate insertion errors mentioned above).

I've tried the obvious things, rebooting and reinstalling, but without any errors being thrown, I'm at a loss for what else to try.

e501 commented 6 years ago

This also happened to me yesterday.

Over the last few weeks, I have been using parsing results from a dump file that loaded approx. 5.7 million articles a few weeks ago. For the articles that were parsed, the results have worked great with mongolite (R package).

Yesterday, when I tried to rerun wp2mongo for the same enwiki file, the code hangs at the 777th article (Ibn al-Haytham), just as described in the posting. I then tried the most recent dump of enwiki and had the same result. Rebooting and reinstalling produced the same results. Note that the afwiki dump loads into mongodb with no problems.

Also, at the wtf_wikipedia web page (https://spencermountain.github.io/wtf_wikipedia/), the "Algiers" and other previous articles are immediately parsed and displayed, but a fetch for the "Ibn al-Haytham" article is not responsive.

spencermountain commented 6 years ago

agh, weird. thanks, you two. it looks like there is a bug on the Ibn al-Haytham article in wtf_wikipedia. That's got to be it. I will try to fix it today.

spencermountain commented 6 years ago

fwiw, maybe this library should wrap wtf_wikipedia in a try/catch. It would make things slower (i think?) but these sorts of errors are probably somewhat inevitable, given the size & nature of wikipedia. any insight or thoughts about doing that?
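roughly what i'm picturing, around the wtf call in doPage.js (just a sketch, untested):

  let data = null;
  try {
    data = wtf.parse(options.script);
  } catch (e) {
    // skip this article instead of crashing the whole import
    console.error('skipping "' + options.title + '": ' + e.message);
    return cb(e);
  }
  // ...then carry on with the encode + mongo insert as before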

e501 commented 6 years ago

From previous experience using JSON parsers on Wikipedia articles, there has been an issue with extraneous characters and character combinations embedded in the articles. My thinking has been that, to guard against confusing the respective parser, a preprocessor may need to validate that the string/blob of characters can be parsed, relative to the known specification and associated constraints/capabilities of the given parser. At this time, I use an ad-hoc collection of regular expressions that strip out potentially problematic character strings.

Basically, the idea is to have something like a lint process (https://en.wikipedia.org/wiki/Lint_(software)) for flagging suspicious character combinations that may be in the input string (i.e. Wikipedia article). In this case where the parser is hanging (Ibn al-Haytham), there may be an issue with the additional types of characters that are embedded with the more usual English language characters.
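As a rough illustration (the two patterns below are just placeholders for my ad-hoc collection, not a vetted list):

  // pre-flight pass: strip character sequences that have confused parsers for me before
  function preflight(wikitext) {
    return wikitext
      .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, '') // control characters
      .replace(/\u00AD/g, '');                                  // soft hyphens
  }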

spencermountain commented 6 years ago

hey, fixed this Ibn al-Haytham issue in v2.1.0. can you guys check it out? cheers

e501 commented 6 years ago

Many thanks for fixing the initial problem.

With the new fix, there are new recurring parsing errors, such as the following:

Julia Kristeva 10468 Error: key 0-19-518767-9}}, [https://books.google.com/books?id must not contain '.' at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27) Juan Miro 10469 Just intonation 10470 Josephus 10471 Jan Borukowski 10472 Judy Blume 10473 Joel Marangella 10474 John Pople 10475 Jake McDuck 10476 Jerry Falwell 10477 Jebus 10478 Jay Leno 10479 Jeroboam II 10480 JTF-CNO 10481 Joan of Arc 10482 Error: key Centre Historique des Archives Nationales]], [[Paris]], AE II 2490, dated to the second half of the 15th century. "The later, fifteenth-century manuscript of [[Charles, Duke of Orléans]] contains a miniature of Joan in armour; the face has certain characteristic features known from her contemporaries' descriptions, and the artist may have worked from indications by someone who had known her." Joan M. Edmunds, The Mission of Joan of Arc (2008), [https://books.google.com/books?id must not contain '.' at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27) Johannes Nicolaus Brønsted 10483 Janus kinase 10484 Jacob Grimm 10485 Jamiroquai 10486 John Sutter 10487 John Adams (composer) 10488 Jon Voight 10489 Error: key 978-0-306-80900-2}}, p.236. 
[https://books.google.com/books?id must not contain '.' at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27) John Climacus 10490 John of the Ladder 10491
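For reference, those failures are the BSON serializer rejecting document keys that contain a '.', so one stopgap would be to rewrite the keys before the insert, roughly like this (sketch only; the replacement character is arbitrary):

  // recursively rewrite keys so they are legal BSON field names
  function cleanKeys(obj) {
    if (Array.isArray(obj)) {
      return obj.map(cleanKeys);
    }
    if (obj && typeof obj === 'object') {
      var out = {};
      Object.keys(obj).forEach(function (k) {
        out[k.replace(/\./g, '_')] = cleanKeys(obj[k]);
      });
      return out;
    }
    return obj;
  }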

At the wtf_wikipedia web page, the parser appears to return partial results for the problematic articles; some of the sections look to be missing content.

MTKnife commented 6 years ago

On the subject of the try/catch block, I did a quick search, and what I found suggests that what's true for Python (with which I'm familiar) is also true of JavaScript (with which I'm not familiar): unless there's very little happening in the try/catch block, the cost is trivial when there's no error. I think, given the size and complexity of Wikipedia, especially when you consider the multiple available languages, that you really should have error-catching, or you're likely to run into this kind of problem again. It would probably also be a good idea to produce an output that indicates any articles that have been skipped.

spencermountain commented 6 years ago

yeah, ok. that convinced me.

oh, @e501 i know what that is. will try to do it today.

thanks for the help!

spencermountain commented 6 years ago

@e501: fixed the encodings for citation keys, and @MTKnife: added try/catch statements around the wikipedia parsing stuff. v2.2.0, tests passing. lemme know

MTKnife commented 6 years ago

I'm running it with Redis right now, and I think it's working, but I'm not entirely sure what I'm supposed to be seeing. I'm no longer getting errors (or any feedback whatsoever) when I run "worker.js", and the "wikipedia" collection in the "wikipedia_queue" database is getting steadily larger. However, I've noticed that the database is initialized and insertions start to take place before I run "worker.js", which leads me to assume that "wp2mongo" is inserting unparsed articles into the db. Presumably "worker.js" then goes back, parses those articles, and modifies the records in question--so what should I look for to be sure that the records I'm seeing have been parsed?
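The only sanity check I can think of is to compare the total count against documents that actually contain parser output, on the assumption that parsed records carry wtf_wikipedia fields such as "sections" and "categories" (just a guess on my part):

  // in the mongo shell, against the wikipedia_queue database
  db.wikipedia.count()
  db.wikipedia.find({ sections: { $exists: true } }).count()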

Whoops...we just got some kind of Redis error. Here's what it looks like:

La Puente, California 71751 La Verne, California 71752 Ladera Heights, California 71753 events.js:183 throw er; // Unhandled 'error' event ^

Error: Redis connection to localhost:6379 failed - read ECONNRESET at _errnoException (util.js:1024:11) at TCP.onread (net.js:615:25)

Since I'm about to leave work, I'm re-running it without the Redis, and so far that's working.

UPDATE: The Redis thing appears to be some kind of issue external to wp2mongo. See this link.

e501 commented 6 years ago

Many thanks for your rapid turnaround on the most recent fixes. This evening, I was able to process up to 5508058 articles from the enwiki-20171120-pages-articles.xml.bz2 dump. Unfortunately, right after the "Love You More (Ginuwine song) 5508058" string was output to the terminal, the following core dump occurred:

Stacktrace: magic1=bbbbbbbb magic2=bbbbbbbb ptr1=0xdc8db802459 ptr2=(nil) ptr3=(nil) ptr4=(nil) ptr5=(nil) ptr6=(nil) ptr7=(nil) ptr8=(nil)

==== JS stack trace =========================================

Security context: 0x38b8e6425729 #0# 2: parse [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/doPage.js:64] [bytecode=0x2b06ad8b2651 offset=50](this=0x11d500ff4251 #1#,options=0xdc8db802459 ,cb=0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2#) 3: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/index.js:75] [bytecode=0x2b06ad8b1091 offset=298](this=0x4639680e1f1 #3#,page=0x2166a5e4fd81 #4#) 4: arguments adaptor frame: 3->1 5: emitThree(aka emitThree) [events.js:~143] [pc=0x2dc36ad157b](this=0xdc8db8022d1 ,handler=0x4639680ef11 <JSFunction (sfi = 0x2b06ad88d501)>#5#,isFn=0xdc8db802371 ,self=0x4639680e1f1 #3#,arg1=0x2166a5e4fd81 #4#,arg2=0x2166a5e4fd49 #6#,arg3=0x4639680e349 #7#) 6: fn [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~131] [pc=0x2dc3133b320](this=0x11d53748c0f9 #8#,element=0x2166a5e4fd81 #4#,context=0x2166a5e4fd49 #6#,trace=0x4639680e349 #7#) 7: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~401] [pc=0x2dc31356b67](this=0x4639680e2b9 #9#,name=0x2166a5e41669 <String[4]: page>) 8: emit [events.js:~165] [pc=0x2dc31394c09](this=0x4639680e2b9 #9#,/ anonymous /=0x2166a5e41649 <String[10]: endElement>) 9: arguments adaptor frame: 2->1 13: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~519] [pc=0x2dc36afc747](this=0x22f939cc0011 #10#,data=0x2166a5e0a251 #11#) 14: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0011 #10#,/ anonymous /=0x38b8e6434769 <String[4]: data>) 15: arguments adaptor frame: 2->1 16: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:~64] [pc=0x2dc36ae48a3](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) 17: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:~25] [pc=0x2dc319cb8c0](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) 18: ondata [_stream_readable.js:642] [bytecode=0x2b06ad89aad1 offset=30](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#) 19: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0271 #13#,/ anonymous /=0x38b8e6434769 <String[4]: data>) 20: arguments adaptor frame: 2->1 21: addChunk(aka addChunk) [_stream_readable.js:265] [bytecode=0xfc3902f3091 offset=35](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,state=0x22f939cc0359 #14#,chunk=0x2a47dc0022b9 #12#,addToFront=0xdc8db8023e1 ) 22: readableAddChunk(aka readableAddChunk) [_stream_readable.js:252] [bytecode=0xfc3902f2789 offset=377](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ,addToFront=0xdc8db8023e1 ,skipChunkCheck=0xdc8db8022d1 ) 23: push [_stream_readable.js:209] [bytecode=0xfc3902f2481 offset=89](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ) 24: arguments adaptor frame: 1->2 25: onread(aka onread) [fs.js:2095] [bytecode=0x2b06ad89a6f1 offset=122](this=0xdc8db8022d1 ,er=0xdc8db802201 ,bytesRead=65536) 26: arguments adaptor frame: 3->2 27: oncomplete(aka wrapper) [fs.js:676] [bytecode=0x2b06ad89a571 offset=23](this=0x2a47dc002479 #15#,err=0xdc8db802201 ,bytesRead=65536)

==== Details ================================================

[2]: parse [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/doPage.js:64] [bytecode=0x2b06ad8b2651 offset=50](this=0x11d500ff4251 #1#,options=0xdc8db802459 ,cb=0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2#) { // stack-allocated locals var data = 0x10d0e7fed429 #16# // heap-allocated locals var cb = 0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2# // expression stack (top to bottom) [05] : 0xdc8db802569 [04] : 0xdc8db802569 [03] : 0xdc8db802569 [02] : 0xdc8db802569 [01] : 0x11d500ff3569 <FixedArray[7]>#17# --------- s o u r c e c o d e --------- function parse(options, cb) {\x0a let data = wtf.parse(options.script);\x0a data = encodeData(data);\x0a data.title = options.title;\x0a data._id = encodeStr(options.title);\x0a // options.collection.update({ _id: data._id }, data, { upsert: true }, function(e) {\x0a options.collection.insert(data, function(e) {\x0a if (e) {...


}

[3]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/index.js:75] [bytecode=0x2b06ad8b1091 offset=298](this=0x4639680e1f1 #3#,page=0x2166a5e4fd81 #4#) { // expression stack (top to bottom) [11] : 0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2# [10] : 0xdc8db802459 [09] : 0x11d500ff4251 #1# [08] : 0xdc8db802569 [07] : 0xdc8db802569 [06] : 0xdc8db802569 [05] : 0xdc8db802569 [04] : 0xdc8db802569 [03] : 0xdc8db802569 [02] : 0x4639680f091 <FixedArray[8]>#18# [01] : 0xdc8db802569 [00] : 0xdc8db802569 --------- s o u r c e c o d e --------- function (page) {\x0a if (page.ns === '0') {\x0a let script = page.revision.text['$text'] || '';\x0a\x0a console.log(leftPad(page.title) + ' ' + i);\x0a ++i;\x0a\x0a let data = {\x0a title: page.title,\x0a script: script\x0a };\x0a\x0a if (obj.worker) {\x0a // we send job t...


}

[4]: arguments adaptor frame: 3->1 { // actual arguments [00] : 0x2166a5e4fd81 #4# [01] : 0x2166a5e4fd49 #6# // not passed to callee [02] : 0x4639680e349 #7# // not passed to callee }

[5]: emitThree(aka emitThree) [events.js:~143] [pc=0x2dc36ad157b](this=0xdc8db8022d1 ,handler=0x4639680ef11 <JSFunction (sfi = 0x2b06ad88d501)>#5#,isFn=0xdc8db802371 ,self=0x4639680e1f1 #3#,arg1=0x2166a5e4fd81 #4#,arg2=0x2166a5e4fd49 #6#,arg3=0x4639680e349 #7#) { // optimized frame --------- s o u r c e c o d e --------- function emitThree(handler, isFn, self, arg1, arg2, arg3) {\x0a if (isFn)\x0a handler.call(self, arg1, arg2, arg3);\x0a else {\x0a var len = handler.length;\x0a var listeners = arrayClone(handler, len);\x0a for (var i = 0; i < len; ++i)\x0a listeners[i].call(self, arg1, arg2, arg3);\x0a }\x0a}

} [6]: fn [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~131] [pc=0x2dc3133b320](this=0x11d53748c0f9 #8#,element=0x2166a5e4fd81 #4#,context=0x2166a5e4fd49 #6#,trace=0x4639680e349 #7#) { // optimized frame --------- s o u r c e c o d e --------- function fn(element, context, trace) {\x0a self.emit(event.name, element, context, trace);\x0a }

} [7]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~401] [pc=0x2dc31356b67](this=0x4639680e2b9 #9#,name=0x2166a5e41669 <String[4]: page>) { // optimized frame --------- s o u r c e c o d e --------- function (name) {\x0a self.emit('endElement', name);\x0a var prev = stack.pop();\x0a var element = curr.element;\x0a var text = curr.fullText;\x0a var attr = element.$;\x0a if (typeof attr !== 'object') {\x0a attr = {};\x0a }\x0a var name = element.$name;\x0a self._name = name;\x0a delete element.$;\x0a de...


} [8]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x4639680e2b9 #9#,/ anonymous /=0x2166a5e41649 <String[10]: endElement>) { // optimized frame --------- s o u r c e c o d e --------- function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...


} [9]: arguments adaptor frame: 2->1 { // actual arguments

[01] : 0x2166a5e41669 <String[4]: page> // not passed to callee }

[13]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~519] [pc=0x2dc36afc747](this=0x22f939cc0011 #10#,data=0x2166a5e0a251 #11#) { // optimized frame --------- s o u r c e c o d e --------- function (data) {\x0a if (self._encoding) {\x0a parseChunk(data);\x0a } else {\x0a // We can't parse when the encoding is unknown, so we'll look into\x0a // the XML declaration, if there is one. For this, we need to buffer\x0a // incoming data until a full tag is received.\x0a preludeBuffers.push(d...


} [14]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0011 #10#,/ anonymous /=0x38b8e6434769 <String[4]: data>) { // optimized frame --------- s o u r c e c o d e --------- function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...


} [15]: arguments adaptor frame: 2->1 { // actual arguments

[01] : 0x2166a5e0a251 #11# // not passed to callee }

[16]: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:~64] [pc=0x2dc36ae48a3](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) { // optimized frame --------- s o u r c e c o d e --------- function write(data) {\x0a //console.error('received', data.length,'bytes in', typeof data);\x0a bufferQueue.push(data);\x0a hasBytes += data.length;\x0a if (bitReader === null) {\x0a bitReader = bitIterator(function() {\x0a return bufferQueue.shift();\x0a ...


} [17]: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:~25] [pc=0x2dc319cb8c0](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) { // optimized frame --------- s o u r c e c o d e --------- function (data) {\x0a write.call(this, data)\x0a return !stream.paused\x0a }

} [18]: ondata [_stream_readable.js:642] [bytecode=0x2b06ad89aad1 offset=30](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#) { // stack-allocated locals var ret = 0xdc8db8022d1 // expression stack (top to bottom) [05] : 0x2a47dc0022b9 #12# [04] : 0x22f939cc0011 #10# [03] : 0xdc8db8022d1 [02] : 0x22f939cc0011 #10# [01] : 0x2b06ad892bb9 <JSFunction stream.write (sfi = 0x2b06ad892131)>#19# --------- s o u r c e c o d e --------- function ondata(chunk) {\x0a debug('ondata');\x0a increasedAwaitDrain = false;\x0a var ret = dest.write(chunk);\x0a if (false === ret && !increasedAwaitDrain) {\x0a // If the user unpiped during dest.write(), it is possible\x0a // to get stuck in a permanently paused state if that write\x0a // also returne...


}

[19]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0271 #13#,/ anonymous /=0x38b8e6434769 <String[4]: data>) { // optimized frame --------- s o u r c e c o d e --------- function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...


} [20]: arguments adaptor frame: 2->1 { // actual arguments

[01] : 0x2a47dc0022b9 #12# // not passed to callee }

[21]: addChunk(aka addChunk) [_stream_readable.js:265] [bytecode=0xfc3902f3091 offset=35](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,state=0x22f939cc0359 #14#,chunk=0x2a47dc0022b9 #12#,addToFront=0xdc8db8023e1 ) { // expression stack (top to bottom) [05] : 0x2a47dc0022b9 #12#

[03] : 0x22f939cc0271 #13#

[01] : 0xdc8db8022d1 [00] : 0x38b8e643b3d9 <JSFunction emit (sfi = 0x38b8e6439ee1)>#20# --------- s o u r c e c o d e --------- function addChunk(stream, state, chunk, addToFront) {\x0a if (state.flowing && state.length === 0 && !state.sync) {\x0a stream.emit('data', chunk);\x0a stream.read(0);\x0a } else {\x0a // update the buffer info.\x0a state.length += state.objectMode ? 1 : chunk.length;\x0a if (addToFront)\x0a state.buffer.unshift(chunk...


}

[22]: readableAddChunk(aka readableAddChunk) [_stream_readable.js:252] [bytecode=0xfc3902f2789 offset=377](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ,addToFront=0xdc8db8023e1 ,skipChunkCheck=0xdc8db8022d1 ) { // stack-allocated locals var state = 0x22f939cc0359 #14# var er = 0xdc8db8022d1 // expression stack (top to bottom) [11] : 0xdc8db8023e1 [10] : 0x2a47dc0022b9 #12# [09] : 0x22f939cc0359 #14# [08] : 0x22f939cc0271 #13# [07] : 0xdc8db8022d1 [06] : 0xdc8db8023e1 [05] : 0x2a47dc0022b9 #12# [04] : 0x22f939cc0359 #14# [03] : 0x22f939cc0271 #13# [02] : 0x5b65f2b92b1 <JSFunction addChunk (sfi = 0x153d2b2e0859)>#21# --------- s o u r c e c o d e --------- function readableAddChunk(stream, chunk, encoding, addToFront, skipChunkCheck) {\x0a var state = stream._readableState;\x0a if (chunk === null) {\x0a state.reading = false;\x0a onEofChunk(stream, state);\x0a } else {\x0a var er;\x0a if (!skipChunkCheck)\x0a er = chunkInvalid(state, chunk);\x0a if (er) {\x0a stream.emit('error...


}

[23]: push [_stream_readable.js:209] [bytecode=0xfc3902f2481 offset=89](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ) { // stack-allocated locals var state = 0x22f939cc0359 #14# var skipChunkCheck = 0xdc8db8022d1 // expression stack (top to bottom) [13] : 0xdc8db8022d1 [12] : 0xdc8db8023e1 [11] : 0xdc8db8022d1 [10] : 0x2a47dc0022b9 #12# [09] : 0x22f939cc0271 #13# [08] : 0xdc8db8022d1 [07] : 0xdc8db8022d1 [06] : 0xdc8db8023e1 [05] : 0xdc8db8022d1 [04] : 0x2a47dc0022b9 #12# [03] : 0x22f939cc0271 #13# [02] : 0x5b65f2b9269 <JSFunction readableAddChunk (sfi = 0x153d2b2e07b1)>#22# --------- s o u r c e c o d e --------- function (chunk, encoding) {\x0a var state = this._readableState;\x0a var skipChunkCheck;\x0a\x0a if (!state.objectMode) {\x0a if (typeof chunk === 'string') {\x0a encoding = encoding || state.defaultEncoding;\x0a if (encoding !== state.encoding) {\x0a chunk = Buffer.from(chunk, encoding);\x0a encoding = ...


}

[24]: arguments adaptor frame: 1->2 { // actual arguments [00] : 0x2a47dc0022b9 #12# }

[25]: onread(aka onread) [fs.js:2095] [bytecode=0x2b06ad89a6f1 offset=122](this=0xdc8db8022d1 ,er=0xdc8db802201 ,bytesRead=65536) { // stack-allocated locals var b = 0x2a47dc0022b9 #12# // expression stack (top to bottom) [06] : 0x2a47dc0022b9 #12# [05] : 0x22f939cc0271 #13# [04] : 65536 [03] : 0 [02] : 0x22f939cc0271 #13# [01] : 0x153d2b2e4b79 <JSFunction Readable.push (sfi = 0x153d2b2e1819)>#23# --------- s o u r c e c o d e --------- function onread(er, bytesRead) {\x0a if (er) {\x0a if (self.autoClose) {\x0a self.destroy();\x0a }\x0a self.emit('error', er);\x0a } else {\x0a var b = null;\x0a if (bytesRead > 0) {\x0a self.bytesRead += bytesRead;\x0a b = thisPool.slice(start, start + bytesRead);\x0a }\x0a\x0a self.push(b)...


}

[26]: arguments adaptor frame: 3->2 { // actual arguments [00] : 0xdc8db802201 [01] : 65536 [02] : 0x2a47dc0023a1 #24# // not passed to callee }

[27]: oncomplete(aka wrapper) [fs.js:676] [bytecode=0x2b06ad89a571 offset=23](this=0x2a47dc002479 #15#,err=0xdc8db802201 ,bytesRead=65536) { // expression stack (top to bottom) [07] : 0x2a47dc0023a1 #24# [06] : 65536 [05] : 0xdc8db802201 [04] : 0xdc8db8022d1 [03] : 0x2a47dc0023a1 #24# [02] : 65536 [01] : 0xdc8db802201 [00] : 0x2a47dc002309 <JSFunction onread (sfi = 0x2b06ad897511)>#25# --------- s o u r c e c o d e --------- function wrapper(err, bytesRead) {\x0a // Retain a reference to buffer so that it can't be GC'ed too soon.\x0a callback && callback(err, bytesRead || 0, buffer);\x0a }

}

==== Key ============================================

0# 0x38b8e6425729: 0x38b8e6425729

1# 0x11d500ff4251: 0x11d500ff4251
         parse: 0x4639685ef51 <JSFunction parse (sfi = 0x2008bde2cd9)>#26#
     plaintext: 0x4639685ef99 <JSFunction plaintext (sfi = 0x2008bde2d81)>#27#

2# 0x2166a5e41a79: 0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>

3# 0x4639680e1f1: 0x4639680e1f1

        domain: 0xdc8db802201 <null>
       _events: 0x4639680e669 <Object map = 0x212720f066c1>#28#
  _eventsCount: 3
 _maxListeners: 0xdc8db8022d1 <undefined>
       _stream: 0x22f939cc0011 <Stream map = 0x239e1ac98ca9>#10#
           _fa: 0x4639680e3c9 <FiniteAutomata map = 0x239e1ac999b9>#29#
    _lastState: 2
   _startState: 0x4639680e531 <Object map = 0x212720f023b9>#30#
  _finalStates: 0x4639680e681 <Object deprecated-map = 0x239e1ac99c21>#31#
     _emitData: 0xdc8db8023e1 <false>
  _bufferLevel: 0
_preserveLevel: 0

_preserveWhitespace: 0 _preserveAll: 0xdc8db802371 _collect: 0xdc8db8023e1 _parser: 0x4639680e2b9 #9# _encoding: 0x38b8e6434749 <String[4]: utf8> _encoder: 0xdc8db802201 _suspended: 0xdc8db8023e1 _name: 0x2166a5e4fd29 <String[4]: page>

4# 0x2166a5e4fd81: 0x2166a5e4fd81

5# 0x4639680ef11: 0x4639680ef11 <JSFunction (sfi = 0x2b06ad88d501)>

6# 0x2166a5e4fd49: 0x2166a5e4fd49
          page: 0x2166a5e4fd81 <Object map = 0x72dc3db49d1>#4#

7# 0x4639680e349: 0x4639680e349
              : 0x4639680e5c1 <Object map = 0x239e1ac9acf9>#32#
    /mediawiki: 0x3a904bb49a59 <Object map = 0x239e1ac9ba61>#33#

/mediawiki/siteinfo: 0x3a904bb49a89 #34# /mediawiki/siteinfo/namespaces: 0x3a904bb49aa1 #35# /mediawiki/page: 0x2166a5e4fd81 #4# /mediawiki/page/revision: 0x2166a5e50051 #36# /mediawiki/page/revision/contributor: 0x2166a5e50231 #37#

8# 0x11d53748c0f9: 0x11d53748c0f9

9# 0x4639680e2b9: 0x4639680e2b9

      encoding: 0x38b8e6434701 <String[5]: utf-8>
        parser: 0x4639680e631 <JSObject>#38#
      writable: 0xdc8db802371 <true>
      readable: 0xdc8db802371 <true>
       _events: 0x4639680e651 <Object map = 0x212720f066c1>#39#
  _eventsCount: 3
      _collect: 0xdc8db8023e1 <false>

10# 0x22f939cc0011: 0x22f939cc0011

        domain: 0xdc8db802201 <null>
       _events: 0x22f939cea359 <Object map = 0x212720f066c1>#40#
  _eventsCount: 7
 _maxListeners: 0xdc8db8022d1 <undefined>
      writable: 0xdc8db802371 <true>
      readable: 0xdc8db802371 <true>
        paused: 0xdc8db8023e1 <false>
   autoDestroy: 0xdc8db802371 <true>

11# 0x2166a5e0a251: 0x2166a5e0a251

12# 0x2a47dc0022b9: 0x2a47dc0022b9

13# 0x22f939cc0271: 0x22f939cc0271

_readableState: 0x22f939cc0359 <ReadableState map = 0x212720f40581>#14#
      readable: 0xdc8db802371 <true>
        domain: 0xdc8db802201 <null>
       _events: 0x22f939cc0341 <Object map = 0x212720f066c1>#41#
  _eventsCount: 2
 _maxListeners: 0xdc8db8022d1 <undefined>
          path: 0x87957d20ab9 <String[51]: /home/jayson/enwiki-20171120-pages-articles.xml.bz2>
            fd: 13
         flags: 0xdc8db80fa31 <String[1]: r>
          mode: 438
         start: 0xdc8db8022d1 <undefined>
           end: 0xdc8db8022d1 <undefined>
     autoClose: 0xdc8db802371 <true>
           pos: 0xdc8db8022d1 <undefined>
     bytesRead: <unboxed double> 7041646592

14# 0x22f939cc0359: 0x22f939cc0359

    objectMode: 0xdc8db8023e1 <false>
 highWaterMark: 65536
        buffer: 0x22f939ce9ee1 <BufferList map = 0x212720f3fef9>#42#
        length: 0
         pipes: 0x22f939cc0011 <Stream map = 0x239e1ac98ca9>#10#
    pipesCount: 1
       flowing: 0xdc8db802371 <true>
         ended: 0xdc8db8023e1 <false>
    endEmitted: 0xdc8db8023e1 <false>
       reading: 0xdc8db8023e1 <false>
          sync: 0xdc8db8023e1 <false>
  needReadable: 0xdc8db802371 <true>

emittedReadable: 0xdc8db8023e1 readableListening: 0xdc8db8023e1 resumeScheduled: 0xdc8db8023e1 destroyed: 0xdc8db8023e1 defaultEncoding: 0x38b8e6434749 <String[4]: utf8> awaitDrain: 0 readingMore: 0xdc8db8023e1 decoder: 0xdc8db802201 encoding: 0xdc8db802201

15# 0x2a47dc002479: 0x2a47dc002479

    oncomplete: 0x2a47dc0023f1 <JSFunction wrapper (sfi = 0x2b06ad89a161)>#43#

16# 0x10d0e7fed429: 0x10d0e7fed429
          type: 0x87957d7a1e9 <String[4]: page>
      sections: 0x10d0e7fed659 <JSArray[4]>#44#
     infoboxes: 0x10d0e7fed531 <JSArray[1]>#45#
     interwiki: 0x10d0e7fed371 <Object map = 0x212720f023b9>#46#
    categories: 0x10d0e7fed639 <JSArray[3]>#47#
        images: 0x10d0e7fed3c9 <JSArray[0]>#48#
   coordinates: 0x10d0e7fed3e9 <JSArray[0]>#49#
     citations: 0x10d0e7fed409 <JSArray[2]>#50#

page_identifier: 0xdc8db802201 lang_or_wikiid: 0xdc8db802201

17# 0x11d500ff3569: 0x11d500ff3569 <FixedArray[7]>

             0: 0x11d500ff35b1 <JSFunction (sfi = 0x2008bde2a49)>#51#
             1: 0x38b8e6403d41 <FixedArray[282]>#52#
             3: 0x38b8e6403d41 <FixedArray[282]>#52#
             4: 0x11d500ff42f9 <Object map = 0x239e1ac8ae99>#53#
             5: 0x11d500ff4339 <JSFunction encodeStr (sfi = 0x2008bde2b89)>#54#
             6: 0x11d500ff4381 <JSFunction encodeData (sfi = 0x2008bde2c31)>#55#

18# 0x4639680f091: 0x4639680f091 <FixedArray[8]>

             0: 0x22f939cd8799 <JSFunction (sfi = 0x3a4f11ea349)>#56#
             1: 0x22f939ce6f61 <FixedArray[8]>#57#
             3: 0x38b8e6403d41 <FixedArray[282]>#52#
             4: 0x22f939cec821 <Db map = 0x239e1ac8f569>#58#
             5: 0x4639680f349 <Collection map = 0x239e1ac980a1>#59#
             6: 5508059
             7: 0x4639680f3a9 <JSFunction done (sfi = 0x2b06ad88d651)>#60#

19# 0x2b06ad892bb9: 0x2b06ad892bb9 <JSFunction stream.write (sfi = 0x2b06ad892131)>

20# 0x38b8e643b3d9: 0x38b8e643b3d9 <JSFunction emit (sfi = 0x38b8e6439ee1)>

21# 0x5b65f2b92b1: 0x5b65f2b92b1 <JSFunction addChunk (sfi = 0x153d2b2e0859)>

22# 0x5b65f2b9269: 0x5b65f2b9269 <JSFunction readableAddChunk (sfi = 0x153d2b2e07b1)>

23# 0x153d2b2e4b79: 0x153d2b2e4b79 <JSFunction Readable.push (sfi = 0x153d2b2e1819)>

24# 0x2a47dc0023a1: 0x2a47dc0023a1

          used: 65536

25# 0x2a47dc002309: 0x2a47dc002309 <JSFunction onread (sfi = 0x2b06ad897511)>

26# 0x4639685ef51: 0x4639685ef51 <JSFunction parse (sfi = 0x2008bde2cd9)>

27# 0x4639685ef99: 0x4639685ef99 <JSFunction plaintext (sfi = 0x2008bde2d81)>

28# 0x4639680e669: 0x4639680e669

29# 0x4639680e3c9: 0x4639680e3c9

      _symbols: 0x4639680e4c1 <Object map = 0x239e1ac99bc9>#61#
       _states: 0x4639680e4f9 <Object map = 0x212720f023b9>#62#
_deterministic: 1
        _state: 0x2166a5e4fe09 <Object map = 0x212720f023b9>#63#
    _callbacks: 0x4639680e569 <Object map = 0x239e1ac99229>#64#
        _stack: 0x4639680e5a1 <JSArray[5]>#65#
     _stackPtr: 1

30# 0x4639680e531: 0x4639680e531

31# 0x4639680e681: 0x4639680e681
          page: 1

32# 0x4639680e5c1: 0x4639680e5c1
     mediawiki: 0x3a904bb49a59 <Object map = 0x239e1ac9ba61>#33#

33# 0x3a904bb49a59: 0x3a904bb49a59
             $: 0x3a904bb4adb9 <Object map = 0x239e1ac9aae9>#66#
         $name: 0x3a904bb4adf1 <String[9]: mediawiki>
         $text: 0x2166a5e4fce1 <String[2]:   >
      siteinfo: 0x3a904bb49a89 <Object map = 0x239e1ac9b329>#34#
          page: 0x2166a5e4fd81 <Object map = 0x72dc3db49d1>#4#

34# 0x3a904bb49a89: 0x3a904bb49a89

35# 0x3a904bb49aa1: 0x3a904bb49aa1

36# 0x2166a5e50051: 0x2166a5e50051

37# 0x2166a5e50231: 0x2166a5e50231

38# 0x4639680e631: 0x4639680e631

          emit: 0x4639680ee09 <JSBoundFunction (BoundTargetFunction 0x38b8e643b3d9)>#67#

39# 0x4639680e651: 0x4639680e651

40# 0x22f939cea359: 0x22f939cea359

41# 0x22f939cc0341: 0x22f939cc0341

42# 0x22f939ce9ee1: 0x22f939ce9ee1

          head: 0xdc8db802201 <null>
          tail: 0xdc8db802201 <null>
        length: 0

43# 0x2a47dc0023f1: 0x2a47dc0023f1 <JSFunction wrapper (sfi = 0x2b06ad89a161)>

44# 0x10d0e7fed659: 0x10d0e7fed659 <JSArray[4]>

             0: 0x10d0e7fed6a1 <Object map = 0x239e1ac95ff9>#68#
             1: 0x10d0e7fed6c9 <Object map = 0x239e1ac9b8a9>#69#
             2: 0x10d0e7fed751 <Object map = 0x239e1ac95ff9>#70#
             3: 0x10d0e7fed779 <Object map = 0x239e1ac95ff9>#71#

45# 0x10d0e7fed531: 0x10d0e7fed531 <JSArray[1]>

             0: 0x10d0e7fed611 <Object map = 0x239e1acbc069>#72#

46# 0x10d0e7fed371: 0x10d0e7fed371

47# 0x10d0e7fed639: 0x10d0e7fed639 <JSArray[3]>

             0: 0x2166a5e44f31 <String[14]: Ginuwine songs>
             1: 0x2166a5e44ff1 <String[12]: 2003 singles>
             2: 0x2166a5e45051 <String[25]: Songs written by Ginuwine>

48# 0x10d0e7fed3c9: 0x10d0e7fed3c9 <JSArray[0]>

49# 0x10d0e7fed3e9: 0x10d0e7fed3e9 <JSArray[0]>

50# 0x10d0e7fed409: 0x10d0e7fed409 <JSArray[2]>

             0: 0x10d0e7fed491 <Object map = 0x212720f34df1>#73#
             1: 0x10d0e7fed4e1 <Object map = 0x212720f34df1>#74#

51# 0x11d500ff35b1: 0x11d500ff35b1 <JSFunction (sfi = 0x2008bde2a49)>

52# 0x38b8e6403d41: 0x38b8e6403d41 <FixedArray[282]>

             0: 0x38b8e6404621 <JSFunction (sfi = 0xdc8db807f09)>#75#
             1: 0
             2: 0x38b8e6425729 <JSObject>#0#
             3: 0x38b8e6403d41 <FixedArray[282]>#52#
             4: 0x11d53748c0f9 <JSGlobal Object>#8#
             5: 0x11d53748c641 <FixedArray[33]>#76#
             6: 0x212720f05329 <Map(HOLEY_ELEMENTS)>#77#
             7: 0xdc8db8022d1 <undefined>
             8: 0x38b8e640a779 <JSFunction ArrayBuffer (sfi = 0xdc8db8349f9)>#78#
             9: 0x212720f02d01 <Map(HOLEY_SMI_ELEMENTS)>#79#
              ...

53# 0x11d500ff42f9: 0x11d500ff42f9
      from_api: 0x4639685e1f9 <JSFunction from_api (sfi = 0x2008bde40f9)>#80#
     plaintext: 0x4639685e241 <JSFunction plaintext (sfi = 0x2008bde41a1)>#81#
       version: 0x4639680a639 <String[5]: 2.4.0>
        custom: 0x4639685e289 <JSFunction customize (sfi = 0x2008bde4249)>#82#
         parse: 0x4639685e2d1 <JSFunction parse (sfi = 0x2008bde42f1)>#83#

54# 0x11d500ff4339: 0x11d500ff4339 <JSFunction encodeStr (sfi = 0x2008bde2b89)>

55# 0x11d500ff4381: 0x11d500ff4381 <JSFunction encodeData (sfi = 0x2008bde2c31)>

56# 0x22f939cd8799: 0x22f939cd8799 <JSFunction (sfi = 0x3a4f11ea349)>

57# 0x22f939ce6f61: 0x22f939ce6f61 <FixedArray[8]>

             0

CodeObjects (0xdeadc0de length=16): 1:0x2dc31204241 2:0x2dc312f04e1 3:0x2dc312bcee1 4:0x2dc312bcee1... magic1=deadc0de magic2=deadc0de ptr1=0xdc8db802459 ptr2=(nil) ptr3=(nil) ptr4=(nil) ptr5=(nil) ptr6=(nil) ptr7=(nil) ptr8=(nil)

Illegal instruction (core dumped)

MTKnife commented 6 years ago

Oh, I forgot to say thanks as well!

I'm not having the same problem as @e501 --I'm at 7,360,000 articles and counting right now.

MTKnife commented 6 years ago

Interestingly, the app slowed down at some point (or it's slowing down gradually): in the 16 hours I was absent from work, it got through over 7,000,000 articles. In the 5.5 hours since, it's managed to handle only 300,000. Of course, I've been using the computer in the meantime, but not for anything intensive.

e501 commented 6 years ago

When I ran the script with the previous wikipedia dump (enwiki-20171020-pages-articles.xml.bz2) and not with the Redis option, the code ran up to 9314943 articles. Seems that the "List of compositions by Franz Schubert" article is a bit long and exceeds the default memory allocation. From a quick web search, it looks like I need to figure out how to increase the available memory, such as putting "--max_old_space_size=8192 --optimize_for_size --stack_size=8192" on the command line or somehow incorporating it into the package installation.
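If I understand the node docs correctly, that would mean invoking the installed script through node directly, along the lines of the following (just a sketch; the script path is where npm put it on my machine, and the dump path is a placeholder):

  node --max_old_space_size=8192 /usr/local/lib/node_modules/wikipedia-to-mongodb/bin/wp2mongo.js ./enwiki-20171020-pages-articles.xml.bz2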

Many thanks again for providing the script. The ability to work with the JSON-formatted parsing results has already been a great help!

Looking forward to eventually having the entire dump of articles available, once we are able to work through the various details.

The following is the last few lines output to the terminal before crashing:

Akdam, Alanya 9314940 Akçatı, Alanya 9314941 Alacami, Alanya 9314942 List of compositions by Franz Schubert 9314943

<--- Last few GCs --->

[28360:0x371f240] 108080669 ms: Mark-sweep 1308.6 (1424.9) -> 1308.6 (1424.9) MB, 644.9 / 0.0 ms allocation failure GC in old space requested [28360:0x371f240] 108081313 ms: Mark-sweep 1308.6 (1424.9) -> 1308.6 (1423.9) MB, 643.7 / 0.0 ms last resort GC in old space requested [28360:0x371f240] 108081956 ms: Mark-sweep 1308.6 (1423.9) -> 1308.6 (1423.9) MB, 643.1 / 0.0 ms last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0xf789d0a5729 1: push(this=0xc613fe1f3b9 <JSArray[257594]>) 2: infobox(aka parse_recursive) [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/wtf_wikipedia/src/parse/infobox/index.js:~16] [pc=0x247a2c9f340f](this=0x235a6b03d599 ,r=0x1efbfb82bd01 ,wiki=0x36e488a82201 <Very long string[804228]>,options=0x37fc315958c1 <Object map = 0x7...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory 1: node::Abort() [node] 2: 0x11dd81c [node] 3: v8::Utils::ReportOOMFailure(char const, bool) [node] 4: v8::internal::V8::FatalProcessOutOfMemory(char const, bool) [node] 5: v8::internal::Factory::NewUninitializedFixedArray(int) [node] 6: 0xde8b7f [node] 7: 0xdfcaa5 [node] 8: v8::internal::JSObject::AddDataElement(v8::internal::Handle, unsigned int, v8::internal::Handle, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow) [node] 9: v8::internal::Object::AddDataProperty(v8::internal::LookupIterator, v8::internal::Handle, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow, v8::internal::Object::StoreFromKeyed) [node] 10: v8::internal::Object::SetProperty(v8::internal::LookupIterator, v8::internal::Handle, v8::internal::LanguageMode, v8::internal::Object::StoreFromKeyed) [node] 11: v8::internal::Runtime_SetProperty(int, v8::internal::Object*, v8::internal::Isolate) [node] 12: 0x247a0e6042fd Aborted (core dumped)

Note that I am also having some sort of Node path issue related to running with the Redis option (i.e. "--worker"). The following is printed to the terminal: module.js:544 throw err; ^

Error: Cannot find module 'commander' at Function.Module._resolveFilename (module.js:542:15) at Function.Module._load (module.js:472:25) at Module.require (module.js:585:17) at require (internal/module.js:11:18) at Object. (/home/jayson/wikipedia-to-mongodb/bin/wp2mongo.js:2:15) at Module._compile (module.js:641:30) at Object.Module._extensions..js (module.js:652:10) at Module.load (module.js:560:32) at tryModuleLoad (module.js:503:12) at Function.Module._load (module.js:495:3)

spencermountain commented 6 years ago

haha, yeah, wikipedia struggles with that one too ;)

fixed some bugs and released a new version, 2.3.0.

@e501 i couldn't reproduce any issues with those articles, but there have been some fixes in wtf_wikipedia in the latest version. (re-)doing npm install should fix the cannot-find 'commander' issue, i believe.

this supports skipping redirects and disambiguation pages.

lemme know if it helps.

MTKnife commented 6 years ago

Great, thanks!

For the record, I got a crash similar to @e501's, but in my case it got a bit further, to article 9891192, Mansarovar Park (Delhi Metro)--that's not a long article, so I'm not sure what was going on.

spencermountain commented 6 years ago

you know what, if nobody has gotten the redis-version to work, it probably makes a lot of sense to try one of these cool multi-threaded node libraries. i bet that would help this situation greatly.

i have a little bit of history using these, but the tricky part is that xml-stream runs at its own pace - so somehow we'd need to get the xml-streaming and the article-parsing running in time with each other. I'm not sure how to do this. I think the redis version writes things to a temporary mongo db? I'm actually not sure.
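one idea, if xml-stream's pause()/resume() behave the way i think they do, is to pause the stream while each page is parsed and inserted, and only resume in the callback. rough sketch (doPage stands in for the existing parse-and-insert step):

  stream.on('endElement: page', function (page) {
    stream.pause();            // stop pulling xml off the file
    doPage(page, function () {
      stream.resume();         // only read the next page once mongo is done
    });
  });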

MTKnife commented 6 years ago

OK, just started it up with this command (note that the syntax you use in the README, creating a JSON object and then adding it after the command in parentheses, doesn't work in Windows, at least not with the compiled "wp2mongo" file):

wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' --db='enwiki' --skip_redirects=true --skip_disambig=true --worker

Before doing that, I started up kue, and then, after starting wp2mongo, I started up worker.js. Never got any output in either of those two windows, except for "Running on http://127.0.0.1:3000" in the kue window.

In any case, wp2mongo crashed pretty quickly:

Playstation 16200 Pterodactylus 16201 Pterosaur 16202

<--- Last few GCs --->

[2240:000001E8B40C0AF0] 203715 ms: Mark-sweep 1403.0 (1478.1) -> 1402.9 (1475.1) MB, 703.5 / 0.0 ms (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 704 ms) last resort GC in old space requested [2240:000001E8B40C0AF0] 204407 ms: Mark-sweep 1402.9 (1475.1) -> 1403.0 (1475.1) MB, 692.4 / 0.0 ms last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0000038F45DA5EC1 0: builtin exit frame: stringify(this=0000038F45D89021 ,0000016070002311 , 0000016070002311 ,000001A2457BAE01 )

1: arguments adaptor frame: 1->3
2: update [C:\code\JavaScript\wikipedia-to-mongodb\node_modules\kue\lib\queue\job.js:~827] [pc=000003C77E9BD5FC](this=0000021750F36821 <Job map = 000003F10...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

Next, I tried it without the Redis:

wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' --db='enwiki' --skip_redirects=true --skip_disambig=true

So far, that's working, so I'm hopeful, but it won't have finished the ingest until tomorrow morning, so I'll post an update then.

BTW, what does the "db" option do? Does it in any way affect the processing time?

MTKnife commented 6 years ago

OK, it's been running since this afternoon, and it's now at over 5.7 million articles; double-checking in the MongoDB client, I see the count is the same there--so the figure is the number of actual articles, not the number before redirects and disambiguation pages have been excluded. Since the English Wikipedia currently has only just over 5.5 million articles, something is obviously wrong.

I think I typed the command-line options wrong, but I'm not sure how to do it right: the README has two hyphens in front of them, but the app's internal help specifies one hyphen--and it's unclear whether or not there are supposed to be arguments ("true") after "skip_redirects" and "skip_disambig".

I'm trying this right now:

PS C:\code\JavaScript\wikipedia-to-mongodb> wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' -skip_redirects -skip_disambig

At any rate, given that it takes hours to verify if the app is proceeding correctly, it might be useful to have it throw an error in the case of invalid options, rather than running normally with no feedback.

MTKnife commented 6 years ago

Nope, that didn't work: woke up and we were at 7 million articles.

Trying it with 2 hyphens after I get to work.

MTKnife commented 6 years ago

Ah, I see now how to check for redirects directly (no pun intended). None of these commands succeeds in skipping them:

wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' --skip_redirects --skip_disambig
wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' --skip_redirects true --skip_disambig true
wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' -skip_redirects true -skip_disambig true
wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' -skip_redirects=true -skip_disambig=true

I'm at a loss here. What do I need to type?

MTKnife commented 6 years ago

There we go....The options assignment to the commander object ("program") in wp2mongo.js needs to look like this:

program
  .usage('node index.js enwiki-latest-pages-articles.xml.bz2 [options]')
  .option('-w, --worker', 'Use worker (redis required)')
  .option('-plain, --plaintext', 'if true, store plaintext wikipedia articles')
  .option('--skip_redirects', 'if true, skips-over pages that are redirects')
  .option('--skip_disambig', 'if true, skips-over disambiguation pages')
  .parse(process.argv)

Note that, because of the way the app is structured, with the "skip" options enabled, the article numbers in the console output will no longer match the number of records in the database.

Oddly, the change doesn't seem to have made a difference in the Redis version. I can't figure out the commander syntax well enough to understand why the old version was working for the "worker" option and not for the "skip" options.

spencermountain commented 6 years ago

thanks scott. The skipping thing works in the test here, but yeah, shoulda tested the cli version too.

lemme know if this gets further without a slowdown, that would be great.

fwiw, i just downloaded a en-wikipedia dump (finally) last night, so i can start stressing it out too. cheers

e501 commented 6 years ago

Many thanks again for creating and sharing your parser. Good news is that we are making progress.

The latest script (version 2.3.0) parsed up to 9310799 articles of the enwiki-20171103-pages-articles.xml.bz2 dump. The bad news is that the "Punk rock in France" article caused a core dump (see below). The article is not very large. My machine has 64GB of RAM and lots of swap space. My thinking is that even if there is a memory leak, I could try making the JavaScript heap extremely large.

Thus, it still seems that I may need to figure out how to increase the available memory, such as putting "--max_old_space_size=8192 --optimize_for_size --stack_size=8192" on the command line or somehow incorporating it into the package installation.

The immediate goal is to be able to do a complete run through the most recent dump of Wikipedia. From tracking my mongodb log, the latest version seems to be logging the articles that may have had issues. Please confirm that is how I can go back and track down the "problem articles" that are not parsed and stored in the mongodb db. Using the mediawiki API, these articles can be addressed on a case-by-case basis.

Also, I would like to explore the possibility of working with the JavaScript code for minor tailoring and augmenting of the functionality. For example, "mediawiki template" (e.g. navigation template) parsing capabilities would be a big plus. The problem is that I already need to ramp up on PHP, while still trying to make progress with R/RStudio (and Python) packages for NLP and statistical analysis. Being new to JavaScript (and Node.js), initial web searches indicate that PhpStorm seems to be a viable IDE for both PHP and JavaScript (and Node.js). Feedback, or even a rough-draft HowTo for setting up the IDE and the edit-build-test cycle, would help scope out the viability of working more directly with the source code.

The output of the core dump follows:

Aaron Sopher 9310797 Danny Talbot 9310798 Punk rock in France 9310799

<--- Last few GCs --->

[19270:0x2a732c0] 105842469 ms: Mark-sweep 1293.2 (1449.7) -> 1293.2 (1449.7) MB, 646.8 / 0.0 ms allocation failure GC in old space requested [19270:0x2a732c0] 105843118 ms: Mark-sweep 1293.2 (1449.7) -> 1293.2 (1416.7) MB, 649.0 / 0.0 ms last resort GC in old space requested [19270:0x2a732c0] 105843773 ms: Mark-sweep 1293.2 (1416.7) -> 1293.2 (1416.7) MB, 654.9 / 0.0 ms last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x3f7f6e425729 1: f [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:~25] [pc=0x91227a48be9](this=0x187bf410c0f9 ,b=229) 2: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/lib/bzip2.js:~163] [pc=0x91215c8af3e](this=0x41292882249 ,bits=0x20d4e263cbd9 <JSFunction f (sfi = 0x11...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory 1: node::Abort() [node] 2: 0x11dd81c [node] 3: v8::Utils::ReportOOMFailure(char const, bool) [node] 4: v8::internal::V8::FatalProcessOutOfMemory(char const, bool) [node] 5: v8::internal::Factory::NewUninitializedFixedArray(int) [node] 6: 0xde8473 [node] 7: v8::internal::Runtime_GrowArrayElements(int, v8::internal::Object*, v8::internal::Isolate) [node] 8: 0x9120b2042fd Aborted (core dumped)

MTKnife commented 6 years ago

Still running, but one finding: I started it about 11am, and it was at over 6 million articles (including all articles--that's under 2.5 million inserted in the database) when I left work at 6:00pm, with articles streaming by faster than I could read the titles. When I got home at 7:00, I glanced at it and it seemed to be moving slowly, and three hours later, the articles were scrolling past noticeably more slowly, and it still hadn't hit 7 million. I haven't been monitoring it the whole time, but it seems like it might have slowed down suddenly, rather than the gradual slowing that would have been the case if the database indexing were to blame.

I apparently forgot to hit the "Comment" button and post the above, so I'll add this update: as of 2:00pm, I'm at 7.95 million, which means that it's proceeding at about the same speed it did before the "skip" options were added. I assume that it's going to crash at around 10 million, since that's what it just did for @e501, and for both of us the last time through.

I'm mystified that adding the "skip" options didn't speed it up, because that implies that there's something about the JavaScript app itself that makes it go slower for the later articles. I wonder if whatever it is is related to the crashes?

MTKnife commented 6 years ago

The crash came just now, at 9906759, at "List of United States counties and county equivalents". For the record, we got 4.1 million articles into the database before that happened.

spencermountain commented 6 years ago

hey cool, yeah we're getting pretty close. Also, it seems pretty likely there is a memory leak somewhere. it would be great if we could find the memory leak. There are some tools for identifying this in node applications. It's also good news that it's not a mongo-related slowdown. That db can handle a lot more action than this, i bet.

maybe it's a good idea to add more --skip flags, to pass-through to wtf_wikipedia. If you're not, say, using the citations, or images, we could skip those steps in the parser.

but yeah, the --max-old-space-size stuff could also get to the end. That's worth a shot. I'll try that mem-leak tool today, if i can find some time.

oh hey, didn't someone say they tried adding a condition in the parser to skip the first-n pages, but it still collapsed? I mean, it would be interesting to skip the first 4m articles, and see if you can get the last-half done on a second pass. If anything, that would isolate our mem-leak a little. ok, cheers!

spencermountain commented 6 years ago

ah, it's not a memory-leak. just tested it. I think i know what's happening.

for each page, the xml-parser isn't waiting for the mongo insert to complete before continuing. I think it just builds up pending mongo-inserts after a long time. That would make sense.

the good news is that means you guys should be able to skip n articles and complete the remaining ones, without any memory problems. I'll implement a skip-n flag now.

MTKnife commented 6 years ago

OK, thanks, that sounds good.

You should be able to tell it to wait on insertions so that we don't run into the problem in the future, right?

spencermountain commented 6 years ago

hey, so in 2.4.0 it now 'takes a break' every 30s, for 3 seconds. It ran through the simple-english wiki without any issue.

i also added the --skip_first 500 option, so you can just breeze-through the first 4m or whatever.

changed the logging a bit, so that errors can be found more easily. turn on --verbose if you want to see the articles. cheers

MTKnife commented 6 years ago

OK, I'll give it a try right now.

MTKnife commented 6 years ago

Foo. It's running, but it's not making insertions in the database.

spencermountain commented 6 years ago

did you add a skip parameter?

MTKnife commented 6 years ago

Yes, told it to skip the first 9,906,000--since it last stopped at 9,906,759, it should have started adding within 1,000 articles.

Where would the log be, BTW?

spencermountain commented 6 years ago

if you tell it to skip the first 9 million, it won't insert anything into the database until the 9,000,001st article, so you'll have to wait. if you want a log, you can just put a console.log in.

MTKnife commented 6 years ago

Ah....Yeah, I guess there's no way around that, given the size of the file. We'll see if that speeds things up. I'm not sure we need the pauses in the early stages, though--I would imagine the crashes happen when the MongoDB index gets to a certain size, which slows things down too much.

BTW, why is the number of pages different in each batch? Does it grab a fixed number of lines or something like that?

MTKnife commented 6 years ago

Actually, let me try something on my fork.

MTKnife commented 6 years ago

I modified the setInterval function to skip the pause if "i" is under 8,000,000, and then to start the pause at 1s and increase it linearly from there (since the insertion time is probably O(N), with N being the index or database size). It's working, but it occurs to me that, even with a 3s pause every 30s, the thing shouldn't be proceeding much more slowly than the old version, and some quick calculations suggest it isn't--I guess it just seems slower without the article titles whizzing by.
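Roughly what the change looks like (a sketch from memory; the exact slope and the pauseStream() call are stand-ins for what's in my fork):

  setInterval(function () {
    if (i < 8000000) {
      return;                                         // no pause for the first 8 million articles
    }
    // start at 1s and grow the pause linearly with the article count
    var pauseMs = 1000 + ((i - 8000000) / 2000000) * 1000;
    pauseStream(pauseMs);                             // however 2.4.0 suspends the stream for its break
  }, 30000);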

BTW, I can't get the "--verbose" option to work, though, looking at the code (with my zero knowledge of JavaScript), I can't see why not.

spencermountain commented 6 years ago

thanks scott. lemme know if you get to the end. I'm happy to merge a pr for more sophisticated sleeping rules.

i'm gonna try this today, to see if we can speed this bad-boy up a bit.

MTKnife commented 6 years ago

Since the impact is minor (less than 10% of running time), I'm going to wait until I'm sure it works before I push it and submit a pull request. Right now, we're at 7.1 million articles.

The workerpool library looks like a good idea, but I'm still wondering what the limiting factor actually is. The current speed is not that different from the original version, even without any parsing, which suggests that the limiting factor is the disk read. That wouldn't be in the hard sense of physical I/O, but rather a matter of the overhead associated with iterating through the articles. I'm not sure if it's relevant to this or not, but I just timed the interval between messages from "logger", and it's more like 7s rather than the 5s specified in the script. I wonder if the "setInterval" loops are themselves slowing things down?

That leaves open the question of why things slow down at around 6 million articles, and maybe it has something to do with the expanding queue as MongoDB slows down with larger and larger indices? If that's the case, the workerpool thing might not help much.

Or maybe the key is to put multiple threads on the MongoDB insertion, rather than on the parser?
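From skimming the workerpool docs, I think the shape would be roughly the following, with the parsing farmed out to a pool and the inserts kept in the main process (parseWorker.js and parseArticle are made-up names, and I have no idea how well this plays with xml-stream):

  // main process: hand each page's wikitext to a pool of parser processes
  const workerpool = require('workerpool');
  const pool = workerpool.pool(__dirname + '/parseWorker.js');

  function handlePage(page, collection, cb) {
    pool.exec('parseArticle', [page.title, page.revision.text['$text'] || ''])
      .then(function (doc) {
        collection.insert(doc, cb);   // the insert itself stays on the main process
      })
      .catch(cb);
  }

  // parseWorker.js
  const workerpool = require('workerpool');
  const wtf = require('wtf_wikipedia');

  workerpool.worker({
    parseArticle: function (title, script) {
      var doc = wtf.parse(script);
      doc.title = title;
      return doc;
    }
  });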

MTKnife commented 6 years ago

I'm close to 9.3 million this morning, so not inserting new records yet, but the modified delay code is working so far, so I'll push it and send you a pull request in a bit here.

MTKnife commented 6 years ago

OK, something's gone wonky. As I mentioned above, in the last run-through, I got to 9,906,759 articles. Once you added the skip option, I restarted and told it to skip the first 9,906,000. It passed those numbers some time yesterday, and it's over 10.1 million now--but every single insert attempt is still failing with a "duplicate key" error. I'm trying to understand how that could happen: is it possible the count now proceeds differently than it did before?

BTW, the "--verbose" option is in fact working--when I mentioned it before, I didn't realize the article names are printed only for articles that are parsed. It would be nice, though, to include the article number in each line, because the "logger" summary output gets buried in all the article names.

MTKnife commented 6 years ago

A short update: the articles in question do actually appear to be in the database already, so the "duplicate key" error is valid.

MTKnife commented 6 years ago

I think I found where the count is going off: in the first commit of Dec. 4, the check of the "ns" (namespace) field got moved from "index.js" to "parse.js"; one consequence of that move is that it now occurs after the increment of "i" (the article count variable) rather than before.

You'd think everything in the dump would have a namespace of "0" (actual content), but evidently that's not the case.
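For reference, the older index.js logic (as it shows up in the stack trace earlier in this thread) kept the increment inside the namespace check, roughly:

  // old ordering: only real articles (ns === '0') bumped the counter
  if (page.ns === '0') {
    let script = page.revision.text['$text'] || '';
    console.log(leftPad(page.title) + ' ' + i);
    ++i;
    // ...hand the article off for parsing / queueing
  }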

Anyway, a more significant problem: once the app started parsing articles rather than skipping them, things slowed to a crawl, with only 500K or so articles processed in the last ~24 hours. I wonder if the insertion errors are slowing things down?

MTKnife commented 6 years ago

Nearly two days now since we finished skipping articles, and still no new ones: we're at 11.2 million now.

I wonder what the heck all the non-"0" records in the dump are? Given the differences in numbers, there are at least 1.5 million records that weren't counted last time, or greater than 1 in every 10 records. And yet I don't see any titles scrolling by that don't look like article titles.

EDIT: Duh...the titles of the non-"0" pages don't get printed in the log. But they do contribute to the count.

MTKnife commented 6 years ago

It crashed some time this morning, at the same article as last time, "List of United States counties and county equivalents", which is 12.7 million-something under the new numbering. IIRC, the way I had it programmed, the delay would have been a little over 1.5s at that point.

The probability that it crashed on exactly the same article by chance is vanishingly small, though I can't explain why it would have crashed on another article during my first run.

spencermountain commented 6 years ago

wow. yeah that's something. thanks for sharing this. I can't figure out why that page triggers any problem - i tested it a bunch of ways just now. i'm pretty busy the next few days. lemme know if you figure something out. Perhaps you're at the end? perhaps that page is bad xml?

just now i opened the ./tests/smallwiki file, added it to the end, zipped it back up, and ran the tests, and it worked fine. sorry i can't help any further.

MTKnife commented 6 years ago

The one thing I do know is that I'm not at the end--it's only got 4.1 million records in the database, and there are supposed to be over 5.5 million articles in the English Wikipedia. Plus, in watching the articles go by, I've noticed they seem to be in the order they were originally inserted in the wiki--and the crash appears to occur during the 2013 articles.

I don't know enough about JavaScript to have any good ideas on this end :(.

spencermountain commented 6 years ago

oh, that's good to know.

what is the error when it crashes? is it still the process out of memory one?

if so, i recommend just splitting the xml file. this looks like it can split a bz2-compressed xml file.

hey, if it's actually crashing on this specific place each time, and there's some kind of queue backup, it could definitely be a few articles forward or backward from the counties-or-equivalents page. if you can find a page that throws an error, that'd be an easy fix, presumably.

MTKnife commented 6 years ago

It's still the memory error, yes.

I think if it were a queue problem, it would crash on a different article once the delay has been introduced, unless (as I think you're implying) the back-up occurs very quickly, between the last pause and the parsing of the article whose title shows up when the crash occurs--and that in turn implies that the problematic article is one that's parsed after that last pause.

It's a pain to test, though, since, even with that tool you linked, it'll take a while to split the file.