Every time I run wp2mongo (without the --worker flag), it hangs after extracting 777 articles from the (English) Wikipedia bz2 dump. When I re-run it without dropping the database, I get a bunch of duplicate insertion errors, and on examining them, it appears that it's trying to insert every article in the last batch except the last one, on Ibn al-Haytham. That means the problem must be occurring either during that insertion or somewhere between the Algiers article and the Ibn al-Haytham article.
There's nothing suspicious happening--no excessive CPU or RAM usage, and I'm not getting errors of any kind (other than the duplicate insertion errors mentioned above).
I've tried the obvious things, rebooting and reinstalling, but without any errors being thrown, I'm at a loss for what else to try.
Over the last few weeks, I have been using parsing results from a dump file that loaded approx. 5.7 million articles. For the articles that were parsed, the results have worked great with mongolite (an R package).
Yesterday, when I tried to rerun wp2mongo on the same enwiki file, the code hung at the 777th article (Ibn al-Haytham), just as described in the original post. I then tried the most recent dump of enwiki and had the same result. Rebooting and reinstalling produced the same results. Note that the afwiki dump loads into mongodb with no problems.
Also, at the wtf_wikipedia web page (https://spencermountain.github.io/wtf_wikipedia/), the "Algiers" and other preceding articles are parsed and displayed immediately, but a fetch for the "Ibn al-Haytham" article never returns.
fwiw, maybe this library should wrap wtf_wikipedia in a try/catch. It would make things slower (i think?) but these sorts of errors are probably somewhat inevitable, given the size & nature of wikipedia. any insight or thoughts about doing that?
From previous experience using JSON parsers on Wikipedia articles, extraneous characters and character combinations embedded in the articles have been an issue. My thinking is that, to avoid confusing the parser, a preprocessor may need to validate that the string/blob of characters can actually be parsed, relative to the known specification and the constraints/capabilities of the given parser. At the moment, I use an ad-hoc collection of regular expressions that strip out potentially problematic character strings.
Basically, the idea is to have something like a lint process (https://en.wikipedia.org/wiki/Lint_(software)) for flagging suspicious character combinations in the input string (i.e., the Wikipedia article). In this case, where the parser is hanging (Ibn al-Haytham), there may be an issue with the additional types of characters embedded among the more usual English-language characters.
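To sketch the idea (the patterns here are placeholders for illustration only, not my actual regex collection):

// lint-style pre-check before handing an article to the parser
var suspiciousPatterns = [
  /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/, // stray control characters
  /\{\{[^{}]*$/,                              // template opened but never closed
  /\[\[[^\[\]]*$/                             // wiki link opened but never closed
];

function preflight(wikitext) {
  var hits = suspiciousPatterns.filter(function (re) {
    return re.test(wikitext);
  });
  return { ok: hits.length === 0, hits: hits };
}

// articles that fail the pre-check could be logged and skipped (or cleaned)
// before being handed to the parser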
With the new fix, there are new recurring parsing errors, such as the following:
Julia Kristeva 10468
Error: key 0-19-518767-9}}, [https://books.google.com/books?id must not contain '.'
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27)
Juan Miro 10469
Just intonation 10470
Josephus 10471
Jan Borukowski 10472
Judy Blume 10473
Joel Marangella 10474
John Pople 10475
Jake McDuck 10476
Jerry Falwell 10477
Jebus 10478
Jay Leno 10479
Jeroboam II 10480
JTF-CNO 10481
Joan of Arc 10482
Error: key Centre Historique des Archives Nationales]], [[Paris]], AE II 2490, dated to the second half of the 15th century.
"The later, fifteenth-century manuscript of [[Charles, Duke of Orléans]] contains a miniature of Joan in armour; the face has certain characteristic features known from her contemporaries' descriptions, and the artist may have worked from indications by someone who had known her." Joan M. Edmunds, The Mission of Joan of Arc (2008), [https://books.google.com/books?id must not contain '.'
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27)
Johannes Nicolaus Brønsted 10483
Janus kinase 10484
Jacob Grimm 10485
Jamiroquai 10486
John Sutter 10487
John Adams (composer) 10488
Jon Voight 10489
Error: key 978-0-306-80900-2}}, p.236. [https://books.google.com/books?id must not contain '.'
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17)
at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18)
at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17)
at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27)
John Climacus 10490
John of the Ladder 10491
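From what I can tell, these errors come from MongoDB's rule that document keys cannot contain '.' (or begin with '$'), so some parsed text is ending up as an object key. A sketch of the kind of key cleanup the insert path would need (this is only an illustration of the idea, not the project's actual encodeData code):

// recursively replace illegal characters in object keys before insert
function sanitizeKeys(value) {
  if (Array.isArray(value)) {
    return value.map(sanitizeKeys);
  }
  if (value && typeof value === 'object') {
    var clean = {};
    Object.keys(value).forEach(function (key) {
      // MongoDB forbids '.' anywhere in a key and '$' at the start
      var safeKey = key.replace(/\./g, '_').replace(/^\$/, '_');
      clean[safeKey] = sanitizeKeys(value[key]);
    });
    return clean;
  }
  return value;
}

Running the parsed article through something like that before the insert should at least turn these hard failures into recoverable data.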
At the wtf_wikipedia web page, the parser appears to return partial results for the problematic articles; some sections seem to be missing content.
On the subject of the try/catch block: I did a quick search, and what I found suggests that what's true for Python (which I'm familiar with) is also true of JavaScript (which I'm not): unless there's very little happening inside the try/catch block, the cost is trivial when no error is thrown. Given the size and complexity of Wikipedia, especially across the multiple available languages, I think you really should have error catching, or you're likely to run into this kind of problem again. It would probably also be a good idea to produce output indicating which articles have been skipped.
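To make that concrete, here's roughly what I have in mind (a minimal sketch; safeParse is my name for it, not something in the codebase):

const wtf = require('wtf_wikipedia');

// guard the parse so one bad article can't take down the whole run
function safeParse(title, wikitext) {
  try {
    return wtf.parse(wikitext);
  } catch (err) {
    // record the skipped article so it can be re-fetched and inspected later
    console.error('skipped "' + title + '": ' + err.message);
    return null;
  }
}

Callers would then just skip (and record) any article for which safeParse returns null.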
@e501 fixed the encodings for citation keys, and @MTKnife added try/catch statements around the wikipedia parsing stuff. 2.2.0 tests passing. lemme know
I'm running it with Redis right now, and I think it's working, but I'm not entirely sure what I'm supposed to be seeing. I'm no longer getting errors (or any feedback whatsoever) when I run "worker.js", and the "wikipedia" collection in the "wikipedia_queue" database is getting steadily larger. However, I've noticed that the database is created and insertions start before I run "worker.js", which leads me to assume that "wp2mongo" is inserting unparsed articles into the db. Presumably "worker.js" then goes back, parses those articles, and modifies the records in question--so what should I look for to be sure that the records I'm seeing have been parsed?
Whoops...we just got some kind of Redis error. Here's what it looks like:
La Puente, California 71751
La Verne, California 71752
Ladera Heights, California 71753
events.js:183
throw er; // Unhandled 'error' event
^
Error: Redis connection to localhost:6379 failed - read ECONNRESET
at _errnoException (util.js:1024:11)
at TCP.onread (net.js:615:25)
Since I'm about to leave work, I'm re-running it without the Redis, and so far that's working.
UPDATE: The Redis thing appears to be some kind of issue external to wp2mongo. See this link.
Many thanks for your rapid turnaround on the most recent fixes. This evening, I was able to process up to 5508058 articles from the enwiki-20171120-pages-articles.xml.bz2 dump. Unfortunately, right after the "Love You More (Ginuwine song) 5508058" string was output to the terminal, the following core dump occurred:
[2]: parse [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/doPage.js:64] [bytecode=0x2b06ad8b2651 offset=50](this=0x11d500ff4251 #1#,options=0xdc8db802459 ,cb=0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2#) {
// stack-allocated locals
var data = 0x10d0e7fed429 #16#
// heap-allocated locals
var cb = 0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2#
// expression stack (top to bottom)
[05] : 0xdc8db802569
[04] : 0xdc8db802569
[03] : 0xdc8db802569
[02] : 0xdc8db802569
[01] : 0x11d500ff3569 <FixedArray[7]>#17#
--------- s o u r c e c o d e ---------
function parse(options, cb) {\x0a let data = wtf.parse(options.script);\x0a data = encodeData(data);\x0a data.title = options.title;\x0a data._id = encodeStr(options.title);\x0a // options.collection.update({ _id: data._id }, data, { upsert: true }, function(e) {\x0a options.collection.insert(data, function(e) {\x0a if (e) {...
}
[3]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/index.js:75] [bytecode=0x2b06ad8b1091 offset=298](this=0x4639680e1f1 #3#,page=0x2166a5e4fd81 #4#) {
// expression stack (top to bottom)
[11] : 0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2#
[10] : 0xdc8db802459
[09] : 0x11d500ff4251 #1#
[08] : 0xdc8db802569
[07] : 0xdc8db802569
[06] : 0xdc8db802569
[05] : 0xdc8db802569
[04] : 0xdc8db802569
[03] : 0xdc8db802569
[02] : 0x4639680f091 <FixedArray[8]>#18#
[01] : 0xdc8db802569
[00] : 0xdc8db802569
--------- s o u r c e c o d e ---------
function (page) {\x0a if (page.ns === '0') {\x0a let script = page.revision.text['$text'] || '';\x0a\x0a console.log(leftPad(page.title) + ' ' + i);\x0a ++i;\x0a\x0a let data = {\x0a title: page.title,\x0a script: script\x0a };\x0a\x0a if (obj.worker) {\x0a // we send job t...
}
[4]: arguments adaptor frame: 3->1 {
// actual arguments
[00] : 0x2166a5e4fd81 #4#
[01] : 0x2166a5e4fd49 #6# // not passed to callee
[02] : 0x4639680e349 #7# // not passed to callee
}
[5]: emitThree(aka emitThree) [events.js:~143] [pc=0x2dc36ad157b](this=0xdc8db8022d1 ,handler=0x4639680ef11 <JSFunction (sfi = 0x2b06ad88d501)>#5#,isFn=0xdc8db802371 ,self=0x4639680e1f1 #3#,arg1=0x2166a5e4fd81 #4#,arg2=0x2166a5e4fd49 #6#,arg3=0x4639680e349 #7#) {
// optimized frame
--------- s o u r c e c o d e ---------
function emitThree(handler, isFn, self, arg1, arg2, arg3) {\x0a if (isFn)\x0a handler.call(self, arg1, arg2, arg3);\x0a else {\x0a var len = handler.length;\x0a var listeners = arrayClone(handler, len);\x0a for (var i = 0; i < len; ++i)\x0a listeners[i].call(self, arg1, arg2, arg3);\x0a }\x0a}
}
[6]: fn [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~131] [pc=0x2dc3133b320](this=0x11d53748c0f9 #8#,element=0x2166a5e4fd81 #4#,context=0x2166a5e4fd49 #6#,trace=0x4639680e349 #7#) {
// optimized frame
--------- s o u r c e c o d e ---------
function fn(element, context, trace) {\x0a self.emit(event.name, element, context, trace);\x0a }
}
[7]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~401] [pc=0x2dc31356b67](this=0x4639680e2b9 #9#,name=0x2166a5e41669 <String[4]: page>) {
// optimized frame
--------- s o u r c e c o d e ---------
function (name) {\x0a self.emit('endElement', name);\x0a var prev = stack.pop();\x0a var element = curr.element;\x0a var text = curr.fullText;\x0a var attr = element.$;\x0a if (typeof attr !== 'object') {\x0a attr = {};\x0a }\x0a var name = element.$name;\x0a self._name = name;\x0a delete element.$;\x0a de...
}
[8]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x4639680e2b9 #9#,/ anonymous /=0x2166a5e41649 <String[10]: endElement>) {
// optimized frame
--------- s o u r c e c o d e ---------
function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...
}
[9]: arguments adaptor frame: 2->1 {
// actual arguments
[01] : 0x2166a5e41669 <String[4]: page> // not passed to callee
}
[13]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~519] [pc=0x2dc36afc747](this=0x22f939cc0011 #10#,data=0x2166a5e0a251 #11#) {
// optimized frame
--------- s o u r c e c o d e ---------
function (data) {\x0a if (self._encoding) {\x0a parseChunk(data);\x0a } else {\x0a // We can't parse when the encoding is unknown, so we'll look into\x0a // the XML declaration, if there is one. For this, we need to buffer\x0a // incoming data until a full tag is received.\x0a preludeBuffers.push(d...
}
[14]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0011 #10#,/ anonymous /=0x38b8e6434769 <String[4]: data>) {
// optimized frame
--------- s o u r c e c o d e ---------
function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...
}
[15]: arguments adaptor frame: 2->1 {
// actual arguments
[01] : 0x2166a5e0a251 #11# // not passed to callee
}
[16]: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:~64] [pc=0x2dc36ae48a3](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) {
// optimized frame
--------- s o u r c e c o d e ---------
function write(data) {\x0a //console.error('received', data.length,'bytes in', typeof data);\x0a bufferQueue.push(data);\x0a hasBytes += data.length;\x0a if (bitReader === null) {\x0a bitReader = bitIterator(function() {\x0a return bufferQueue.shift();\x0a ...
}
[17]: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:~25] [pc=0x2dc319cb8c0](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) {
// optimized frame
--------- s o u r c e c o d e ---------
function (data) {\x0a write.call(this, data)\x0a return !stream.paused\x0a }
}
[18]: ondata [_stream_readable.js:642] [bytecode=0x2b06ad89aad1 offset=30](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#) {
// stack-allocated locals
var ret = 0xdc8db8022d1
// expression stack (top to bottom)
[05] : 0x2a47dc0022b9 #12#
[04] : 0x22f939cc0011 #10#
[03] : 0xdc8db8022d1
[02] : 0x22f939cc0011 #10#
[01] : 0x2b06ad892bb9 <JSFunction stream.write (sfi = 0x2b06ad892131)>#19#
--------- s o u r c e c o d e ---------
function ondata(chunk) {\x0a debug('ondata');\x0a increasedAwaitDrain = false;\x0a var ret = dest.write(chunk);\x0a if (false === ret && !increasedAwaitDrain) {\x0a // If the user unpiped during dest.write(), it is possible\x0a // to get stuck in a permanently paused state if that write\x0a // also returne...
}
[19]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0271 #13#,/ anonymous /=0x38b8e6434769 <String[4]: data>) {
// optimized frame
--------- s o u r c e c o d e ---------
function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...
}
[20]: arguments adaptor frame: 2->1 {
// actual arguments
[01] : 0x2a47dc0022b9 #12# // not passed to callee
}
[01] : 0xdc8db8022d1
[00] : 0x38b8e643b3d9 <JSFunction emit (sfi = 0x38b8e6439ee1)>#20#
--------- s o u r c e c o d e ---------
function addChunk(stream, state, chunk, addToFront) {\x0a if (state.flowing && state.length === 0 && !state.sync) {\x0a stream.emit('data', chunk);\x0a stream.read(0);\x0a } else {\x0a // update the buffer info.\x0a state.length += state.objectMode ? 1 : chunk.length;\x0a if (addToFront)\x0a state.buffer.unshift(chunk...
}
[22]: readableAddChunk(aka readableAddChunk) [_stream_readable.js:252] [bytecode=0xfc3902f2789 offset=377](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ,addToFront=0xdc8db8023e1 ,skipChunkCheck=0xdc8db8022d1 ) {
// stack-allocated locals
var state = 0x22f939cc0359 #14#
var er = 0xdc8db8022d1
// expression stack (top to bottom)
[11] : 0xdc8db8023e1
[10] : 0x2a47dc0022b9 #12#
[09] : 0x22f939cc0359 #14#
[08] : 0x22f939cc0271 #13#
[07] : 0xdc8db8022d1
[06] : 0xdc8db8023e1
[05] : 0x2a47dc0022b9 #12#
[04] : 0x22f939cc0359 #14#
[03] : 0x22f939cc0271 #13#
[02] : 0x5b65f2b92b1 <JSFunction addChunk (sfi = 0x153d2b2e0859)>#21#
--------- s o u r c e c o d e ---------
function readableAddChunk(stream, chunk, encoding, addToFront, skipChunkCheck) {\x0a var state = stream._readableState;\x0a if (chunk === null) {\x0a state.reading = false;\x0a onEofChunk(stream, state);\x0a } else {\x0a var er;\x0a if (!skipChunkCheck)\x0a er = chunkInvalid(state, chunk);\x0a if (er) {\x0a stream.emit('error...
}
[23]: push [_stream_readable.js:209] [bytecode=0xfc3902f2481 offset=89](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ) {
// stack-allocated locals
var state = 0x22f939cc0359 #14#
var skipChunkCheck = 0xdc8db8022d1
// expression stack (top to bottom)
[13] : 0xdc8db8022d1
[12] : 0xdc8db8023e1
[11] : 0xdc8db8022d1
[10] : 0x2a47dc0022b9 #12#
[09] : 0x22f939cc0271 #13#
[08] : 0xdc8db8022d1
[07] : 0xdc8db8022d1
[06] : 0xdc8db8023e1
[05] : 0xdc8db8022d1
[04] : 0x2a47dc0022b9 #12#
[03] : 0x22f939cc0271 #13#
[02] : 0x5b65f2b9269 <JSFunction readableAddChunk (sfi = 0x153d2b2e07b1)>#22#
--------- s o u r c e c o d e ---------
function (chunk, encoding) {\x0a var state = this._readableState;\x0a var skipChunkCheck;\x0a\x0a if (!state.objectMode) {\x0a if (typeof chunk === 'string') {\x0a encoding = encoding || state.defaultEncoding;\x0a if (encoding !== state.encoding) {\x0a chunk = Buffer.from(chunk, encoding);\x0a encoding = ...
[25]: onread(aka onread) [fs.js:2095] [bytecode=0x2b06ad89a6f1 offset=122](this=0xdc8db8022d1 ,er=0xdc8db802201 ,bytesRead=65536) {
// stack-allocated locals
var b = 0x2a47dc0022b9 #12#
// expression stack (top to bottom)
[06] : 0x2a47dc0022b9 #12#
[05] : 0x22f939cc0271 #13#
[04] : 65536
[03] : 0
[02] : 0x22f939cc0271 #13#
[01] : 0x153d2b2e4b79 <JSFunction Readable.push (sfi = 0x153d2b2e1819)>#23#
--------- s o u r c e c o d e ---------
function onread(er, bytesRead) {\x0a if (er) {\x0a if (self.autoClose) {\x0a self.destroy();\x0a }\x0a self.emit('error', er);\x0a } else {\x0a var b = null;\x0a if (bytesRead > 0) {\x0a self.bytesRead += bytesRead;\x0a b = thisPool.slice(start, start + bytesRead);\x0a }\x0a\x0a self.push(b)...
}
[26]: arguments adaptor frame: 3->2 {
// actual arguments
[00] : 0xdc8db802201
[01] : 65536
[02] : 0x2a47dc0023a1 #24# // not passed to callee
}
[27]: oncomplete(aka wrapper) [fs.js:676] [bytecode=0x2b06ad89a571 offset=23](this=0x2a47dc002479 #15#,err=0xdc8db802201 ,bytesRead=65536) {
// expression stack (top to bottom)
[07] : 0x2a47dc0023a1 #24#
[06] : 65536
[05] : 0xdc8db802201
[04] : 0xdc8db8022d1
[03] : 0x2a47dc0023a1 #24#
[02] : 65536
[01] : 0xdc8db802201
[00] : 0x2a47dc002309 <JSFunction onread (sfi = 0x2b06ad897511)>#25#
--------- s o u r c e c o d e ---------
function wrapper(err, bytesRead) {\x0a // Retain a reference to buffer so that it can't be GC'ed too soon.\x0a callback && callback(err, bytesRead || 0, buffer);\x0a }
Interestingly, the app slowed down at some point (or it's slowing down gradually): in the 16 hours I was absent from work, it got through over 7,000,000 articles. In the 5.5 hours since, it's managed to handle only 300,000. Of course, I've been using the computer in the meantime, but not for anything intensive.
When I ran the script with the previous wikipedia dump (enwiki-20171020-pages-articles.xml.bz2) and without the Redis option, the code ran up to 9314943 articles. It seems that "List of compositions by Franz Schubert" is a bit long and exceeds the default memory allocation. From a quick web search, it looks like I need to figure out how to increase the available memory, such as putting "--max_old_space_size=8192 --optimize_for_size --stack_size=8192" on the command line or somehow incorporating it into the package installation.
Many thanks again for providing the script. The ability to work with the JSON-formatted parsing results has already been a great help!
Looking forward to eventually having the entire dump of articles available, once we are able to work through the various details.
The following is the last few lines output to the terminal before crashing:
Akdam, Alanya 9314940
Akçatı, Alanya 9314941
Alacami, Alanya 9314942
List of compositions by Franz Schubert 9314943
<--- Last few GCs --->
[28360:0x371f240] 108080669 ms: Mark-sweep 1308.6 (1424.9) -> 1308.6 (1424.9) MB, 644.9 / 0.0 ms allocation failure GC in old space requested
[28360:0x371f240] 108081313 ms: Mark-sweep 1308.6 (1424.9) -> 1308.6 (1423.9) MB, 643.7 / 0.0 ms last resort GC in old space requested
[28360:0x371f240] 108081956 ms: Mark-sweep 1308.6 (1423.9) -> 1308.6 (1423.9) MB, 643.1 / 0.0 ms last resort GC in old space requested
Note that I am also having some sort of node path issue related to running with the redis option (i.e. "--worker"). The following is printed to the terminal:
module.js:544
throw err;
^
Error: Cannot find module 'commander'
at Function.Module._resolveFilename (module.js:542:15)
at Function.Module._load (module.js:472:25)
at Module.require (module.js:585:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/home/jayson/wikipedia-to-mongodb/bin/wp2mongo.js:2:15)
at Module._compile (module.js:641:30)
at Object.Module._extensions..js (module.js:652:10)
at Module.load (module.js:560:32)
at tryModuleLoad (module.js:503:12)
at Function.Module._load (module.js:495:3)
haha, yeah, wikipedia struggles with that one too ;)
fixed some bugs, and released a new version, 2.3.0.
@e501 i couldn't reproduce any issues with those articles, but there have been some fixes in wtf_wikipedia in the latest version. (re-)doing npm install should fix the cannot-find 'commander' issue, i believe.
this supports skipping redirects and disambiguation pages.
For the record, I got a crash similar to @e501's, but in my case it got a bit further, to article 9891192, Mansarovar Park (Delhi Metro)--that's not a long article, so I'm not sure what was going on.
you know what, if nobody has gotten the redis-version to work, i bet it makes a lot of sense to try one of these cool multi-threaded node libraries, i bet that would help this situation greatly.
i have a little bit of history using these, but the tricky part is that xml-stream runs at its own pace - so somehow we'd need to get the xml-streaming, and article-parsing, running in time with each other. I'm not sure how to do this.
I think the redis version writes things to a temporary mongo db? I'm actually not sure.
OK, just started it up with this command (note that the syntax you use in the README, creating a JSON object and then adding it after the command in parentheses, doesn't work in Windows, at least not with the compiled "wp2mongo" file):
Before doing that, I started up kue, and then, after starting wp2mongo, I started up worker.js. Never got any output in either of those two windows, except for "Running on http://127.0.0.1:3000" in the kue window.
In any case, wp2mongo crashed pretty quickly:
Playstation 16200
Pterodactylus 16201
Pterosaur 16202
<--- Last few GCs --->
[2240:000001E8B40C0AF0] 203715 ms: Mark-sweep 1403.0 (1478.1) -> 1402.9 (1475.1) MB, 703.5 / 0.0 ms (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 704 ms) last resort GC in old space requested
[2240:000001E8B40C0AF0] 204407 ms: Mark-sweep 1402.9 (1475.1) -> 1403.0 (1475.1) MB, 692.4 / 0.0 ms last resort GC in old space requested
OK, it's been running since this afternoon, and it's now at over 5.7 million articles; double-checking in the MongoDB client, I see the count is the same there--so the figure is the number of actual articles, not the number before redirects and disambiguation pages have been excluded. Since the English Wikipedia currently has only just over 5.5 million articles, something is obviously wrong.
I think I typed the command-line options wrong, but I'm not sure how to do it right: the README puts two hyphens in front of them, while the app's internal help specifies one hyphen--and it's not clear whether there are supposed to be arguments ("true") after "skip_redirects" and "skip_disambig".
At any rate, given that it takes hours to verify if the app is proceeding correctly, it might be useful to have it throw an error in the case of invalid options, rather than running normally with no feedback.
There we go....The options assignment to the commander object ("program") in wp2mongo.js needs to look like this:
program
.usage('node index.js enwiki-latest-pages-articles.xml.bz2 [options]')
.option('-w, --worker', 'Use worker (redis required)')
.option('-plain, --plaintext', 'if true, store plaintext wikipedia articles')
.option('--skip_redirects', 'if true, skips-over pages that are redirects')
.option('--skip_disambig', 'if true, skips-over disambiguation pages')
.parse(process.argv)
Note that, because of the way the app is structured, with the "skip" options enabled, the article numbers in the console output will no longer match the number of records in the database.
Oddly, the change doesn't seem to have made a difference in the Redis version. I can't figure out the commander syntax well enough to understand why the old version was working for the "worker" option and not for the "skip" options.
Many thanks again for creating and sharing your parser. Good news is that we are making progress.
The latest script (version 2.3.0) parsed up to 9310799 articles of the enwiki-20171103-pages-articles.xml.bz2 dump. The bad news is that the "Punk rock in France" article caused a core dump (see below). The article is not very large. My machine has 64GB of RAM and lots of swap space. My thinking is that even if there is a memory leak, I could try making the javascript heap extremely large.
Thus, it still seems that I may need to figure out how to increase the available memory, such as putting "--max_old_space_size=8192 --optimize_for_size --stack_size=8192" on the command line or somehow incorporating it into the package installation.
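For reference, my plan is to pass those flags directly to node by invoking the installed script rather than the wp2mongo wrapper; the path below is a guess based on where the stack traces above point, so adjust for your install:

node --max_old_space_size=8192 --optimize_for_size --stack_size=8192 \
  /usr/local/lib/node_modules/wikipedia-to-mongodb/bin/wp2mongo.js \
  ./enwiki-20171103-pages-articles.xml.bz2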
The immediate goal is to be able to do a complete run through the most recent dump of Wikipedia. From tracking my mongodb log, the latest version seems to be logging the articles that may have had issues. Please confirm that this is how I can go back and track down the "problem articles" that were not parsed and stored in the mongodb db. Using the mediawiki API, those articles can then be addressed on a case-by-case basis.
Also, I would like to explore the possibility of working with the javascript code to do some minor tailoring and augmenting of the functionality. For example, "mediawiki template" (e.g. navigation template) parsing capabilities would be a big plus. The problem is that I already need to ramp up on PHP while still trying to make progress with R/RStudio (and Python) packages for NLP and statistical analysis. Being new to javascript (and node.js), my initial web searches indicate that PHPStorm is a viable IDE for both PHP and javascript (and node.js). Feedback, or even a rough-draft HowTo for setting up the IDE and the edit-build-test cycle, would help me scope out the viability of working more directly with the source code.
The output of the core dump follows:
Aaron Sopher 9310797
Danny Talbot 9310798
Punk rock in France 9310799
<--- Last few GCs --->
[19270:0x2a732c0] 105842469 ms: Mark-sweep 1293.2 (1449.7) -> 1293.2 (1449.7) MB, 646.8 / 0.0 ms allocation failure GC in old space requested
[19270:0x2a732c0] 105843118 ms: Mark-sweep 1293.2 (1449.7) -> 1293.2 (1416.7) MB, 649.0 / 0.0 ms last resort GC in old space requested
[19270:0x2a732c0] 105843773 ms: Mark-sweep 1293.2 (1416.7) -> 1293.2 (1416.7) MB, 654.9 / 0.0 ms last resort GC in old space requested
Still running, but one finding: I started it about 11am, and it was at over 6 million articles (counting all pages--that's under 2.5 million inserted in the database) when I left work at 6:00pm, with articles streaming by faster than I could read the titles. When I got home at 7:00, I glanced at it and it seemed to be moving slowly; three hours later, the articles were scrolling past noticeably more slowly still, and it hadn't yet hit 7 million. I haven't been monitoring it the whole time, but it seems like it might have slowed down suddenly, rather than the gradual slowing I'd expect if database indexing were to blame.
I apparently forgot to hit the "Comment" button and post the above, so I'll add this update: as of 2:00pm, I'm at 7.95 million, which means that it's proceeding at about the same speed it did before the "skip" options were added. I assume that it's going to crash at around 10 million, since that's what it just did for @e501, and for both of us the last time through.
I'm mystified that adding the "skip" options didn't speed it up, because that implies that there's something about the JavaScript app itself that makes it go slower for the later articles. I wonder if whatever it is is related to the crashes?
hey cool, yeah we're getting pretty close. Also, it seems pretty likely there is a memory leak somewhere. it would be great if we could find the memory leak. There are some tools for identifying this in node applications.
It's also good news that it's not a mongo-related slowdown. That db can handle a lot more action than this, i bet.
maybe it's a good idea to add more --skip flags, to pass-through to wtf_wikipedia. If you're not, say, using the citations, or images, we could skip those steps in the parser.
but yeah, the --max-old-space-size stuff could also get you to the end. That's worth a shot. I'll try that mem-leak tool today, if i can find some time.
oh hey, didn't someone say they tried adding a condition in the parser to skip the first-n pages, but it still collapsed? I mean, it would be interesting to skip the first 4m articles, and see if you can get the last-half done on a second pass. If anything, that would isolate our mem-leak a little.
ok, cheers!
ah, it's not a memory-leak. just tested it.
I think i know what's happening.
for each page, the xml-parser isn't waiting for the mongo insert to complete before continuing.
I think it's just building-up pending mongo-inserts, after a long time. That would make sense.
the good news is that means you guys should be able to skip n articles and complete the remaining ones, without any memory problems. I'll implement a skip-n flag now.
if you tell it to skip the first 9 million, it won't insert anything into the database until the 9000001th article, so you'll have to wait.
if you want a log, you can just put a console.log in.
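something like this is what i mean by keeping them in time - just a rough sketch, assuming xml-stream's pause()/resume() and the same callback-style mongo insert the code already uses (buildDoc is a made-up name):

// pause the xml stream while each mongo insert is in flight, so pending
// inserts can't pile up in memory while xml-stream races ahead
stream.on('endElement: page', function (page) {
  stream.pause();
  collection.insert(buildDoc(page), function (err) {
    if (err) {
      console.error(err.message);
    }
    stream.resume(); // only move on once mongo has accepted the document
  });
});

that way the stream only moves as fast as mongo can swallow documents.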
Ah....Yeah, I guess there's no way around that, given the size of the file. We'll see if that speeds things up. I'm not sure we need the pauses in the early stages, though--I would imagine the crashes happen when the MongoDB index gets to a certain size, which slows things down too much.
BTW, why is the number of pages different in each batch? Does it grab a fixed number of lines or something like that?
I modified the setInterval function to avoid the pause if "i" is under 8,000,000, then to start at 1s and increase linearly from there (since the insertion time is probably O(N), with N being the index or database size). It's working, but it occurs to me that, even with a 3s pause every 30s, the thing shouldn't be proceeding much more slowly than the old version, and some quick calculations suggest it isn't--I guess it just seems slower without the article titles whizzing by.
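For reference, the modified delay logic looks roughly like this (simplified; "stream" and "i" are the existing variables in index.js, and the exact formula is just how I have it locally):

// no pause for the first 8M pages, then pause the stream for a delay that
// grows linearly with the page count, since insertion time should be
// roughly O(N) in the size of the index
setInterval(function () {
  if (i < 8000000) {
    return;
  }
  var delayMs = 1000 * (i / 8000000); // ~1s at 8M pages, growing linearly
  stream.pause();
  setTimeout(function () {
    stream.resume();
  }, delayMs);
}, 30000); // the existing 30s interval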
BTW, I can't get the "--verbose" option to work, though, looking at the code (with my zero knowledge of JavaScript), I can't see why not.
Since the impact is minor (less than 10% of running time), I'm going to wait until I'm sure it works before I push it and submit a pull request. Right now, we're at 7.1 million articles.
The workerpool library looks like a good idea, but I'm still wondering what the limiting factor actually is. The current speed is not that different from the original version, even without any parsing, which suggests that the limiting factor is the disk read. That wouldn't be in the hard sense of physical I/O, but rather a matter of the overhead associated with iterating through the articles. I'm not sure if it's relevant to this or not, but I just timed the interval between messages from "logger", and it's more like 7s rather than the 5s specified in the script. I wonder if the "setInterval" loops are themselves slowing things up?
That leaves open the question of why things slow down at around 6 million articles, and maybe it has something to do with the expanding queue as MongoDB slows down with larger and larger indices? If that's the case, the workerpool thing might not help much.
Or maybe the key is to put multiple threads on the MongoDB insertion, rather than on the parser?
I'm close to 9.3 million this morning, so not inserting new records yet, but the modified delay code is working so far, so I'll push it and send you a pull request in a bit here.
OK, something's gone wonky. As I mentioned above, in the last run-through, I got to 9,906,759 articles. Once you added the skip option, I restarted and told it to skip the first 9,906,000. It passed those numbers some time yesterday, and it's over 10.1 million now--but every single insert attempt is still failing with a "duplicate key" error. I'm trying to understand how that could happen: is it possible the count now proceeds differently than it did before?
BTW, the "--verbose" option is in fact working--when I mentioned it before, I didn't realize article names are printed only for articles that are actually parsed. It would be nice, though, to include the article number in each line, because the "logger" summary output gets buried in all the article names.
I think I found where the count is going off: in the first commit of Dec. 4, the check of the "ns" (namespace) field got moved from "index.js" to "parse.js"; one consequence of that move is that it now occurs after the increment of "i" (the article count variable) rather than before.
You'd think everything in the dump would have a namespace of "0" (actual content), but evidently that's not the case.
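A rough reconstruction of the ordering change, just to illustrate why the counts diverge (variable and function names are my guesses, not the actual diff):

// old flow: the ns === '0' check in index.js guarded both the log line and
// ++i, so only real articles were counted
// new flow: index.js counts every <page> element and defers the ns check
// to parse.js, so redirects, talk pages, templates, etc. all bump the
// counter without ever being inserted
stream.on('endElement: page', function (page) {
  ++i;                 // now counts every page, regardless of namespace
  parsePage(page, i);  // hypothetical name; the ns === '0' filter (and the
                       // title logging) now happen downstream in parse.js
});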
Anyway, a more significant problem: once the app started parsing articles rather than skipping them, things slowed to a crawl, with only 500K or so articles processed in the last ~24 hours. I wonder if the insertion errors are slowing things down?
Nearly two days now since we finished skipping articles, and still no new ones: we're at 11.2 million now.
I wonder what the heck all the non-"0" records in the dump are? Given the differences in numbers, there are at least 1.5 million records that weren't counted last time, or greater than 1 in every 10 records. And yet I don't see any titles scrolling by that don't look like article titles.
EDIT: Duh...the titles of the non-"0" pages don't get printed in the log. But they do contribute to the count.
It crashed some time this morning, at the same article as last time, "List of United States counties and county equivalents", which is 12.7 million-something under the new numbering. IIRC, the way I had it programmed when it started, the delay would have been a little over 1.5s at that point.
The probability that it crashed on exactly the same article by chance is vanishingly small, though I can't explain why it would have crashed on another article during my first run.
wow. yeah that's something. thanks for sharing this.
I can't figure-out why that page triggers any problem. i tested it a bunch of ways just now
i'm pretty busy the next few days. lemme know if you figure something out.
Perhaps you're at the end? perhaps that page is bad-xml?
just now i opened the ./tests/smallwiki file, added it to the end, zipped it back up, and ran the tests, and it worked fine.
sorry i can't help any further.
The one thing I do know is that I'm not at the end--it's only got 4.1 million records in the database, and there are supposed to be over 5.5 million articles in the English Wikipedia. Plus, in watching the articles go by, I've noticed they seem to be in the order they were originally inserted in the wiki--and the crash appears to occur among the 2013 articles.
I don't know enough about JavaScript to have any good ideas on this end :(.
what is the error when it crashes? is it still the process out of memory one?
if so, i recommend just splitting the xml file - this looks like it can split a bz2-compressed xml file.
hey, if it's actually crashing on this specific place each time, and there's some kind of queue backup, it could definitely be a few articles forward or backward from the counties-or-equivalents page.
if you can find a page that throws an error, that'd be an easy fix presumably.
I think if it were a queue problem, it would crash on a different article once the delay has been introduced, unless (as I think you're implying) the back-up occurs very quickly, between the last pause and the parsing of the article whose title shows up when the crash occurs--and that in turn implies that the problematic article is one that's parsed after that last pause.
It's a pain to test, though, since, even with that tool you linked, it'll take a while to split the file.