spencermountain / dumpster-dive

roll a wikipedia dump into mongo

wp2mongo hangs during article insertion. #21

Closed MTKnife closed 6 years ago

MTKnife commented 6 years ago

Every time I run wp2mongo (without the --worker flag), it hangs after extracting 777 articles from the (English) Wikipedia bz2 dump. When I re-run it without dropping the database, I get a bunch of duplicate insertion errors, and, on examining them, it appears that it's trying to insert every article in the last batch except for the last one, on Ibn al-Haytham, which means the problem must be occurring either during that insertion or between the Algiers article and the Ibn al-Haytham article.

There's nothing suspicious happening--no excessive CPU or RAM usage, and I'm not getting errors of any kind (other than the duplicate insertion errors mentioned above).

I've tried the obvious things, rebooting and reinstalling, but without any errors being thrown, I'm at a loss for what else to try.

e501 commented 6 years ago

This also happened to me yesterday.

Over the last few weeks, I have been using parsing results from a dump file that loaded approx. 5.7 million articles a few weeks ago. For the articles that were parsed, the results have worked great with mongolite (R package).

Yesterday, when I tried to rerun wp2mongo for the same enwiki file, the code hangs at the 777th article (Ibn al-Haytham), just as described in the posting. I then tried the most recent dump of enwiki and had the same result. Rebooting and reinstalling produced the same results. Note that the afwiki dump loads into mongodb with no problems.

Also, at the wtf_wikipedia web page (https://spencermountain.github.io/wtf_wikipedia/), the "Algiers" and other previous articles are immediately parsed and displayed, but a fetch for the "Ibn al-Haytham" article is not responsive.

spencermountain commented 6 years ago

agh, weird. thanks, you two. it looks like there is a bug on the Ibn al-Haytham article in wtf_wikipedia. That's got to be it. I will try to fix it today.

spencermountain commented 6 years ago

fwiw, maybe this library should wrap wtf_wikipedia in a try/catch. It would make things slower (i think?) but these sorts of errors are probably somewhat inevitable, given the size & nature of wikipedia. any insight or thoughts about doing that?
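roughly what i'm picturing, around the wtf call in doPage.js (just a sketch, untested):

  let data = null;
  try {
    data = wtf.parse(options.script);
  } catch (e) {
    // skip this article instead of crashing the whole import
    console.error('skipping "' + options.title + '": ' + e.message);
    return cb(e);
  }
  // ...then carry on with the encode + mongo insert as before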

e501 commented 6 years ago

From previous experience using JSON parsers on Wikipedia articles, there has been an issue with extraneous characters and character combinations embedded in the articles. My thinking has been that, to guard against confusing the respective parser, a preprocessor may need to validate that the string/blob of characters can be parsed, relative to the known specification and associated constraints/capabilities of the given parser. At this time, I use an ad-hoc collection of regular expressions that strip out potentially problematic character strings.

Basically, the idea is to have something like a lint process (https://en.wikipedia.org/wiki/Lint_(software)) for flagging suspicious character combinations that may be in the input string (i.e. Wikipedia article). In this case where the parser is hanging (Ibn al-Haytham), there may be an issue with the additional types of characters that are embedded with the more usual English language characters.
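As a rough illustration (the two patterns below are just placeholders for my ad-hoc collection, not a vetted list):

  // pre-flight pass: strip character sequences that have confused parsers for me before
  function preflight(wikitext) {
    return wikitext
      .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, '') // control characters
      .replace(/\u00AD/g, '');                                  // soft hyphens
  }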

spencermountain commented 6 years ago

hey, fixed this Ibn al-Haytham issue in v2.1.0. can you guys check it out? cheers

e501 commented 6 years ago

Many thanks for fixing the initial problem.

With the new fix, there are new recurring parsing errors, such as the following:

Julia Kristeva 10468 Error: key 0-19-518767-9}}, [https://books.google.com/books?id must not contain '.' at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27) Juan Miro 10469 Just intonation 10470 Josephus 10471 Jan Borukowski 10472 Judy Blume 10473 Joel Marangella 10474 John Pople 10475 Jake McDuck 10476 Jerry Falwell 10477 Jebus 10478 Jay Leno 10479 Jeroboam II 10480 JTF-CNO 10481 Joan of Arc 10482 Error: key Centre Historique des Archives Nationales]], [[Paris]], AE II 2490, dated to the second half of the 15th century. "The later, fifteenth-century manuscript of [[Charles, Duke of Orléans]] contains a miniature of Joan in armour; the face has certain characteristic features known from her contemporaries' descriptions, and the artist may have worked from indications by someone who had known her." Joan M. Edmunds, The Mission of Joan of Arc (2008), [https://books.google.com/books?id must not contain '.' at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27) Johannes Nicolaus Brønsted 10483 Janus kinase 10484 Jacob Grimm 10485 Jamiroquai 10486 John Sutter 10487 John Adams (composer) 10488 Jon Voight 10489 Error: key 978-0-306-80900-2}}, p.236. 
[https://books.google.com/books?id must not contain '.' at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:753:19) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:618:17) at serializeObject (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:308:18) at serializeInto (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/parser/serializer.js:776:17) at BSON.serialize (/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/bson/lib/bson/bson.js:58:27) John Climacus 10490 John of the Ladder 10491
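For reference, those failures are the BSON serializer rejecting document keys that contain a '.', so one stopgap would be to rewrite the keys before the insert, roughly like this (sketch only; the replacement character is arbitrary):

  // recursively rewrite keys so they are legal BSON field names
  function cleanKeys(obj) {
    if (Array.isArray(obj)) {
      return obj.map(cleanKeys);
    }
    if (obj && typeof obj === 'object') {
      var out = {};
      Object.keys(obj).forEach(function (k) {
        out[k.replace(/\./g, '_')] = cleanKeys(obj[k]);
      });
      return out;
    }
    return obj;
  }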

At the wtf_wikipedia web page, the parser appears to return partial results for the problematic articles; some of the sections look to be missing content.

MTKnife commented 6 years ago

On the subject of the try/catch block, I did a quick search, and what I found suggests that what's true for Python (with which I'm familiar) is also true of JavaScript (with which I'm not familiar): unless there's very little happening in the try/catch block, the cost is trivial when there's no error. I think, given the size and complexity of Wikipedia, especially when you consider the multiple available languages, that you really should have error-catching, or you're likely to run into this kind of problem again. It would probably also be a good idea to produce an output that indicates any articles that have been skipped.

spencermountain commented 6 years ago

yeah, ok. that convinced me.

oh, @e501 i know what that is. will try to do it today.

thanks for the help!

spencermountain commented 6 years ago

@e501: fixed the encodings for citation keys, and @MTKnife: added try/catch statements around the wikipedia parsing stuff. v2.2.0, tests passing. lemme know

MTKnife commented 6 years ago

I'm running it with Redis right now, and I think it's working, but I'm not entirely sure what I'm supposed to be seeing. I'm no longer getting errors (or any feedback whatsoever) when I run "worker.js", and the "wikipedia" collection in the "wikipedia_queue" database is getting steadily larger. However, I've noticed that the database is initialized and insertions start to take place before I run "worker.js", which leads me to assume that "wp2mongo" is inserting unparsed articles into the db. Presumably "worker.js" then goes back, parses those articles, and modifies the records in question--so what should I look for to be sure that the records I'm seeing have been parsed?
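The only sanity check I can think of is to compare the total count against documents that actually contain parser output, on the assumption that parsed records carry wtf_wikipedia fields such as "sections" and "categories" (just a guess on my part):

  // in the mongo shell, against the wikipedia_queue database
  db.wikipedia.count()
  db.wikipedia.find({ sections: { $exists: true } }).count()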

Whoops...we just got some kind of Redis error. Here's what it looks like:

La Puente, California 71751 La Verne, California 71752 Ladera Heights, California 71753 events.js:183 throw er; // Unhandled 'error' event ^

Error: Redis connection to localhost:6379 failed - read ECONNRESET at _errnoException (util.js:1024:11) at TCP.onread (net.js:615:25)

Since I'm about to leave work, I'm re-running it without the Redis, and so far that's working.

UPDATE: The Redis thing appears to be some kind of issue external to wp2mongo. See this link.

e501 commented 6 years ago

Many thanks for your rapid turnaround on the most recent fixes. This evening, I was able to process up to 5508058 articles from the enwiki-20171120-pages-articles.xml.bz2 dump. Unfortunately, right after the "Love You More (Ginuwine song) 5508058" string was output to the terminal, the following core dump occurred:

Stacktrace: magic1=bbbbbbbb magic2=bbbbbbbb ptr1=0xdc8db802459 ptr2=(nil) ptr3=(nil) ptr4=(nil) ptr5=(nil) ptr6=(nil) ptr7=(nil) ptr8=(nil)

==== JS stack trace =========================================

Security context: 0x38b8e6425729 #0# 2: parse [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/doPage.js:64] [bytecode=0x2b06ad8b2651 offset=50](this=0x11d500ff4251 #1#,options=0xdc8db802459 ,cb=0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2#) 3: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/index.js:75] [bytecode=0x2b06ad8b1091 offset=298](this=0x4639680e1f1 #3#,page=0x2166a5e4fd81 #4#) 4: arguments adaptor frame: 3->1 5: emitThree(aka emitThree) [events.js:~143] [pc=0x2dc36ad157b](this=0xdc8db8022d1 ,handler=0x4639680ef11 <JSFunction (sfi = 0x2b06ad88d501)>#5#,isFn=0xdc8db802371 ,self=0x4639680e1f1 #3#,arg1=0x2166a5e4fd81 #4#,arg2=0x2166a5e4fd49 #6#,arg3=0x4639680e349 #7#) 6: fn [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~131] [pc=0x2dc3133b320](this=0x11d53748c0f9 #8#,element=0x2166a5e4fd81 #4#,context=0x2166a5e4fd49 #6#,trace=0x4639680e349 #7#) 7: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~401] [pc=0x2dc31356b67](this=0x4639680e2b9 #9#,name=0x2166a5e41669 <String[4]: page>) 8: emit [events.js:~165] [pc=0x2dc31394c09](this=0x4639680e2b9 #9#,/ anonymous /=0x2166a5e41649 <String[10]: endElement>) 9: arguments adaptor frame: 2->1 13: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~519] [pc=0x2dc36afc747](this=0x22f939cc0011 #10#,data=0x2166a5e0a251 #11#) 14: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0011 #10#,/ anonymous /=0x38b8e6434769 <String[4]: data>) 15: arguments adaptor frame: 2->1 16: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:~64] [pc=0x2dc36ae48a3](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) 17: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:~25] [pc=0x2dc319cb8c0](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) 18: ondata [_stream_readable.js:642] [bytecode=0x2b06ad89aad1 offset=30](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#) 19: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0271 #13#,/ anonymous /=0x38b8e6434769 <String[4]: data>) 20: arguments adaptor frame: 2->1 21: addChunk(aka addChunk) [_stream_readable.js:265] [bytecode=0xfc3902f3091 offset=35](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,state=0x22f939cc0359 #14#,chunk=0x2a47dc0022b9 #12#,addToFront=0xdc8db8023e1 ) 22: readableAddChunk(aka readableAddChunk) [_stream_readable.js:252] [bytecode=0xfc3902f2789 offset=377](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ,addToFront=0xdc8db8023e1 ,skipChunkCheck=0xdc8db8022d1 ) 23: push [_stream_readable.js:209] [bytecode=0xfc3902f2481 offset=89](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ) 24: arguments adaptor frame: 1->2 25: onread(aka onread) [fs.js:2095] [bytecode=0x2b06ad89a6f1 offset=122](this=0xdc8db8022d1 ,er=0xdc8db802201 ,bytesRead=65536) 26: arguments adaptor frame: 3->2 27: oncomplete(aka wrapper) [fs.js:676] [bytecode=0x2b06ad89a571 offset=23](this=0x2a47dc002479 #15#,err=0xdc8db802201 ,bytesRead=65536)

==== Details ================================================

[2]: parse [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/doPage.js:64] [bytecode=0x2b06ad8b2651 offset=50](this=0x11d500ff4251 #1#,options=0xdc8db802459 ,cb=0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2#) { // stack-allocated locals var data = 0x10d0e7fed429 #16# // heap-allocated locals var cb = 0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2# // expression stack (top to bottom) [05] : 0xdc8db802569 [04] : 0xdc8db802569 [03] : 0xdc8db802569 [02] : 0xdc8db802569 [01] : 0x11d500ff3569 <FixedArray[7]>#17# --------- s o u r c e c o d e --------- function parse(options, cb) {\x0a let data = wtf.parse(options.script);\x0a data = encodeData(data);\x0a data.title = options.title;\x0a data._id = encodeStr(options.title);\x0a // options.collection.update({ _id: data._id }, data, { upsert: true }, function(e) {\x0a options.collection.insert(data, function(e) {\x0a if (e) {...


}

[3]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/src/index.js:75] [bytecode=0x2b06ad8b1091 offset=298](this=0x4639680e1f1 #3#,page=0x2166a5e4fd81 #4#) { // expression stack (top to bottom) [11] : 0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>#2# [10] : 0xdc8db802459 [09] : 0x11d500ff4251 #1# [08] : 0xdc8db802569 [07] : 0xdc8db802569 [06] : 0xdc8db802569 [05] : 0xdc8db802569 [04] : 0xdc8db802569 [03] : 0xdc8db802569 [02] : 0x4639680f091 <FixedArray[8]>#18# [01] : 0xdc8db802569 [00] : 0xdc8db802569 --------- s o u r c e c o d e --------- function (page) {\x0a if (page.ns === '0') {\x0a let script = page.revision.text['$text'] || '';\x0a\x0a console.log(leftPad(page.title) + ' ' + i);\x0a ++i;\x0a\x0a let data = {\x0a title: page.title,\x0a script: script\x0a };\x0a\x0a if (obj.worker) {\x0a // we send job t...


}

[4]: arguments adaptor frame: 3->1 { // actual arguments [00] : 0x2166a5e4fd81 #4# [01] : 0x2166a5e4fd49 #6# // not passed to callee [02] : 0x4639680e349 #7# // not passed to callee }

[5]: emitThree(aka emitThree) [events.js:~143] [pc=0x2dc36ad157b](this=0xdc8db8022d1 ,handler=0x4639680ef11 <JSFunction (sfi = 0x2b06ad88d501)>#5#,isFn=0xdc8db802371 ,self=0x4639680e1f1 #3#,arg1=0x2166a5e4fd81 #4#,arg2=0x2166a5e4fd49 #6#,arg3=0x4639680e349 #7#) { // optimized frame --------- s o u r c e c o d e --------- function emitThree(handler, isFn, self, arg1, arg2, arg3) {\x0a if (isFn)\x0a handler.call(self, arg1, arg2, arg3);\x0a else {\x0a var len = handler.length;\x0a var listeners = arrayClone(handler, len);\x0a for (var i = 0; i < len; ++i)\x0a listeners[i].call(self, arg1, arg2, arg3);\x0a }\x0a}

} [6]: fn [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~131] [pc=0x2dc3133b320](this=0x11d53748c0f9 #8#,element=0x2166a5e4fd81 #4#,context=0x2166a5e4fd49 #6#,trace=0x4639680e349 #7#) { // optimized frame --------- s o u r c e c o d e --------- function fn(element, context, trace) {\x0a self.emit(event.name, element, context, trace);\x0a }

} [7]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~401] [pc=0x2dc31356b67](this=0x4639680e2b9 #9#,name=0x2166a5e41669 <String[4]: page>) { // optimized frame --------- s o u r c e c o d e --------- function (name) {\x0a self.emit('endElement', name);\x0a var prev = stack.pop();\x0a var element = curr.element;\x0a var text = curr.fullText;\x0a var attr = element.$;\x0a if (typeof attr !== 'object') {\x0a attr = {};\x0a }\x0a var name = element.$name;\x0a self._name = name;\x0a delete element.$;\x0a de...


} [8]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x4639680e2b9 #9#,/ anonymous /=0x2166a5e41649 <String[10]: endElement>) { // optimized frame --------- s o u r c e c o d e --------- function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...


} [9]: arguments adaptor frame: 2->1 { // actual arguments

[01] : 0x2166a5e41669 <String[4]: page> // not passed to callee }

[13]: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/xml-stream/lib/xml-stream.js:~519] [pc=0x2dc36afc747](this=0x22f939cc0011 #10#,data=0x2166a5e0a251 #11#) { // optimized frame --------- s o u r c e c o d e --------- function (data) {\x0a if (self._encoding) {\x0a parseChunk(data);\x0a } else {\x0a // We can't parse when the encoding is unknown, so we'll look into\x0a // the XML declaration, if there is one. For this, we need to buffer\x0a // incoming data until a full tag is received.\x0a preludeBuffers.push(d...


} [14]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0011 #10#,/ anonymous /=0x38b8e6434769 <String[4]: data>) { // optimized frame --------- s o u r c e c o d e --------- function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...


} [15]: arguments adaptor frame: 2->1 { // actual arguments

[01] : 0x2166a5e0a251 #11# // not passed to callee }

[16]: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:~64] [pc=0x2dc36ae48a3](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) { // optimized frame --------- s o u r c e c o d e --------- function write(data) {\x0a //console.error('received', data.length,'bytes in', typeof data);\x0a bufferQueue.push(data);\x0a hasBytes += data.length;\x0a if (bitReader === null) {\x0a bitReader = bitIterator(function() {\x0a return bufferQueue.shift();\x0a ...


} [17]: write [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/through/index.js:~25] [pc=0x2dc319cb8c0](this=0x22f939cc0011 #10#,data=0x2a47dc0022b9 #12#) { // optimized frame --------- s o u r c e c o d e --------- function (data) {\x0a write.call(this, data)\x0a return !stream.paused\x0a }

} [18]: ondata [_stream_readable.js:642] [bytecode=0x2b06ad89aad1 offset=30](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#) { // stack-allocated locals var ret = 0xdc8db8022d1 // expression stack (top to bottom) [05] : 0x2a47dc0022b9 #12# [04] : 0x22f939cc0011 #10# [03] : 0xdc8db8022d1 [02] : 0x22f939cc0011 #10# [01] : 0x2b06ad892bb9 <JSFunction stream.write (sfi = 0x2b06ad892131)>#19# --------- s o u r c e c o d e --------- function ondata(chunk) {\x0a debug('ondata');\x0a increasedAwaitDrain = false;\x0a var ret = dest.write(chunk);\x0a if (false === ret && !increasedAwaitDrain) {\x0a // If the user unpiped during dest.write(), it is possible\x0a // to get stuck in a permanently paused state if that write\x0a // also returne...


}

[19]: emit [events.js:~165] [pc=0x2dc31394c09](this=0x22f939cc0271 #13#,/ anonymous /=0x38b8e6434769 <String[4]: data>) { // optimized frame --------- s o u r c e c o d e --------- function emit(type, ...args) {\x0a let doError = (type === 'error');\x0a\x0a const events = this._events;\x0a if (events !== undefined)\x0a doError = (doError && events.error === undefined);\x0a else if (!doError)\x0a return false;\x0a\x0a const domain = this.domain;\x0a\x0a // If there is no 'error' event listener then throw.\x0a if ...


} [20]: arguments adaptor frame: 2->1 { // actual arguments

[01] : 0x2a47dc0022b9 #12# // not passed to callee }

[21]: addChunk(aka addChunk) [_stream_readable.js:265] [bytecode=0xfc3902f3091 offset=35](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,state=0x22f939cc0359 #14#,chunk=0x2a47dc0022b9 #12#,addToFront=0xdc8db8023e1 ) { // expression stack (top to bottom) [05] : 0x2a47dc0022b9 #12#

[03] : 0x22f939cc0271 #13#

[01] : 0xdc8db8022d1 [00] : 0x38b8e643b3d9 <JSFunction emit (sfi = 0x38b8e6439ee1)>#20# --------- s o u r c e c o d e --------- function addChunk(stream, state, chunk, addToFront) {\x0a if (state.flowing && state.length === 0 && !state.sync) {\x0a stream.emit('data', chunk);\x0a stream.read(0);\x0a } else {\x0a // update the buffer info.\x0a state.length += state.objectMode ? 1 : chunk.length;\x0a if (addToFront)\x0a state.buffer.unshift(chunk...


}

[22]: readableAddChunk(aka readableAddChunk) [_stream_readable.js:252] [bytecode=0xfc3902f2789 offset=377](this=0xdc8db8022d1 ,stream=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ,addToFront=0xdc8db8023e1 ,skipChunkCheck=0xdc8db8022d1 ) { // stack-allocated locals var state = 0x22f939cc0359 #14# var er = 0xdc8db8022d1 // expression stack (top to bottom) [11] : 0xdc8db8023e1 [10] : 0x2a47dc0022b9 #12# [09] : 0x22f939cc0359 #14# [08] : 0x22f939cc0271 #13# [07] : 0xdc8db8022d1 [06] : 0xdc8db8023e1 [05] : 0x2a47dc0022b9 #12# [04] : 0x22f939cc0359 #14# [03] : 0x22f939cc0271 #13# [02] : 0x5b65f2b92b1 <JSFunction addChunk (sfi = 0x153d2b2e0859)>#21# --------- s o u r c e c o d e --------- function readableAddChunk(stream, chunk, encoding, addToFront, skipChunkCheck) {\x0a var state = stream._readableState;\x0a if (chunk === null) {\x0a state.reading = false;\x0a onEofChunk(stream, state);\x0a } else {\x0a var er;\x0a if (!skipChunkCheck)\x0a er = chunkInvalid(state, chunk);\x0a if (er) {\x0a stream.emit('error...


}

[23]: push [_stream_readable.js:209] [bytecode=0xfc3902f2481 offset=89](this=0x22f939cc0271 #13#,chunk=0x2a47dc0022b9 #12#,encoding=0xdc8db8022d1 ) { // stack-allocated locals var state = 0x22f939cc0359 #14# var skipChunkCheck = 0xdc8db8022d1 // expression stack (top to bottom) [13] : 0xdc8db8022d1 [12] : 0xdc8db8023e1 [11] : 0xdc8db8022d1 [10] : 0x2a47dc0022b9 #12# [09] : 0x22f939cc0271 #13# [08] : 0xdc8db8022d1 [07] : 0xdc8db8022d1 [06] : 0xdc8db8023e1 [05] : 0xdc8db8022d1 [04] : 0x2a47dc0022b9 #12# [03] : 0x22f939cc0271 #13# [02] : 0x5b65f2b9269 <JSFunction readableAddChunk (sfi = 0x153d2b2e07b1)>#22# --------- s o u r c e c o d e --------- function (chunk, encoding) {\x0a var state = this._readableState;\x0a var skipChunkCheck;\x0a\x0a if (!state.objectMode) {\x0a if (typeof chunk === 'string') {\x0a encoding = encoding || state.defaultEncoding;\x0a if (encoding !== state.encoding) {\x0a chunk = Buffer.from(chunk, encoding);\x0a encoding = ...


}

[24]: arguments adaptor frame: 1->2 { // actual arguments [00] : 0x2a47dc0022b9 #12# }

[25]: onread(aka onread) [fs.js:2095] [bytecode=0x2b06ad89a6f1 offset=122](this=0xdc8db8022d1 ,er=0xdc8db802201 ,bytesRead=65536) { // stack-allocated locals var b = 0x2a47dc0022b9 #12# // expression stack (top to bottom) [06] : 0x2a47dc0022b9 #12# [05] : 0x22f939cc0271 #13# [04] : 65536 [03] : 0 [02] : 0x22f939cc0271 #13# [01] : 0x153d2b2e4b79 <JSFunction Readable.push (sfi = 0x153d2b2e1819)>#23# --------- s o u r c e c o d e --------- function onread(er, bytesRead) {\x0a if (er) {\x0a if (self.autoClose) {\x0a self.destroy();\x0a }\x0a self.emit('error', er);\x0a } else {\x0a var b = null;\x0a if (bytesRead > 0) {\x0a self.bytesRead += bytesRead;\x0a b = thisPool.slice(start, start + bytesRead);\x0a }\x0a\x0a self.push(b)...


}

[26]: arguments adaptor frame: 3->2 { // actual arguments [00] : 0xdc8db802201 [01] : 65536 [02] : 0x2a47dc0023a1 #24# // not passed to callee }

[27]: oncomplete(aka wrapper) [fs.js:676] [bytecode=0x2b06ad89a571 offset=23](this=0x2a47dc002479 #15#,err=0xdc8db802201 ,bytesRead=65536) { // expression stack (top to bottom) [07] : 0x2a47dc0023a1 #24# [06] : 65536 [05] : 0xdc8db802201 [04] : 0xdc8db8022d1 [03] : 0x2a47dc0023a1 #24# [02] : 65536 [01] : 0xdc8db802201 [00] : 0x2a47dc002309 <JSFunction onread (sfi = 0x2b06ad897511)>#25# --------- s o u r c e c o d e --------- function wrapper(err, bytesRead) {\x0a // Retain a reference to buffer so that it can't be GC'ed too soon.\x0a callback && callback(err, bytesRead || 0, buffer);\x0a }

}

==== Key ============================================

0# 0x38b8e6425729: 0x38b8e6425729

1# 0x11d500ff4251: 0x11d500ff4251
         parse: 0x4639685ef51 <JSFunction parse (sfi = 0x2008bde2cd9)>#26#
     plaintext: 0x4639685ef99 <JSFunction plaintext (sfi = 0x2008bde2d81)>#27#

2# 0x2166a5e41a79: 0x2166a5e41a79 <JSFunction (sfi = 0x2b06ad8b0e41)>

3# 0x4639680e1f1: 0x4639680e1f1

        domain: 0xdc8db802201 <null>
       _events: 0x4639680e669 <Object map = 0x212720f066c1>#28#
  _eventsCount: 3
 _maxListeners: 0xdc8db8022d1 <undefined>
       _stream: 0x22f939cc0011 <Stream map = 0x239e1ac98ca9>#10#
           _fa: 0x4639680e3c9 <FiniteAutomata map = 0x239e1ac999b9>#29#
    _lastState: 2
   _startState: 0x4639680e531 <Object map = 0x212720f023b9>#30#
  _finalStates: 0x4639680e681 <Object deprecated-map = 0x239e1ac99c21>#31#
     _emitData: 0xdc8db8023e1 <false>
  _bufferLevel: 0
_preserveLevel: 0

_preserveWhitespace: 0 _preserveAll: 0xdc8db802371 _collect: 0xdc8db8023e1 _parser: 0x4639680e2b9 #9# _encoding: 0x38b8e6434749 <String[4]: utf8> _encoder: 0xdc8db802201 _suspended: 0xdc8db8023e1 _name: 0x2166a5e4fd29 <String[4]: page>

4# 0x2166a5e4fd81: 0x2166a5e4fd81

5# 0x4639680ef11: 0x4639680ef11 <JSFunction (sfi = 0x2b06ad88d501)>

6# 0x2166a5e4fd49: 0x2166a5e4fd49
          page: 0x2166a5e4fd81 <Object map = 0x72dc3db49d1>#4#

7# 0x4639680e349: 0x4639680e349
              : 0x4639680e5c1 <Object map = 0x239e1ac9acf9>#32#
    /mediawiki: 0x3a904bb49a59 <Object map = 0x239e1ac9ba61>#33#

/mediawiki/siteinfo: 0x3a904bb49a89 #34# /mediawiki/siteinfo/namespaces: 0x3a904bb49aa1 #35# /mediawiki/page: 0x2166a5e4fd81 #4# /mediawiki/page/revision: 0x2166a5e50051 #36# /mediawiki/page/revision/contributor: 0x2166a5e50231 #37#

8# 0x11d53748c0f9: 0x11d53748c0f9

9# 0x4639680e2b9: 0x4639680e2b9

      encoding: 0x38b8e6434701 <String[5]: utf-8>
        parser: 0x4639680e631 <JSObject>#38#
      writable: 0xdc8db802371 <true>
      readable: 0xdc8db802371 <true>
       _events: 0x4639680e651 <Object map = 0x212720f066c1>#39#
  _eventsCount: 3
      _collect: 0xdc8db8023e1 <false>

10# 0x22f939cc0011: 0x22f939cc0011

        domain: 0xdc8db802201 <null>
       _events: 0x22f939cea359 <Object map = 0x212720f066c1>#40#
  _eventsCount: 7
 _maxListeners: 0xdc8db8022d1 <undefined>
      writable: 0xdc8db802371 <true>
      readable: 0xdc8db802371 <true>
        paused: 0xdc8db8023e1 <false>
   autoDestroy: 0xdc8db802371 <true>

11# 0x2166a5e0a251: 0x2166a5e0a251

12# 0x2a47dc0022b9: 0x2a47dc0022b9

13# 0x22f939cc0271: 0x22f939cc0271

_readableState: 0x22f939cc0359 <ReadableState map = 0x212720f40581>#14#
      readable: 0xdc8db802371 <true>
        domain: 0xdc8db802201 <null>
       _events: 0x22f939cc0341 <Object map = 0x212720f066c1>#41#
  _eventsCount: 2
 _maxListeners: 0xdc8db8022d1 <undefined>
          path: 0x87957d20ab9 <String[51]: /home/jayson/enwiki-20171120-pages-articles.xml.bz2>
            fd: 13
         flags: 0xdc8db80fa31 <String[1]: r>
          mode: 438
         start: 0xdc8db8022d1 <undefined>
           end: 0xdc8db8022d1 <undefined>
     autoClose: 0xdc8db802371 <true>
           pos: 0xdc8db8022d1 <undefined>
     bytesRead: <unboxed double> 7041646592

14# 0x22f939cc0359: 0x22f939cc0359

    objectMode: 0xdc8db8023e1 <false>
 highWaterMark: 65536
        buffer: 0x22f939ce9ee1 <BufferList map = 0x212720f3fef9>#42#
        length: 0
         pipes: 0x22f939cc0011 <Stream map = 0x239e1ac98ca9>#10#
    pipesCount: 1
       flowing: 0xdc8db802371 <true>
         ended: 0xdc8db8023e1 <false>
    endEmitted: 0xdc8db8023e1 <false>
       reading: 0xdc8db8023e1 <false>
          sync: 0xdc8db8023e1 <false>
  needReadable: 0xdc8db802371 <true>

emittedReadable: 0xdc8db8023e1 readableListening: 0xdc8db8023e1 resumeScheduled: 0xdc8db8023e1 destroyed: 0xdc8db8023e1 defaultEncoding: 0x38b8e6434749 <String[4]: utf8> awaitDrain: 0 readingMore: 0xdc8db8023e1 decoder: 0xdc8db802201 encoding: 0xdc8db802201

15# 0x2a47dc002479: 0x2a47dc002479

    oncomplete: 0x2a47dc0023f1 <JSFunction wrapper (sfi = 0x2b06ad89a161)>#43#

16# 0x10d0e7fed429: 0x10d0e7fed429
          type: 0x87957d7a1e9 <String[4]: page>
      sections: 0x10d0e7fed659 <JSArray[4]>#44#
     infoboxes: 0x10d0e7fed531 <JSArray[1]>#45#
     interwiki: 0x10d0e7fed371 <Object map = 0x212720f023b9>#46#
    categories: 0x10d0e7fed639 <JSArray[3]>#47#
        images: 0x10d0e7fed3c9 <JSArray[0]>#48#
   coordinates: 0x10d0e7fed3e9 <JSArray[0]>#49#
     citations: 0x10d0e7fed409 <JSArray[2]>#50#

page_identifier: 0xdc8db802201 lang_or_wikiid: 0xdc8db802201

17# 0x11d500ff3569: 0x11d500ff3569 <FixedArray[7]>

             0: 0x11d500ff35b1 <JSFunction (sfi = 0x2008bde2a49)>#51#
             1: 0x38b8e6403d41 <FixedArray[282]>#52#
             3: 0x38b8e6403d41 <FixedArray[282]>#52#
             4: 0x11d500ff42f9 <Object map = 0x239e1ac8ae99>#53#
             5: 0x11d500ff4339 <JSFunction encodeStr (sfi = 0x2008bde2b89)>#54#
             6: 0x11d500ff4381 <JSFunction encodeData (sfi = 0x2008bde2c31)>#55#

18# 0x4639680f091: 0x4639680f091 <FixedArray[8]>

             0: 0x22f939cd8799 <JSFunction (sfi = 0x3a4f11ea349)>#56#
             1: 0x22f939ce6f61 <FixedArray[8]>#57#
             3: 0x38b8e6403d41 <FixedArray[282]>#52#
             4: 0x22f939cec821 <Db map = 0x239e1ac8f569>#58#
             5: 0x4639680f349 <Collection map = 0x239e1ac980a1>#59#
             6: 5508059
             7: 0x4639680f3a9 <JSFunction done (sfi = 0x2b06ad88d651)>#60#

19# 0x2b06ad892bb9: 0x2b06ad892bb9 <JSFunction stream.write (sfi = 0x2b06ad892131)>

20# 0x38b8e643b3d9: 0x38b8e643b3d9 <JSFunction emit (sfi = 0x38b8e6439ee1)>

21# 0x5b65f2b92b1: 0x5b65f2b92b1 <JSFunction addChunk (sfi = 0x153d2b2e0859)>

22# 0x5b65f2b9269: 0x5b65f2b9269 <JSFunction readableAddChunk (sfi = 0x153d2b2e07b1)>

23# 0x153d2b2e4b79: 0x153d2b2e4b79 <JSFunction Readable.push (sfi = 0x153d2b2e1819)>

24# 0x2a47dc0023a1: 0x2a47dc0023a1

          used: 65536

25# 0x2a47dc002309: 0x2a47dc002309 <JSFunction onread (sfi = 0x2b06ad897511)>

26# 0x4639685ef51: 0x4639685ef51 <JSFunction parse (sfi = 0x2008bde2cd9)>

27# 0x4639685ef99: 0x4639685ef99 <JSFunction plaintext (sfi = 0x2008bde2d81)>

28# 0x4639680e669: 0x4639680e669

29# 0x4639680e3c9: 0x4639680e3c9

      _symbols: 0x4639680e4c1 <Object map = 0x239e1ac99bc9>#61#
       _states: 0x4639680e4f9 <Object map = 0x212720f023b9>#62#
_deterministic: 1
        _state: 0x2166a5e4fe09 <Object map = 0x212720f023b9>#63#
    _callbacks: 0x4639680e569 <Object map = 0x239e1ac99229>#64#
        _stack: 0x4639680e5a1 <JSArray[5]>#65#
     _stackPtr: 1

30# 0x4639680e531: 0x4639680e531

31# 0x4639680e681: 0x4639680e681
          page: 1

32# 0x4639680e5c1: 0x4639680e5c1
     mediawiki: 0x3a904bb49a59 <Object map = 0x239e1ac9ba61>#33#

33# 0x3a904bb49a59: 0x3a904bb49a59
             $: 0x3a904bb4adb9 <Object map = 0x239e1ac9aae9>#66#
         $name: 0x3a904bb4adf1 <String[9]: mediawiki>
         $text: 0x2166a5e4fce1 <String[2]:   >
      siteinfo: 0x3a904bb49a89 <Object map = 0x239e1ac9b329>#34#
          page: 0x2166a5e4fd81 <Object map = 0x72dc3db49d1>#4#

34# 0x3a904bb49a89: 0x3a904bb49a89

35# 0x3a904bb49aa1: 0x3a904bb49aa1

36# 0x2166a5e50051: 0x2166a5e50051

37# 0x2166a5e50231: 0x2166a5e50231

38# 0x4639680e631: 0x4639680e631

          emit: 0x4639680ee09 <JSBoundFunction (BoundTargetFunction 0x38b8e643b3d9)>#67#

39# 0x4639680e651: 0x4639680e651

40# 0x22f939cea359: 0x22f939cea359

41# 0x22f939cc0341: 0x22f939cc0341

42# 0x22f939ce9ee1: 0x22f939ce9ee1

          head: 0xdc8db802201 <null>
          tail: 0xdc8db802201 <null>
        length: 0

43# 0x2a47dc0023f1: 0x2a47dc0023f1 <JSFunction wrapper (sfi = 0x2b06ad89a161)>

44# 0x10d0e7fed659: 0x10d0e7fed659 <JSArray[4]>

             0: 0x10d0e7fed6a1 <Object map = 0x239e1ac95ff9>#68#
             1: 0x10d0e7fed6c9 <Object map = 0x239e1ac9b8a9>#69#
             2: 0x10d0e7fed751 <Object map = 0x239e1ac95ff9>#70#
             3: 0x10d0e7fed779 <Object map = 0x239e1ac95ff9>#71#

45# 0x10d0e7fed531: 0x10d0e7fed531 <JSArray[1]>

             0: 0x10d0e7fed611 <Object map = 0x239e1acbc069>#72#

46# 0x10d0e7fed371: 0x10d0e7fed371

47# 0x10d0e7fed639: 0x10d0e7fed639 <JSArray[3]>

             0: 0x2166a5e44f31 <String[14]: Ginuwine songs>
             1: 0x2166a5e44ff1 <String[12]: 2003 singles>
             2: 0x2166a5e45051 <String[25]: Songs written by Ginuwine>

48# 0x10d0e7fed3c9: 0x10d0e7fed3c9 <JSArray[0]>

49# 0x10d0e7fed3e9: 0x10d0e7fed3e9 <JSArray[0]>

50# 0x10d0e7fed409: 0x10d0e7fed409 <JSArray[2]>

             0: 0x10d0e7fed491 <Object map = 0x212720f34df1>#73#
             1: 0x10d0e7fed4e1 <Object map = 0x212720f34df1>#74#

51# 0x11d500ff35b1: 0x11d500ff35b1 <JSFunction (sfi = 0x2008bde2a49)>

52# 0x38b8e6403d41: 0x38b8e6403d41 <FixedArray[282]>

             0: 0x38b8e6404621 <JSFunction (sfi = 0xdc8db807f09)>#75#
             1: 0
             2: 0x38b8e6425729 <JSObject>#0#
             3: 0x38b8e6403d41 <FixedArray[282]>#52#
             4: 0x11d53748c0f9 <JSGlobal Object>#8#
             5: 0x11d53748c641 <FixedArray[33]>#76#
             6: 0x212720f05329 <Map(HOLEY_ELEMENTS)>#77#
             7: 0xdc8db8022d1 <undefined>
             8: 0x38b8e640a779 <JSFunction ArrayBuffer (sfi = 0xdc8db8349f9)>#78#
             9: 0x212720f02d01 <Map(HOLEY_SMI_ELEMENTS)>#79#
              ...

53# 0x11d500ff42f9: 0x11d500ff42f9
      from_api: 0x4639685e1f9 <JSFunction from_api (sfi = 0x2008bde40f9)>#80#
     plaintext: 0x4639685e241 <JSFunction plaintext (sfi = 0x2008bde41a1)>#81#
       version: 0x4639680a639 <String[5]: 2.4.0>
        custom: 0x4639685e289 <JSFunction customize (sfi = 0x2008bde4249)>#82#
         parse: 0x4639685e2d1 <JSFunction parse (sfi = 0x2008bde42f1)>#83#

54# 0x11d500ff4339: 0x11d500ff4339 <JSFunction encodeStr (sfi = 0x2008bde2b89)>

55# 0x11d500ff4381: 0x11d500ff4381 <JSFunction encodeData (sfi = 0x2008bde2c31)>

56# 0x22f939cd8799: 0x22f939cd8799 <JSFunction (sfi = 0x3a4f11ea349)>

57# 0x22f939ce6f61: 0x22f939ce6f61 <FixedArray[8]>

             0

CodeObjects (0xdeadc0de length=16): 1:0x2dc31204241 2:0x2dc312f04e1 3:0x2dc312bcee1 4:0x2dc312bcee1... magic1=deadc0de magic2=deadc0de ptr1=0xdc8db802459 ptr2=(nil) ptr3=(nil) ptr4=(nil) ptr5=(nil) ptr6=(nil) ptr7=(nil) ptr8=(nil)

Illegal instruction (core dumped)

MTKnife commented 6 years ago

Oh, I forgot to say thanks as well!

I'm not having the same problem as @e501 --I'm at 7,360,000 articles and counting right now.

MTKnife commented 6 years ago

Interestingly, the app slowed down at some point (or it's slowing down gradually): in the 16 hours I was absent from work, it got through over 7,000,000 articles. In the 5.5 hours since, it's managed to handle only 300,000. Of course, I've been using the computer in the meantime, but not for anything intensive.

e501 commented 6 years ago

When I ran the script with the previous wikipedia dump (enwiki-20171020-pages-articles.xml.bz2) and not with the Redis option, the code ran up to 9314943 articles. Seems that the "List of compositions by Franz Schubert" article is a bit long and exceeds the default memory allocation. From a quick web search, it looks like I need to figure out how to increase the available memory, such as putting "--max_old_space_size=8192 --optimize_for_size --stack_size=8192" on the command line or somehow incorporating it into the package installation.
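If I understand the node docs correctly, that would mean invoking the installed script through node directly, along the lines of the following (just a sketch; the script path is where npm put it on my machine, and the dump path is a placeholder):

  node --max_old_space_size=8192 /usr/local/lib/node_modules/wikipedia-to-mongodb/bin/wp2mongo.js ./enwiki-20171020-pages-articles.xml.bz2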

Many thanks again for providing the script. The ability to work with the JSON-formatted parsing results has already been a great help!

Looking forward to eventually having the entire dump of articles available, once we are able to work through the various details.

The following is the last few lines output to the terminal before crashing:

Akdam, Alanya 9314940 Akçatı, Alanya 9314941 Alacami, Alanya 9314942 List of compositions by Franz Schubert 9314943

<--- Last few GCs --->

[28360:0x371f240] 108080669 ms: Mark-sweep 1308.6 (1424.9) -> 1308.6 (1424.9) MB, 644.9 / 0.0 ms allocation failure GC in old space requested [28360:0x371f240] 108081313 ms: Mark-sweep 1308.6 (1424.9) -> 1308.6 (1423.9) MB, 643.7 / 0.0 ms last resort GC in old space requested [28360:0x371f240] 108081956 ms: Mark-sweep 1308.6 (1423.9) -> 1308.6 (1423.9) MB, 643.1 / 0.0 ms last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0xf789d0a5729 1: push(this=0xc613fe1f3b9 <JSArray[257594]>) 2: infobox(aka parse_recursive) [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/wtf_wikipedia/src/parse/infobox/index.js:~16] [pc=0x247a2c9f340f](this=0x235a6b03d599 ,r=0x1efbfb82bd01 ,wiki=0x36e488a82201 <Very long string[804228]>,options=0x37fc315958c1 <Object map = 0x7...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory 1: node::Abort() [node] 2: 0x11dd81c [node] 3: v8::Utils::ReportOOMFailure(char const, bool) [node] 4: v8::internal::V8::FatalProcessOutOfMemory(char const, bool) [node] 5: v8::internal::Factory::NewUninitializedFixedArray(int) [node] 6: 0xde8b7f [node] 7: 0xdfcaa5 [node] 8: v8::internal::JSObject::AddDataElement(v8::internal::Handle, unsigned int, v8::internal::Handle, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow) [node] 9: v8::internal::Object::AddDataProperty(v8::internal::LookupIterator, v8::internal::Handle, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow, v8::internal::Object::StoreFromKeyed) [node] 10: v8::internal::Object::SetProperty(v8::internal::LookupIterator, v8::internal::Handle, v8::internal::LanguageMode, v8::internal::Object::StoreFromKeyed) [node] 11: v8::internal::Runtime_SetProperty(int, v8::internal::Object*, v8::internal::Isolate) [node] 12: 0x247a0e6042fd Aborted (core dumped)

Note that I am also having some sort of Node path issue related to running with the Redis option (i.e. "--worker"). The following is printed to the terminal: module.js:544 throw err; ^

Error: Cannot find module 'commander' at Function.Module._resolveFilename (module.js:542:15) at Function.Module._load (module.js:472:25) at Module.require (module.js:585:17) at require (internal/module.js:11:18) at Object. (/home/jayson/wikipedia-to-mongodb/bin/wp2mongo.js:2:15) at Module._compile (module.js:641:30) at Object.Module._extensions..js (module.js:652:10) at Module.load (module.js:560:32) at tryModuleLoad (module.js:503:12) at Function.Module._load (module.js:495:3)

spencermountain commented 6 years ago

haha, yeah, wikipedia struggles with that one too ;)

fixed some bugs and released a new version, 2.3.0.

@e501 i couldn't reproduce any issues with those articles, but there have been some fixes in wtf_wikipedia in the latest version. (re-)doing npm install should fix the cannot-find 'commander' issue, i believe.

this supports skipping redirects and disambiguation pages.

lemme know if it helps.

MTKnife commented 6 years ago

Great, thanks!

For the record, I got a crash similar to @e501's, but in my case it got a bit further, to article 9891192, Mansarovar Park (Delhi Metro)--that's not a long article, so I'm not sure what was going on.

spencermountain commented 6 years ago

you know what, if nobody has gotten the redis-version to work, it probably makes a lot of sense to try one of these cool multi-threaded node libraries. i bet that would help this situation greatly.

i have a little bit of history using these, but the tricky part is that xml-stream runs at its own pace - so somehow we'd need to get the xml-streaming and the article-parsing running in time with each other. I'm not sure how to do this. I think the redis version writes things to a temporary mongo db? I'm actually not sure.
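one idea, if xml-stream's pause()/resume() behave the way i think they do, is to pause the stream while each page is parsed and inserted, and only resume in the callback. rough sketch (doPage stands in for the existing parse-and-insert step):

  stream.on('endElement: page', function (page) {
    stream.pause();            // stop pulling xml off the file
    doPage(page, function () {
      stream.resume();         // only read the next page once mongo is done
    });
  });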

MTKnife commented 6 years ago

OK, just started it up with this command (note that the syntax you use in the README, creating a JSON object and then adding it after the command in parentheses, doesn't work in Windows, at least not with the compiled "wp2mongo" file):

wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' --db='enwiki' --skip_redirects=true --skip_disambig=true --worker

Before doing that, I started up kue, and then, after starting wp2mongo, I started up worker.js. Never got any output in either of those two windows, except for "Running on http://127.0.0.1:3000" in the kue window.

In any case, wp2mongo crashed pretty quickly:

Playstation 16200 Pterodactylus 16201 Pterosaur 16202

<--- Last few GCs --->

[2240:000001E8B40C0AF0] 203715 ms: Mark-sweep 1403.0 (1478.1) -> 1402.9 (1475.1) MB, 703.5 / 0.0 ms (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 704 ms) last resort GC in old space requested [2240:000001E8B40C0AF0] 204407 ms: Mark-sweep 1402.9 (1475.1) -> 1403.0 (1475.1) MB, 692.4 / 0.0 ms last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0000038F45DA5EC1 0: builtin exit frame: stringify(this=0000038F45D89021 ,0000016070002311 , 0000016070002311 ,000001A2457BAE01 )

1: arguments adaptor frame: 1->3
2: update [C:\code\JavaScript\wikipedia-to-mongodb\node_modules\kue\lib\queue\job.js:~827] [pc=000003C77E9BD5FC](this=0000021750F36821 <Job map = 000003F10...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

Next, I tried it without the Redis:

wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' --db='enwiki' --skip_redirects=true --skip_disambig=true

So far, that's working, so I'm hopeful, but it won't have finished the ingest until tomorrow morning, so I'll post an update then.

BTW, what does the "db" option do? Does it in any way affect the processing time?

MTKnife commented 6 years ago

OK, it's been running since this afternoon, and it's now at over 5.7 million articles; double-checking in the MongoDB client, I see the count is the same there--so the figure is the number of actual articles, not the number before redirects and disambiguation pages have been excluded. Since the English Wikipedia currently has only just over 5.5 million articles, something is obviously wrong.

I think I typed the command-line options wrong, but I'm not sure how to do it right: the README has two hyphens in front of them, but the app's internal help specifies one hyphen--and it's unclear whether or not there are supposed to be arguments ("true") after "skip_redirects" and "skip_disambig".

I'm trying this right now:

PS C:\code\JavaScript\wikipedia-to-mongodb> wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' -skip_redirects -skip_disambig

At any rate, given that it takes hours to verify if the app is proceeding correctly, it might be useful to have it throw an error in the case of invalid options, rather than running normally with no feedback.

MTKnife commented 6 years ago

Nope, that didn't work: woke up and we were at 7 million articles.

Trying it with 2 hyphens after I get to work.

MTKnife commented 6 years ago

Ah, I see now how to check for redirects directly (no pun intended). None of these commands succeeds in skipping them:

wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' --skip_redirects --skip_disambig
wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' --skip_redirects true --skip_disambig true
wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' -skip_redirects true -skip_disambig true
wp2mongo 'C:/data/Wikipedia/enwiki-20171103-pages-articles.xml.bz2' -skip_redirects=true -skip_disambig=true

I'm at a loss here. What do I need to type?

MTKnife commented 6 years ago

There we go....The options assignment to the commander object ("program") in wp2mongo.js needs to look like this:

program
  .usage('node index.js enwiki-latest-pages-articles.xml.bz2 [options]')
  .option('-w, --worker', 'Use worker (redis required)')
  .option('-plain, --plaintext', 'if true, store plaintext wikipedia articles')
  .option('--skip_redirects', 'if true, skips-over pages that are redirects')
  .option('--skip_disambig', 'if true, skips-over disambiguation pages')
  .parse(process.argv)

Note that, because of the way the app is structured, with the "skip" options enabled, the article numbers in the console output will no longer match the number of records in the database.

Oddly, the change doesn't seem to have made a difference in the Redis version. I can't figure out the commander syntax well enough to understand why the old version was working for the "worker" option and not for the "skip" options.

spencermountain commented 6 years ago

thanks scott. The skipping thing works in the test here, but yeah, shoulda tested the cli version too.

lemme know if this gets further without a slowdown, that would be great.

fwiw, i just downloaded a en-wikipedia dump (finally) last night, so i can start stressing it out too. cheers

e501 commented 6 years ago

Many thanks again for creating and sharing your parser. Good news is that we are making progress.

The latest script (version 2.3.0) parsed up to 9310799 articles of the enwiki-20171103-pages-articles.xml.bz2 dump. The bad news is that the "Punk rock in France" article caused a core dump (see below). The article is not very large. My machine has 64GB of RAM and lots of swap space. My thinking is that even if there is a memory leak, I could try making the JavaScript heap extremely large.

Thus, it still seems that I may need to figure out how to increase the available memory, such as putting "--max_old_space_size=8192 --optimize_for_size --stack_size=8192" on the command line or somehow incorporating it into the package installation.

The immediate goal is to be able to do a complete run through the most recent dump of Wikipedia. From tracking my mongodb log, the latest version seems to be logging the articles that may have had issues. Please confirm that is how I can go back and track down the "problem articles" that are not parsed and stored in the mongodb db. Using the mediawiki API, these articles can be addressed on a case-by-case basis.

Also, I would like to explore the possibility of working with the JavaScript code for minor tailoring and augmenting of the functionality. For example, "mediawiki template" (e.g. navigation template) parsing capabilities would be a big plus. The problem is that I already need to ramp up on PHP, while still trying to make progress with R/RStudio (and Python) packages for NLP and statistical analysis. Being new to JavaScript (and Node.js), initial web searches indicate that PhpStorm seems to be a viable IDE for both PHP and JavaScript (and Node.js). Feedback, or even a rough-draft HowTo for setting up the IDE and the edit-build-test cycle, would help scope out the viability of working more directly with the source code.

The output of the core dump follows:

Aaron Sopher 9310797 Danny Talbot 9310798 Punk rock in France 9310799

<--- Last few GCs --->

[19270:0x2a732c0] 105842469 ms: Mark-sweep 1293.2 (1449.7) -> 1293.2 (1449.7) MB, 646.8 / 0.0 ms allocation failure GC in old space requested [19270:0x2a732c0] 105843118 ms: Mark-sweep 1293.2 (1449.7) -> 1293.2 (1416.7) MB, 649.0 / 0.0 ms last resort GC in old space requested [19270:0x2a732c0] 105843773 ms: Mark-sweep 1293.2 (1416.7) -> 1293.2 (1416.7) MB, 654.9 / 0.0 ms last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x3f7f6e425729 1: f [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/index.js:~25] [pc=0x91227a48be9](this=0x187bf410c0f9 ,b=229) 2: / anonymous / [/usr/local/lib/node_modules/wikipedia-to-mongodb/node_modules/unbzip2-stream/lib/bzip2.js:~163] [pc=0x91215c8af3e](this=0x41292882249 ,bits=0x20d4e263cbd9 <JSFunction f (sfi = 0x11...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory 1: node::Abort() [node] 2: 0x11dd81c [node] 3: v8::Utils::ReportOOMFailure(char const, bool) [node] 4: v8::internal::V8::FatalProcessOutOfMemory(char const, bool) [node] 5: v8::internal::Factory::NewUninitializedFixedArray(int) [node] 6: 0xde8473 [node] 7: v8::internal::Runtime_GrowArrayElements(int, v8::internal::Object*, v8::internal::Isolate) [node] 8: 0x9120b2042fd Aborted (core dumped)

MTKnife commented 6 years ago

Still running, but one finding: I started it about 11am, and it was at over 6 million articles (including all articles--that's under 2.5 million inserted in the database) when I left work at 6:00pm, with articles streaming by faster than I could read the titles. When I got home at 7:00, I glanced at it and it seemed to be moving slowly, and three hours later, the articles were scrolling past noticeably more slowly, and it still hadn't hit 7 million. I haven't been monitoring it the whole time, but it seems like it might have slowed down suddenly, rather than the gradual slowing that would have been the case if the database indexing were to blame.

I apparently forgot to hit the "Comment" button and post the above, so I'll add this update: as of 2:00pm, I'm at 7.95 million, which means that it's proceeding at about the same speed it did before the "skip" options were added. I assume that it's going to crash at around 10 million, since that's what it just did for @e501, and for both of us the last time through.

I'm mystified that adding the "skip" options didn't speed it up, because that implies that there's something about the JavaScript app itself that makes it go slower for the later articles. I wonder if whatever it is is related to the crashes?

MTKnife commented 6 years ago

The crash came just now, at 9906759, at "List of United States counties and county equivalents". For the record, we got 4.1 million articles into the database before that happened.

spencermountain commented 6 years ago

hey cool, yeah we're getting pretty close. Also, it seems pretty likely there is a memory leak somewhere. it would be great if we could find the memory leak. There are some tools for identifying this in node applications. It's also good news that it's not a mongo-related slowdown. That db can handle a lot more action than this, i bet.

maybe it's a good idea to add more --skip flags, to pass-through to wtf_wikipedia. If you're not, say, using the citations, or images, we could skip those steps in the parser.

but yeah, the --max-old-space-size stuff could also get to the end. That's worth a shot. I'll try that mem-leak tool today, if i can find some time.

oh hey, didn't someone say they tried adding a condition in the parser to skip the first-n pages, but it still collapsed? I mean, it would be interesting to skip the first 4m articles, and see if you can get the last-half done on a second pass. If anything, that would isolate our mem-leak a little. ok, cheers!

spencermountain commented 6 years ago

ah, it's not a memory-leak. just tested it. I think i know what's happening.

for each page, the xml-parser isn't waiting for the mongo insert to complete before continuing. I think it just builds up pending mongo-inserts after a long time. That would make sense.

the good news is that means you guys should be able to skip n articles and complete the remaining ones, without any memory problems. I'll implement a skip-n flag now.

MTKnife commented 6 years ago

OK, thanks, that sounds good.

You should be able to tell it to wait on insertions so that we don't run into the problem in the future, right?

spencermountain commented 6 years ago

hey, so in 2.4.0 it now 'takes a break' every 30s, for 3 seconds. It ran through the simple-english wiki without any issue.

i also added the --skip_first 500 option, so you can just breeze-through the first 4m or whatever.

changed the logging a bit, so that errors can be found more easily. turn on --verbose if you want to see the articles. cheers

MTKnife commented 6 years ago

OK, I'll give it a try right now.

MTKnife commented 6 years ago

Foo. It's running, but it's not making insertions in the database.

spencermountain commented 6 years ago

did you add a skip parameter?

MTKnife commented 6 years ago

Yes, told it to skip the first 9,906,000--since it last stopped at 9,906,759, it should have started adding within 1,000 articles.

Where would the log be, BTW?

spencermountain commented 6 years ago

if you tell it to skip the first 9 million, it won't insert anything into the database until the 9,000,001st article, so you'll have to wait. if you want a log, you can just put a console.log in.

MTKnife commented 6 years ago

Ah....Yeah, I guess there's no way around that, given the size of the file. We'll see if that speeds things up. I'm not sure we need the pauses in the early stages, though--I would imagine the crashes happen when the MongoDB index gets to a certain size, which slows things down too much.

BTW, why is the number of pages different in each batch? Does it grab a fixed number of lines or something like that?

MTKnife commented 6 years ago

Actually, let me try something on my fork.

MTKnife commented 6 years ago

I modified the setInterval function to skip the pause if "i" is under 8,000,000, and then to start the pause at 1s and increase it linearly from there (since the insertion time is probably O(N), with N being the index or database size). It's working, but it occurs to me that, even with a 3s pause every 30s, the thing shouldn't be proceeding much more slowly than the old version, and some quick calculations suggest it isn't--I guess it just seems slower without the article titles whizzing by.
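Roughly what the change looks like (a sketch from memory; the exact slope and the pauseStream() call are stand-ins for what's in my fork):

  setInterval(function () {
    if (i < 8000000) {
      return;                                         // no pause for the first 8 million articles
    }
    // start at 1s and grow the pause linearly with the article count
    var pauseMs = 1000 + ((i - 8000000) / 2000000) * 1000;
    pauseStream(pauseMs);                             // however 2.4.0 suspends the stream for its break
  }, 30000);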

BTW, I can't get the "--verbose" option to work, though, looking at the code (with my zero knowledge of JavaScript), I can't see why not.

spencermountain commented 6 years ago

thanks scott. lemme know if you get to the end. I'm happy to merge a pr for more sophisticated sleeping rules.

i'm gonna try this today, to see if we can speed this bad-boy up a bit.

MTKnife commented 6 years ago

Since the impact is minor (less than 10% of running time), I'm going to wait until I'm sure it works before I push it and submit a pull request. Right now, we're at 7.1 million articles.

The workerpool library looks like a good idea, but I'm still wondering what the limiting factor actually is. The current speed is not that different from the original version, even without any parsing, which suggests that the limiting factor is the disk read. That wouldn't be in the hard sense of physical I/O, but rather a matter of the overhead associated with iterating through the articles. I'm not sure if it's relevant to this or not, but I just timed the interval between messages from "logger", and it's more like 7s rather than the 5s specified in the script. I wonder if the "setInterval" loops are themselves slowing things down?

That leaves open the question of why things slow down at around 6 million articles, and maybe it has something to do with the expanding queue as MongoDB slows down with larger and larger indices? If that's the case, the workerpool thing might not help much.

Or maybe the key is to put multiple threads on the MongoDB insertion, rather than on the parser?
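From skimming the workerpool docs, I think the shape would be roughly the following, with the parsing farmed out to a pool and the inserts kept in the main process (parseWorker.js and parseArticle are made-up names, and I have no idea how well this plays with xml-stream):

  // main process: hand each page's wikitext to a pool of parser processes
  const workerpool = require('workerpool');
  const pool = workerpool.pool(__dirname + '/parseWorker.js');

  function handlePage(page, collection, cb) {
    pool.exec('parseArticle', [page.title, page.revision.text['$text'] || ''])
      .then(function (doc) {
        collection.insert(doc, cb);   // the insert itself stays on the main process
      })
      .catch(cb);
  }

  // parseWorker.js
  const workerpool = require('workerpool');
  const wtf = require('wtf_wikipedia');

  workerpool.worker({
    parseArticle: function (title, script) {
      var doc = wtf.parse(script);
      doc.title = title;
      return doc;
    }
  });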

MTKnife commented 6 years ago

I'm close to 9.3 million this morning, so not inserting new records yet, but the modified delay code is working so far, so I'll push it and send you a pull request in a bit here.

MTKnife commented 6 years ago

OK, something's gone wonky. As I mentioned above, in the last run-through, I got to 9,906,759 articles. Once you added the skip option, I restarted and told it to skip the first 9,906,000. It passed those numbers some time yesterday, and it's over 10.1 million now--but every single insert attempt is still failing with a "duplicate key" error. I'm trying to understand how that could happen: is it possible the count now proceeds differently than it did before?

BTW, the "--verbose" option is in fact working--when I mentioned it before, I didn't realize the article names are printed only for articles that are parsed. It would be nice, though, to include the article number in each line, because the "logger" summary output gets buried in all the article names.

MTKnife commented 6 years ago

A short update: the articles in question do actually appear to be in the database already, so the "duplicate key" error is valid.

MTKnife commented 6 years ago

I think I found where the count is going off: in the first commit of Dec. 4, the check of the "ns" (namespace) field got moved from "index.js" to "parse.js"; one consequence of that move is that it now occurs after the increment of "i" (the article count variable) rather than before.

You'd think everything in the dump would have a namespace of "0" (actual content), but evidently that's not the case.
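For reference, the older index.js logic (as it shows up in the stack trace earlier in this thread) kept the increment inside the namespace check, roughly:

  // old ordering: only real articles (ns === '0') bumped the counter
  if (page.ns === '0') {
    let script = page.revision.text['$text'] || '';
    console.log(leftPad(page.title) + ' ' + i);
    ++i;
    // ...hand the article off for parsing / queueing
  }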

Anyway, a more significant problem: once the app started parsing articles rather than skipping them, things slowed to a crawl, with only 500K or so articles processed in the last ~24 hours. I wonder if the insertion errors are slowing things down?

MTKnife commented 6 years ago

Nearly two days now since we finished skipping articles, and still no new ones: we're at 11.2 million now.

I wonder what the heck all the non-"0" records in the dump are? Given the differences in numbers, there are at least 1.5 million records that weren't counted last time, or greater than 1 in every 10 records. And yet I don't see any titles scrolling by that don't look like article titles.

EDIT: Duh...the titles of the non-"0" pages don't get printed in the log. But they do contribute to the count.

MTKnife commented 6 years ago

It crashed some time this morning, at the same article as last time, "List of United States counties and county equivalents", which is 12.7 million-something under the new numbering. IIRC, the way I had it programmed, the delay would have been a little over 1.5s at that point.

The probability that it crashed on exactly the same article by chance is vanishingly small, though I can't explain why it would have crashed on another article during my first run.

spencermountain commented 6 years ago

wow. yeah that's something. thanks for sharing this. I can't figure out why that page triggers any problem - i tested it a bunch of ways just now. i'm pretty busy the next few days. lemme know if you figure something out. Perhaps you're at the end? perhaps that page is bad xml?

just now i opened the ./tests/smallwiki file, added it to the end, zipped it back up, and ran the tests, and it worked fine. sorry i can't help any further.

MTKnife commented 6 years ago

The one thing I do know is that I'm not at the end--it's only got 4.1 million records in the database, and there are supposed to be over 5.5 million articles in the English Wikipedia. Plus, in watching the articles go by, I've noticed they seem to be in the order they were originally inserted in the wiki--and the crash appears to occur during the 2013 articles.

I don't know enough about JavaScript to have any good ideas on this end :(.

spencermountain commented 6 years ago

oh, that's good to know.

what is the error when it crashes? is it still the process out of memory one?

if so, i recommend just splitting the xml file. this looks like it can split a bz2-compressed xml file.

hey, if it's actually crashing on this specific place each time, and there's some kind of queue backup, it could definitely be a few articles forward or backward from the counties-or-equivalents page. if you can find a page that throws an error, that'd be an easy fix, presumably.

MTKnife commented 6 years ago

It's still the memory error, yes.

I think if it were a queue problem, it would crash on a different article once the delay has been introduced, unless (as I think you're implying) the back-up occurs very quickly, between the last pause and the parsing of the article whose title shows up when the crash occurs--and that in turn implies that the problematic article is one that's parsed after that last pause.

It's a pain to test, though, since, even with that tool you linked, it'll take a while to split the file.