spencermountain / dumpster-dive

roll a wikipedia dump into mongo

Great progress ! ... console is outputting "Error: Cannot find module '../../infobox/infobox' ..." #36

Closed e501 closed 6 years ago

e501 commented 6 years ago

Glad to see such great progress !

I downloaded the latest english Wikipedia dump (enwiki-latest-pages-articles.xml.bz2), as documented in the readme. Extracted via the archive manager (OS: Ubuntu 16.04).

Now loading into mongodb via command line (dumpster ./enwiki-latest-pages-articles.xml)

As the script is running, an error message is repeatedly displayed: "Error: Cannot find module '../../infobox/infobox' ..."

Within mongo shell, the enwiki database is not yet visible.

I need to know whether I should terminate this run (e.g. Ctrl-C) and restart it with the infobox option enabled. Ideally, I would like to have the infobox pages alongside the other articles in mongo.

Greatly appreciate the continued progress with enabling this type of capability !

e501 commented 6 years ago

Sample output from the terminal console, with the last line due to program termination (i.e. Ctrl-C):

Error: Cannot find module '../../infobox/infobox'
    at Function.Module._resolveFilename (module.js:542:15)
    at Function.Module._load (module.js:472:25)
    at Module.require (module.js:585:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/document/templates/index.js:6:17)
    at Module._compile (module.js:641:30)
    at Object.Module._extensions..js (module.js:652:10)
    at Module.load (module.js:560:32)
    at tryModuleLoad (module.js:503:12)
    at Function.Module._load (module.js:495:3)
module.js:544
    throw err;
    ^

Error: Cannot find module '../../infobox/infobox'
    at Function.Module._resolveFilename (module.js:542:15)
    at Function.Module._load (module.js:472:25)
    at Module.require (module.js:585:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/document/templates/index.js:6:17)
    at Module._compile (module.js:641:30)
    at Object.Module._extensions..js (module.js:652:10)
    at Module.load (module.js:560:32)
    at tryModuleLoad (module.js:503:12)
    at Function.Module._load (module.js:495:3)
^C one sec, cleaning-up the workers...

spencermountain commented 6 years ago

hey thanks e501, which node version are you running? I suspect you'll need node 6+, I should mention that in the docs. if that's not the issue, I'll try to reproduce it

e501 commented 6 years ago

When I checked my mongodb configuration, I decided to upgrade to the default 3.6 configuration for Ubuntu 16.04. Thus, I completely uninstalled the older version and did a fresh install of the MongoDB 3.6 Community edition.

Unfortunately, I reproduced the same errors as before. The following is the output of the log file for the fresh install of mongodb:

2018-04-28T10:29:20.847-0700 I STORAGE [initandlisten] createCollection: admin.system.version with provided UUID: 5db1afbd-0191-49f2-b000-a51d94e85c04
2018-04-28T10:29:21.115-0700 I COMMAND [initandlisten] setting featureCompatibilityVersion to 3.6
2018-04-28T10:29:21.117-0700 I STORAGE [initandlisten] createCollection: local.startup_log with generated UUID: d31624be-d30c-450a-9e2c-9ddb6ea8cabd
2018-04-28T10:29:21.243-0700 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/var/lib/mongodb/diagnostic.data'
2018-04-28T10:29:21.243-0700 I NETWORK [initandlisten] waiting for connections on port 27017
2018-04-28T10:34:21.243-0700 I STORAGE [thread1] createCollection: config.system.sessions with generated UUID: 80bf0be9-7471-4f42-9e78-303babd3b85d
2018-04-28T10:34:21.513-0700 I INDEX [thread1] build index on: config.system.sessions properties: { v: 2, key: { lastUse: 1 }, name: "lsidTTLIndex", ns: "config.system.sessions", expireAfterSeconds: 1800 }
2018-04-28T10:34:21.513-0700 I INDEX [thread1] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2018-04-28T10:34:21.514-0700 I INDEX [thread1] build index done. scanned 0 total records. 0 secs
2018-04-28T10:34:21.515-0700 I COMMAND [thread1] command config.$cmd command: createIndexes { createIndexes: "system.sessions", indexes: [ { key: { lastUse: 1 }, name: "lsidTTLIndex", expireAfterSeconds: 1800 } ], $db: "config" } numYields:0 reslen:98 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_msg 271ms
2018-04-28T10:38:23.220-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54510 #1 (1 connection now open)
2018-04-28T10:38:23.223-0700 I NETWORK [conn1] received client metadata from 127.0.0.1:54510 conn1: { driver: { name: "nodejs", version: "2.2.33" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.9.9-040909-generic" }, platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17" }
2018-04-28T10:38:23.229-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54512 #2 (2 connections now open)
2018-04-28T10:38:23.229-0700 I NETWORK [conn2] received client metadata from 127.0.0.1:54512 conn2: { driver: { name: "nodejs", version: "2.2.33" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.9.9-040909-generic" }, platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17" }
2018-04-28T10:39:38.320-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54516 #3 (3 connections now open)

The node version reported by mongodb (platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17") agrees with what I checked via the command line. Also, it looks like your code is making connections to mongodb.

In the error output to the console, there is a peculiar-looking double dot in the following: "Object.Module._extensions..js (module.js:652:10)". I am new to javascript, so I am not sure how best to interpret it.

If possible, I would like to do a local build from source and use that as a baseline to help me learn how to dig into these sorts of issues.

Many thanks for helping track down why the code does not want to run properly.

From an initial pass at a local build, it looks like there may be an issue with the loader resolving the "infobox" module path inside the wtf_wikipedia package.

spencermountain commented 6 years ago

hey, think i figured out this ../infobox error - the path was titlecased, and some operating systems (with case-insensitive filesystems) tolerate that while others don't.
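fwiw, the quickest way to confirm a case mismatch is to look at what's actually on disk - a rough sketch, with the directory path guessed from the stack trace above:

    // list what casing the installed wtf_wikipedia files actually use
    // (path below is inferred from the stack trace - adjust for your install)
    const fs = require('fs')
    const src = '/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src'
    if (fs.existsSync(src)) {
      // on a case-sensitive filesystem (typical linux), 'Infobox' and 'infobox' are different names,
      // so require('../../infobox/infobox') against a titlecased folder throws MODULE_NOT_FOUND
      console.log(fs.readdirSync(src))
    }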

yeah, mongo 3 is correct. added to the docs.

can you try v3.0.1? cheers

e501 commented 6 years ago

Very nice !

afwiki executed with no problems and loaded the pages in mongodb

When I ran dumpster (v3.0.2) for enwiki, the following was the initial console output (which included a "JavaScript heap out of memory" error):

         ----------
          oh hi 👋

        total file size: 62.2 GB
        4 cpu cores detected.
        - each worker will be given: 15.5 GB -
         ----------

- wrote 1,000 pages  - 14.2s   (worker #23291)

page: #952 - "Alexander of Greece (disambiguation)"
page: #952 - "Alexander of Greece (disambiguation)"

<--- Last few GCs --->

[23296:0x3bb31b0] 49449 ms: Mark-sweep 1200.4 (1434.3) -> 1200.3 (1434.3) MB, 314.3 / 0.1 ms allocation failure GC in old space requested
[23296:0x3bb31b0] 49698 ms: Mark-sweep 1200.3 (1434.3) -> 1200.3 (1413.3) MB, 249.1 / 0.0 ms last resort GC in old space requested
[23296:0x3bb31b0] 49966 ms: Mark-sweep 1200.3 (1413.3) -> 1200.3 (1407.3) MB, 267.8 / 0.0 ms last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

0: ExitFrame [pc: 0xe810118427d]

Security context: 0x30dfecda06a9
    1: _send [internal/child_process.js:701] [bytecode=0x11011ea4e6a1 offset=606](this=0x1f8536184221, message=0x2954a760fe99, handle=0x34f4dc0822e1, options=0x2954a760ff41, callback=0x34f4dc0822e1)
    2: send [internal/child_process.js:611] [bytecode...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/usr/local/bin/node]
 2: 0x87b56c [/usr/local/bin/node]
 3: v8::Utils::ReportOOMFailure(char const, bool) [/usr/local/bin/node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const, bool) [/usr/local/bin/node]
 5: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/usr/local/bin/node]
 6: v8::internal::String::SlowFlatten(v8::internal::Handle, v8::internal::PretenureFlag) [/usr/local/bin/node]
 7: v8::internal::String::Flatten(v8::internal::Handle, v8::internal::PretenureFlag) [/usr/local/bin/node]
 8: v8::String::WriteUtf8(char, int, int, int) const [/usr/local/bin/node]
 9: node::StringBytes::Write(v8::Isolate, char, unsigned long, v8::Local, node::encoding, int) [/usr/local/bin/node]
10: int node::StreamBase::WriteString<(node::encoding)1>(v8::FunctionCallbackInfo const&) [/usr/local/bin/node]
11: void node::StreamBase::JSMethod<node::LibuvStreamWrap, &(int node::StreamBase::WriteString<(node::encoding)1>(v8::FunctionCallbackInfo const&))>(v8::FunctionCallbackInfo const&) [/usr/local/bin/node]
12: v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo) [/usr/local/bin/node]
13: 0xad62fa [/usr/local/bin/node]
14: v8::internal::Builtin_HandleApiCall(int, v8::internal::Object*, v8::internal::Isolate) [/usr/local/bin/node]
15: 0xe810118427d

page: #2,826 - "Calvin Coolidge"
page: #2,826 - "Calvin Coolidge"

  • wrote 4,000 pages - 14.3s (worker #23291) page: #3,746 - "Dartmoor wildlife"
  • wrote 5,000 pages - 11.6s (worker #23291)
e501 commented 6 years ago

... over a million pages and mongo reports "enwiki 9.845GB"

At the console, after reporting that "a worker has finished", the same message keeps getting printed, as shown in the following:

page: #1,188,005 - "Steelyard balance"

  • wrote 1,251,000 pages - 2.1s (worker #23291)
  • wrote 1,251,113 pages - 0.2s (worker #23291)

    💪 a worker has finished 💪

    • 3 workers still running -

    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"

e501 commented 6 years ago

Terminated the initial run (as discussed previously) and re-ran the code. Note that Wikipedia reports a total of 5,631,118 content pages (https://en.wikipedia.org/wiki/Special:Statistics)

The code ran to completion with the following console messages:

- wrote 1,251,000 pages  - 0.1s   (worker #24376)
- wrote 1,251,113 pages  - 0.3s   (worker #24376)
- wrote 1,251,113 pages  - 0.0s   (worker #24376)

💪  a worker has finished 💪
  - 3 workers still running -

💪  a worker has finished 💪
  - 2 workers still running -

- wrote 1,242,000 pages  - 4.8s   (worker #24364)
- wrote 1,242,000 pages  - 0.1s   (worker #24364)

page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,243,000 pages - 4.2s (worker #24364)
  • wrote 1,243,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,244,000 pages - 4.0s (worker #24364)
  • wrote 1,244,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,245,000 pages - 4.0s (worker #24364)
  • wrote 1,245,000 pages - 0.1s (worker #24364)
  • wrote 1,246,000 pages - 3.7s (worker #24364)
  • wrote 1,246,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,247,000 pages - 3.9s (worker #24364)
  • wrote 1,247,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,248,000 pages - 3.6s (worker #24364)
  • wrote 1,248,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,249,000 pages - 4.1s (worker #24364)
  • wrote 1,249,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,250,000 pages - 4.0s (worker #24364)
  • wrote 1,250,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,251,000 pages - 4.0s (worker #24364)
  • wrote 1,251,000 pages - 0.1s (worker #24364)
  • wrote 1,251,113 pages - 0.3s (worker #24364)
  • wrote 1,251,113 pages - 0.0s (worker #24364)

    💪 a worker has finished 💪

    • 1 workers still running -

    💪 a worker has finished 💪

    • 0 workers still running -

    page: #1,189,073 - "The History Mix Volume 1"

    👍 closing down.

    -- final count is 1,189,073 pages -- took 2.6 hours 🎉

e501 commented 6 years ago

From some web searches, similar "JavaScript heap out of memory" errors were resolved by increasing the "--max_old_space_size".

Thus, I dropped/deleted the enwiki database in mongodb and reran dumpster with the following command line: "node --max_old_space_size=4000000 /usr/local/lib/node_modules/dumpster-dive/bin/dumpster ./enwiki-latest-pages-articles.xml"
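As a sanity check that the flag takes effect (its value is interpreted as megabytes), the configured V8 heap limit can be printed from node itself - a quick sketch:

    // check-heap.js - print the heap limit this node process was started with
    const v8 = require('v8')
    const limitMB = v8.getHeapStatistics().heap_size_limit / 1024 / 1024
    console.log('heap limit: ' + limitMB.toFixed(0) + ' MB')
    // e.g. node --max_old_space_size=8192 check-heap.js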

The code did not report any error messages to the console and finished up with the following:

page: #1,183,402 - "Jonathan Marray"

  • wrote 1,236,000 pages - 4.1s (worker #26756)

  • wrote 1,236,000 pages - 3.7s (worker #26739)

  • wrote 1,225,000 pages - 2.5s (worker #26749)

  • wrote 1,248,000 pages - 2.5s (worker #26740)

  • wrote 1,237,000 pages - 3.5s (worker #26756)

  • wrote 1,237,000 pages - 2.9s (worker #26739)

  • wrote 1,249,000 pages - 2.8s (worker #26740)

  • wrote 1,226,000 pages - 3.2s (worker #26749)

  • wrote 1,238,000 pages - 3.2s (worker #26756)

  • wrote 1,250,000 pages - 2.8s (worker #26740) page: #1,187,042 - "Peermade"

  • wrote 1,238,000 pages - 3.1s (worker #26739)

  • wrote 1,227,000 pages - 2.7s (worker #26749)

  • wrote 1,251,000 pages - 2.5s (worker #26740)

  • wrote 1,228,000 pages - 2.3s (worker #26749)

  • wrote 1,239,000 pages - 3.6s (worker #26756)

  • wrote 1,251,113 pages - 0.2s (worker #26740)

  • wrote 1,239,000 pages - 2.7s (worker #26739)

    💪 a worker has finished 💪

    • 3 workers still running -
  • wrote 1,229,000 pages - 1.8s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,240,000 pages - 2.1s (worker #26756)

  • wrote 1,240,000 pages - 2.1s (worker #26739)

  • wrote 1,230,000 pages - 1.8s (worker #26749)

  • wrote 1,241,000 pages - 2.4s (worker #26756)

  • wrote 1,241,000 pages - 2.5s (worker #26739)

  • wrote 1,231,000 pages - 2.2s (worker #26749)

  • wrote 1,242,000 pages - 2.9s (worker #26756)

  • wrote 1,242,000 pages - 2.8s (worker #26739)

  • wrote 1,232,000 pages - 2.2s (worker #26749)

  • wrote 1,243,000 pages - 2.4s (worker #26756)

  • wrote 1,243,000 pages - 2.5s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,233,000 pages - 2.9s (worker #26749)

  • wrote 1,244,000 pages - 2.3s (worker #26756)

  • wrote 1,244,000 pages - 2.3s (worker #26739)

  • wrote 1,234,000 pages - 2.6s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,245,000 pages - 2.4s (worker #26756)

  • wrote 1,245,000 pages - 2.8s (worker #26739)

  • wrote 1,235,000 pages - 2.7s (worker #26749)

  • wrote 1,246,000 pages - 2.0s (worker #26756)

  • wrote 1,246,000 pages - 2.1s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,247,000 pages - 2.3s (worker #26756)

  • wrote 1,236,000 pages - 2.9s (worker #26749)

  • wrote 1,247,000 pages - 2.4s (worker #26739)

  • wrote 1,248,000 pages - 2.0s (worker #26756)

  • wrote 1,237,000 pages - 2.6s (worker #26749)

  • wrote 1,248,000 pages - 2.3s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,249,000 pages - 2.7s (worker #26756)

  • wrote 1,238,000 pages - 2.6s (worker #26749)

  • wrote 1,249,000 pages - 2.6s (worker #26739)

  • wrote 1,250,000 pages - 2.3s (worker #26756)

  • wrote 1,239,000 pages - 2.2s (worker #26749)

  • wrote 1,250,000 pages - 2.5s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,240,000 pages - 2.0s (worker #26749)

  • wrote 1,251,000 pages - 2.7s (worker #26756)

  • wrote 1,251,113 pages - 0.2s (worker #26756)

  • wrote 1,251,000 pages - 2.2s (worker #26739)

    💪 a worker has finished 💪

    • 2 workers still running -
  • wrote 1,251,113 pages - 0.2s (worker #26739)

    💪 a worker has finished 💪

    • 1 workers still running -
  • wrote 1,241,000 pages - 2.2s (worker #26749)

  • wrote 1,242,000 pages - 2.3s (worker #26749)

  • wrote 1,243,000 pages - 2.0s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,244,000 pages - 2.0s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,245,000 pages - 2.0s (worker #26749)

  • wrote 1,246,000 pages - 1.9s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,247,000 pages - 1.9s (worker #26749)

  • wrote 1,248,000 pages - 1.8s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,249,000 pages - 2.1s (worker #26749)

  • wrote 1,250,000 pages - 1.9s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,251,000 pages - 2.0s (worker #26749)

  • wrote 1,251,113 pages - 0.2s (worker #26749)

    💪 a worker has finished 💪

    • 0 workers still running -

      👍 closing down.

      -- final count is 1,189,073 pages -- took 111.7 minutes 🎉

spencermountain commented 6 years ago

hey, thanks. That's really helpful with the --max_old_space_size arg.

Yeah, that's for en-wikipedia? Something is off then - the count should be a lot closer to ~4.5m. A friend of mine ran it yesterday and had something similar - i think there are some errors being buried somewhere.

i'll try to pick-through it tomorrow. Think it's probably something about connections to mongo timing out, or something like that.

will ping you when i have something. thanks

e501 commented 6 years ago

Thanks for the feedback ... yeah, the Wikipedia stats report a total of 5,631,118 content pages (https://en.wikipedia.org/wiki/Special:Statistics).

While working through the details of why we are not getting all the pages loaded, I would like to explore how to best leverage the structure you have implemented for the parsed pages. From the pages I have revisited, there seem to be some baseline procedures that could provide more NLP-oriented statistical summaries (e.g. sentence length, number of sentences per section).
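For example, something like the following mongo-shell aggregation could report the average number of sentences per section - a rough sketch only, since I have not yet verified the exact schema or collection name that dumpster-dive writes:

    // assumes documents shaped like { title, sections: [ { title, sentences: [ ... ] } ] }
    // and a collection named 'pages' - both are assumptions on my part
    db.pages.aggregate([
      { $unwind: '$sections' },
      { $project: { sentenceCount: { $size: { $ifNull: ['$sections.sentences', []] } } } },
      { $group: { _id: null, sections: { $sum: 1 }, avgSentencesPerSection: { $avg: '$sentenceCount' } } }
    ])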

Any feedback/thoughts you may have are also of interest.

MTKnife commented 6 years ago

I agree that could be useful, though I'd make it optional if it has any performance impact at all.

MTKnife commented 6 years ago

BTW, I was going to try out the Redis option again, but I don't see the instructions for invoking it in the current readme. How do I do it?

MTKnife commented 6 years ago

It's working so far, but it's still demanding 8 "cores" on a 4-core machine, which makes getting anything else done on that machine very difficult. Is there a way yet to specify the number of cores/threads?

MTKnife commented 6 years ago

Whoops....It's been running for a couple of hours now, but I just realized it hasn't put a thing in the database--nor has it thrown an error.

MTKnife commented 6 years ago

OK, here's what happened in my case:

The app started up 7 workers. 6 of those finished, each having processed close to 800,000 articles--that doesn't quite add up to the expected 5.6 million, but it's in the ballpark. However, the 7th worker, which was lagging the whole time, simply froze a bit short of 500,000 articles, and that was all she wrote. In addition, as I pointed out above, nothing ever got added to the database; in fact, the database was never even created.

spencermountain commented 6 years ago

thanks @MTKnife - it didn't write anything to the database?? the logic around database names and collection names has changed, can you double-check that?

show dbs
use whatever
show collections

yeah, about that trailing worker, that sounds familiar. I'm gonna update the mongo lib and add some logging, so we can catch this bad-boy. I think the connection is probably just timing-out and needs to be restarted

spencermountain commented 6 years ago

@MTKnife did your output look roughly like this? (screenshot)

MTKnife commented 6 years ago

No, actually, that's another thing...I never got the article titles, and I never got anything else after that, either--just the initial (wrong) detection of number of cores (it detects threads, not cores), followed by page number updates. I never got the "a worker has finished" message, despite the fact that 6 workers finished.
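For reference, Node's os.cpus() returns one entry per logical CPU (i.e. per hardware thread), which is why it reports 8 on my 4-core machine:

    // node counts logical CPUs, not physical cores
    const os = require('os')
    console.log('logical CPUs: ' + os.cpus().length)
    // crude guess at physical cores (assumes 2 threads per core, which is not always true)
    console.log('physical cores (guess): ' + Math.max(1, os.cpus().length / 2))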

As for the database, my instance never showed anything but "admin" and "local". I can try it again tomorrow to see if a restart does anything, but the server was responding just fine to commands in the terminal window.

MTKnife commented 6 years ago

It occurs to me that the behavior I saw (workers getting almost but not quite to the expected number of articles, and not seeing any "finished" messages) probably indicates that all 7 workers froze at different points.

spencermountain commented 6 years ago

updated to the latest mongodb drivers and improved the logging, in 3.0.3. you'll maybe need to npm install to try it.

MTKnife commented 6 years ago

I'm about to try it now, as soon as I can get MongoDB upgraded from 3.4 to 3.6.

BTW, yes, I think an npm install is always mandatory on a Windows system, unless you create a symlink from the user directory where npm deposits its copy to the local git repo.

EDIT: Actually, while the symlink ("directory junction" using mklink in Windows) worked before, I can't get it to work now, so yes, an npm install will always be necessary.

MTKnife commented 6 years ago

OK, just tried running it, and wow, that was ugly:

$ dumpster C:/data/Wikipedia/enwiki-20180420-pages-articles.xml
filesize: 66771313462

         ----------
          oh hi 👋

        total file size: 62.2 GB
        8 cpu cores detected.
        - each worker will be given: 7.8 GB -
         ----------

C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62
console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
^

TypeError: Cannot read property 'title' of null
    at donePage (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62:64)
    at parseLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\01-parseLine.js:20:7)
    at LineByLineReader.lr.on (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:77:13)
    at emitOne (events.js:116:13)
    at LineByLineReader.emit (events.js:211:7)
    at LineByLineReader._nextLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:129:8)
    at Immediate._onImmediate (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:134:9)
    at runCallback (timers.js:794:20)
    at tryOnImmediate (timers.js:752:5)
    at processImmediate [as _immediateCallback] (timers.js:729:5)
C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62
console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
^

TypeError: Cannot read property 'title' of null
    at donePage (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62:64)
    at parseLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\01-parseLine.js:20:7)
    at LineByLineReader.lr.on (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:77:13)
    at emitOne (events.js:116:13)
    at LineByLineReader.emit (events.js:211:7)
    at LineByLineReader._nextLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:129:8)
    at Immediate._onImmediate (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:134:9)
    at runCallback (timers.js:794:20)
    at tryOnImmediate (timers.js:752:5)
    at processImmediate [as _immediateCallback] (timers.js:729:5)
--uncaught process error--
{ ProcessTerminatedError: cancel after 0 retries!
    at tasks.filter.forEach.task (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:111:39)
    at Array.forEach (<anonymous>)
    at WorkerNodes.handleWorkerExit (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:110:14)
    at Worker.worker.on.exitCode (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:160:44)
    at emitOne (events.js:116:13)
    at Worker.emit (events.js:211:7)
    at WorkerProcess.Worker.process.once.code (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\worker.js:39:18)
    at Object.onceWrapper (events.js:315:30)
    at emitOne (events.js:116:13)
    at WorkerProcess.emit (events.js:211:7)
  name: 'ProcessTerminatedError',
  message: 'cancel after 0 retries!' }
C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62
console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
^

After that, there's more of the same, but it stops within a few seconds.

Still no database being added in MongoDB, BTW.

EDIT: While the error messages stopped, and the program didn't actually issue any more output, I did still have to ctrl-C to get it to exit.
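A null check around that log line would at least keep the worker alive - a rough sketch only (logSkippedPage is just a hypothetical wrapper for the crashing console.log, since I don't know what a null pageObj is supposed to mean here):

    // sketch of a guard for src/worker/index.js:62 - not the actual fix
    const chalk = require('chalk')
    const logSkippedPage = function(pageObj) {
      if (!pageObj || typeof pageObj.title !== 'string') {
        console.log(chalk.yellow(' -skipping a page the parser returned as null-'))
        return
      }
      console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
    }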

spencermountain commented 6 years ago

hey folks, a few reports of people finally getting to the end without errors on 3.1.0 💃 try it out?

been refactoring a lot of the template work in wtf_wikipedia, so expect some changes with that formatting in the next few releases

e501 commented 6 years ago

Many thanks for the update. I downloaded the most recent dump of enwiki last night and am now running dumpster. A couple of "duplicate entries" scrolled by while watching the console. The following error also scrolled by:

2 +1,000 pages - 282ms - "Yoo Dong-geun"

---Error on "Ikeje Asogwa" TypeError: Cannot read property 'replace' of undefined

As I was typing this, the following message indicated that one of the workers finished:

 current: 3,261,461 pages  - "Swimming at the 2013 World Aquatics Championships – Men's 50 metre breaststroke"

2 +1,000 pages - 422ms - "1987 IIHF Asian Oceanic Junior U18 Championship"

3 +1,000 pages - 364ms - "Bang Son Station"

0 +1,000 pages - 573ms - "WNIC"

2 +1,000 pages - 299ms - "Strophanthus eminii"

2 +1,000 pages - 308ms - "John Cogswell"

2 +1 pages - 3ms - "United States presidential election in Georgia, 1988"

💪  worker #2 has finished 💪
  - 3 workers still running -

0 +1,000 pages - 497ms - "North Zulch, Texas"

1 +1,000 pages - 358ms - "Drakesbad Guest Ranch"

3 +1,000 pages - 378ms - "Gottfried Knoche"

 current: 3,268,998 pages  - "Rosalie Woodruff"     

... the system monitor confirms that only three workers are still active. It seems this may be a premature termination of that worker thread/process.

On a separate yet related topic, I would like to track down more details and learn more about the practical issues/challenges of parsing Wikipedia pages that are based on categories, lists, and navigation templates (https://en.wikipedia.org/wiki/Wikipedia:Categories,_lists,_and_navigation_templates#Navigation_templates). Of particular interest are navbox and sidebar navigation templates (https://en.wikipedia.org/wiki/Wikipedia:Navigation_template).

Looking forward to working with a more complete set of Wikipedia pages.

Later today, I'll be sure to post an update.

Thanks again !

e501 commented 6 years ago

Another error was reported to the console ... the respective message context follows:

 current: 4,745,998 pages  - "Thomas Stelzer"

0 +1,000 pages - 358ms - "Dunback and Makareao Branches"

3 +1,000 pages - 378ms - "Trinidad Morgades Besari"

1 +1,000 pages - 447ms - "Tōnohama Station"

0 +1,000 pages - 410ms - "A'isha bint Talhah"

---Error on "Gustavo Isaza MejΓ­a" TypeError: Cannot read property 'replace' of undefined

Another worker finished ... the respective console messages follow:

 current: 4,989,998 pages  - "Guitar Rock Tour"

0 +1,000 pages - 338ms - "Signaling System No. 5"

3 +1,000 pages - 355ms - "Benchmark School"

1 +1,000 pages - 318ms - "Eucalyptus × tetragona"

0 +1,000 pages - 339ms - "Fartein Valen"

0 +70 pages - 37ms - "Dimensions (Freedom Call album)"

💪  worker #0 has finished 💪
  - 2 workers still running -

3 +1,000 pages - 282ms - "Bounty Hunters 2: Hardball"

1 +1,000 pages - 362ms - "List of Doctor Who music releases"

 current: 4,996,068 pages  - "Malato"     

Note that I opted NOT to skip the disambiguation and redirect pages.

spencermountain commented 6 years ago

hey thanks e501, that's great!

by any chance did you get a line number for those "Ikeje Asogwa" or "Gustavo Isaza Mejía" errors? I'll try to hunt that down today.

Yeah, there can be a great deal of difference between the workers' progress, sometimes 20% or so. Having one worker finish before the others is a drag, and something I'm currently thinking about how to improve. Any ideas on this are welcome. If we're expecting 5m pages or so, it sounds reasonable that one would finish early at 3.2m.

Yeah, the template world in en-wikipedia alone is pretty crazy. I'd like to see a version of dumpster roll-through the navboxes too, but right now it's sticking with pages in the 0 namespace. You can see the parsing logic in wtf_wikipedia coming along here, but it's not very clever or well-done right now.

Parsing the navboxes would be cool, and doable with that code. Dumpster-dive would need to pay attention to namespace 10. It's a cool idea.
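roughly, the change would just be letting another namespace id through when each page's <ns> tag gets checked - a sketch only, not the real dumpster-dive code:

    // mediawiki namespaces to keep: 0 = articles, 10 = templates (where navboxes live)
    const WANTED_NAMESPACES = new Set([0, 10])

    const keepPage = function(pageXml) {
      const m = pageXml.match(/<ns>(\d+)<\/ns>/)
      const ns = m ? parseInt(m[1], 10) : -1
      return WANTED_NAMESPACES.has(ns)
    }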

spencermountain commented 6 years ago

oh hey, just found that error. nevermind.

e501 commented 6 years ago

Wow ... only took a little over two hours to load in the rest of the pages

 current: 5,650,035 pages  - "Charles John Talbot"     

1 +1,000 pages - 338ms - "Lilium maritimum"

1 +1,000 pages - 284ms - "Romanus Nadolney"

1 +856 pages - 231ms - "TGSCOM"

💪  worker #1 has finished 💪
  - 0 workers still running -

  👍  closing down.

 -- final count is 5,652,891 pages --
   took 2.2 hours
          🎉
spencermountain commented 6 years ago

you did en-wikipedia in 2 hours?! how??! What's your computer like?!!

e501 commented 6 years ago

In hindsight, I think I still had the pages from previous runs of earlier versions that terminated early ... I forgot to verify that I had deleted the enwiki database from those runs.

Since I downloaded the most recent version of enwiki, I have deleted the db in mongodb and am re-running the code.

Note that Mongo said the enwiki db was 34.967GB. Also, the following is the console output that provides an estimate of the expected run time:

              oh hi 👋
     size:              62.7 GB
                      4 workers
                   15.7 GB each
     estimate:          4.6 hrs
     ---------------------------
e501 commented 6 years ago

Hmmm ...

The code ran for about the same amount of time ...

  👍  closing down.

 -- final count is 5,652,891 pages --
   took 2.2 hours
          🎉

... but the enwiki db is only 27.090GB

e501 commented 6 years ago

It may be that the recent server upgrade (i7, DDR4 RAM, and a RAID array) helps speed up the process. Currently re-running the dumpster code ... as expected/hoped, the duplicate pages are recognized and no new pages are being entered/added.

On a separate note, for the mongodb entries, I am not sure how to identify which sentences go with which paragraphs within a respective section. Ideally, I would like to be able to make this next level of distinction down to paragraph resolution. Please let me know if there may be some additional metadata for further segmenting the sentences.
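In the meantime, I may try approximating paragraphs myself by re-splitting each section's plain text on blank lines - a rough sketch, assuming I keep the plain text of each section available somewhere:

    // blank line = paragraph break (an approximation of wikitext paragraphs)
    const splitParagraphs = function(sectionText) {
      return sectionText
        .split(/\n\s*\n/)
        .map(p => p.trim())
        .filter(p => p.length > 0)
    }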

Lots of fun !

spencermountain commented 6 years ago

;) yeah! paragraphs are not something we currently support, but should!

yeah, you're the current record-holder for parsing speed. I wanna try scripting ec2 stuff, to make it a one-liner to run this on a big machine. If you have any experience with this, I'd love some help.

MTKnife commented 6 years ago

It's likely gonna be a few days before I get a chance to try it again, but I'm looking forward to it.

e501 commented 6 years ago

A more browser-based (e.g. liquid-computing) approach might be interesting and would enable better utilization of a broad range of personal (and shared) web-enabled devices (e.g. desktops/workstations, laptops, phones) [1]. It might also address some of the data-streaming aspects of the task and enable more of a dataflow-oriented model of computation.

Look forward to your feedback/comments and suggestions.

REFS

[1] Pham, Quoc-Viet, et al. "Decentralized Computation Offloading and Resource Allocation in Heterogeneous Networks with Mobile Edge Computing." arXiv preprint arXiv:1803.00683 (2018). https://arxiv.org/pdf/1803.00683