spencermountain / dumpster-dive

roll a wikipedia dump into mongo

Great progress ! ... console is outputting "Error: Cannot find module '../../infobox/infobox' ..." #36

Closed e501 closed 6 years ago

e501 commented 6 years ago

Glad to see such great progress !

I downloaded the latest english Wikipedia dump (enwiki-latest-pages-articles.xml.bz2), as documented in the readme. Extracted via the archive manager (OS: Ubuntu 16.04).

Now loading into mongodb via command line (dumpster ./enwiki-latest-pages-articles.xml)

As the script is running, an error message is repeatedly displayed: "Error: Cannot find module '../../infobox/infobox' ..."

Within mongo shell, the enwiki database is not yet visible.

I need to know whether I should terminate this run (e.g. Ctrl-C) and restart it with the infobox option enabled. Ideally, I would like to have the infobox pages alongside the other articles in mongo.

Greatly appreciate the continued progress with enabling this type of capability !

e501 commented 6 years ago

Sample output from the terminal console, with the last line due to program termination (i.e. Ctrl-C):

Error: Cannot find module '../../infobox/infobox'
    at Function.Module._resolveFilename (module.js:542:15)
    at Function.Module._load (module.js:472:25)
    at Module.require (module.js:585:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/document/templates/index.js:6:17)
    at Module._compile (module.js:641:30)
    at Object.Module._extensions..js (module.js:652:10)
    at Module.load (module.js:560:32)
    at tryModuleLoad (module.js:503:12)
    at Function.Module._load (module.js:495:3)
module.js:544
    throw err;
    ^

Error: Cannot find module '../../infobox/infobox'
    at Function.Module._resolveFilename (module.js:542:15)
    at Function.Module._load (module.js:472:25)
    at Module.require (module.js:585:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src/document/templates/index.js:6:17)
    at Module._compile (module.js:641:30)
    at Object.Module._extensions..js (module.js:652:10)
    at Module.load (module.js:560:32)
    at tryModuleLoad (module.js:503:12)
    at Function.Module._load (module.js:495:3)
^C one sec, cleaning-up the workers...

spencermountain commented 6 years ago

hey thanks e501, which node version are you running? I suspect you'll need node 6+, I should mention that in the docs. if that's not the issue, I'll try to reproduce it

e501 commented 6 years ago

When I checked my mongodb configuration, I decided to upgrade to the default 3.6 configuration for Ubuntu 16.04. Thus, I completely uninstalled the older version and did a fresh install of the MongoDB 3.6 Community edition.

Unfortunately, I reproduced the same errors as before. The following is the output of the log file for the fresh install of mongodb:

2018-04-28T10:29:20.847-0700 I STORAGE [initandlisten] createCollection: admin.system.version with provided UUID: 5db1afbd-0191-49f2-b000-a51d94e85c04
2018-04-28T10:29:21.115-0700 I COMMAND [initandlisten] setting featureCompatibilityVersion to 3.6
2018-04-28T10:29:21.117-0700 I STORAGE [initandlisten] createCollection: local.startup_log with generated UUID: d31624be-d30c-450a-9e2c-9ddb6ea8cabd
2018-04-28T10:29:21.243-0700 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/var/lib/mongodb/diagnostic.data'
2018-04-28T10:29:21.243-0700 I NETWORK [initandlisten] waiting for connections on port 27017
2018-04-28T10:34:21.243-0700 I STORAGE [thread1] createCollection: config.system.sessions with generated UUID: 80bf0be9-7471-4f42-9e78-303babd3b85d
2018-04-28T10:34:21.513-0700 I INDEX [thread1] build index on: config.system.sessions properties: { v: 2, key: { lastUse: 1 }, name: "lsidTTLIndex", ns: "config.system.sessions", expireAfterSeconds: 1800 }
2018-04-28T10:34:21.513-0700 I INDEX [thread1] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2018-04-28T10:34:21.514-0700 I INDEX [thread1] build index done. scanned 0 total records. 0 secs
2018-04-28T10:34:21.515-0700 I COMMAND [thread1] command config.$cmd command: createIndexes { createIndexes: "system.sessions", indexes: [ { key: { lastUse: 1 }, name: "lsidTTLIndex", expireAfterSeconds: 1800 } ], $db: "config" } numYields:0 reslen:98 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_msg 271ms
2018-04-28T10:38:23.220-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54510 #1 (1 connection now open)
2018-04-28T10:38:23.223-0700 I NETWORK [conn1] received client metadata from 127.0.0.1:54510 conn1: { driver: { name: "nodejs", version: "2.2.33" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.9.9-040909-generic" }, platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17" }
2018-04-28T10:38:23.229-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54512 #2 (2 connections now open)
2018-04-28T10:38:23.229-0700 I NETWORK [conn2] received client metadata from 127.0.0.1:54512 conn2: { driver: { name: "nodejs", version: "2.2.33" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.9.9-040909-generic" }, platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17" }
2018-04-28T10:39:38.320-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54516 #3 (3 connections now open)

The node version reported by mongodb (platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17") agrees with what I checked via the command line. Also, it looks like your code is making connections to mongodb.

In the error output to the console, there is a peculiar-looking double dot in the following: "Object.Module._extensions..js (module.js:652:10)". I am new to javascript, so I am not sure how best to interpret it.

If possible, I would like to do a local build from source and use that as a baseline to help me learn how to dig into these sorts of issues.

Many thanks for helping track down why the code does not want to run properly.

From an initial pass at a local build, it looks like there may be an issue with the loader resolving the "infobox" module path inside the wtf_wikipedia package.

spencermountain commented 6 years ago

hey, think i figured out this ../infobox error - the path was titlecased, and some operating systems (with case-insensitive filesystems) tolerate that while others don't.
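fwiw, the quickest way to confirm a case mismatch is to look at what's actually on disk - a rough sketch, with the directory path guessed from the stack trace above:

    // list what casing the installed wtf_wikipedia files actually use
    // (path below is inferred from the stack trace - adjust for your install)
    const fs = require('fs')
    const src = '/usr/local/lib/node_modules/dumpster-dive/node_modules/wtf_wikipedia/src'
    if (fs.existsSync(src)) {
      // on a case-sensitive filesystem (typical linux), 'Infobox' and 'infobox' are different names,
      // so require('../../infobox/infobox') against a titlecased folder throws MODULE_NOT_FOUND
      console.log(fs.readdirSync(src))
    }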

yeah, mongo 3 is correct. added to the docs.

can you try v3.0.1? cheers

e501 commented 6 years ago

Very nice !

afwiki executed with no problems and loaded the pages in mongodb

When I ran dumpster (v3.0.2) for enwiki, the following was the initial console output (which included a "JavaScript heap out of memory" error):

         ----------
          oh hi 👋

        total file size: 62.2 GB
        4 cpu cores detected.
        - each worker will be given: 15.5 GB -
         ----------

- wrote 1,000 pages  - 14.2s   (worker #23291)

page: #952 - "Alexander of Greece (disambiguation)"
page: #952 - "Alexander of Greece (disambiguation)"

<--- Last few GCs --->

[23296:0x3bb31b0] 49449 ms: Mark-sweep 1200.4 (1434.3) -> 1200.3 (1434.3) MB, 314.3 / 0.1 ms allocation failure GC in old space requested
[23296:0x3bb31b0] 49698 ms: Mark-sweep 1200.3 (1434.3) -> 1200.3 (1413.3) MB, 249.1 / 0.0 ms last resort GC in old space requested
[23296:0x3bb31b0] 49966 ms: Mark-sweep 1200.3 (1413.3) -> 1200.3 (1407.3) MB, 267.8 / 0.0 ms last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

0: ExitFrame [pc: 0xe810118427d]

Security context: 0x30dfecda06a9
    1: _send [internal/child_process.js:701] [bytecode=0x11011ea4e6a1 offset=606](this=0x1f8536184221, message=0x2954a760fe99, handle=0x34f4dc0822e1, options=0x2954a760ff41, callback=0x34f4dc0822e1)
    2: send [internal/child_process.js:611] [bytecode...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/usr/local/bin/node]
 2: 0x87b56c [/usr/local/bin/node]
 3: v8::Utils::ReportOOMFailure(char const, bool) [/usr/local/bin/node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const, bool) [/usr/local/bin/node]
 5: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/usr/local/bin/node]
 6: v8::internal::String::SlowFlatten(v8::internal::Handle, v8::internal::PretenureFlag) [/usr/local/bin/node]
 7: v8::internal::String::Flatten(v8::internal::Handle, v8::internal::PretenureFlag) [/usr/local/bin/node]
 8: v8::String::WriteUtf8(char, int, int, int) const [/usr/local/bin/node]
 9: node::StringBytes::Write(v8::Isolate, char, unsigned long, v8::Local, node::encoding, int) [/usr/local/bin/node]
10: int node::StreamBase::WriteString<(node::encoding)1>(v8::FunctionCallbackInfo const&) [/usr/local/bin/node]
11: void node::StreamBase::JSMethod<node::LibuvStreamWrap, &(int node::StreamBase::WriteString<(node::encoding)1>(v8::FunctionCallbackInfo const&))>(v8::FunctionCallbackInfo const&) [/usr/local/bin/node]
12: v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo) [/usr/local/bin/node]
13: 0xad62fa [/usr/local/bin/node]
14: v8::internal::Builtin_HandleApiCall(int, v8::internal::Object*, v8::internal::Isolate) [/usr/local/bin/node]
15: 0xe810118427d

page: #2,826 - "Calvin Coolidge"
page: #2,826 - "Calvin Coolidge"

  • wrote 4,000 pages - 14.3s (worker #23291) page: #3,746 - "Dartmoor wildlife"
  • wrote 5,000 pages - 11.6s (worker #23291)
e501 commented 6 years ago

... over a million pages and mongo reports "enwiki 9.845GB"

At the console, after reporting that "a worker has finished", the same message keeps getting printed, as shown in the following:

page: #1,188,005 - "Steelyard balance"

  • wrote 1,251,000 pages - 2.1s (worker #23291)
  • wrote 1,251,113 pages - 0.2s (worker #23291)

    💪 a worker has finished 💪

    • 3 workers still running -

    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"
    page: #1,189,073 - "The History Mix Volume 1"

e501 commented 6 years ago

Terminated the initial run (as discussed previously) and re-ran the code. Note that Wikipedia reports a total of 5,631,118 content pages (https://en.wikipedia.org/wiki/Special:Statistics)

The code ran to completion with the following console messages:

- wrote 1,251,000 pages  - 0.1s   (worker #24376)
- wrote 1,251,113 pages  - 0.3s   (worker #24376)
- wrote 1,251,113 pages  - 0.0s   (worker #24376)

💪  a worker has finished 💪
  - 3 workers still running -

💪  a worker has finished 💪
  - 2 workers still running -

- wrote 1,242,000 pages  - 4.8s   (worker #24364)
- wrote 1,242,000 pages  - 0.1s   (worker #24364)

page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,243,000 pages - 4.2s (worker #24364)
  • wrote 1,243,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,244,000 pages - 4.0s (worker #24364)
  • wrote 1,244,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,245,000 pages - 4.0s (worker #24364)
  • wrote 1,245,000 pages - 0.1s (worker #24364)
  • wrote 1,246,000 pages - 3.7s (worker #24364)
  • wrote 1,246,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,247,000 pages - 3.9s (worker #24364)
  • wrote 1,247,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,248,000 pages - 3.6s (worker #24364)
  • wrote 1,248,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,249,000 pages - 4.1s (worker #24364)
  • wrote 1,249,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,250,000 pages - 4.0s (worker #24364)
  • wrote 1,250,000 pages - 0.1s (worker #24364) page: #1,189,073 - "The History Mix Volume 1"
  • wrote 1,251,000 pages - 4.0s (worker #24364)
  • wrote 1,251,000 pages - 0.1s (worker #24364)
  • wrote 1,251,113 pages - 0.3s (worker #24364)
  • wrote 1,251,113 pages - 0.0s (worker #24364)

    💪 a worker has finished 💪

    • 1 workers still running -

    💪 a worker has finished 💪

    • 0 workers still running -

    page: #1,189,073 - "The History Mix Volume 1"

    👍 closing down.

    -- final count is 1,189,073 pages -- took 2.6 hours 🎉

e501 commented 6 years ago

From some web searches, similar "JavaScript heap out of memory" errors were resolved by increasing the "--max_old_space_size".

Thus, I dropped/deleted the enwiki database in mongodb and reran dumpster with the following command line: "node --max_old_space_size=4000000 /usr/local/lib/node_modules/dumpster-dive/bin/dumpster ./enwiki-latest-pages-articles.xml"
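As a sanity check that the flag takes effect (its value is interpreted as megabytes), the configured V8 heap limit can be printed from node itself - a quick sketch:

    // check-heap.js - print the heap limit this node process was started with
    const v8 = require('v8')
    const limitMB = v8.getHeapStatistics().heap_size_limit / 1024 / 1024
    console.log('heap limit: ' + limitMB.toFixed(0) + ' MB')
    // e.g. node --max_old_space_size=8192 check-heap.js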

The code did not report any error messages to the console and finished up with the following:

page: #1,183,402 - "Jonathan Marray"

  • wrote 1,236,000 pages - 4.1s (worker #26756)

  • wrote 1,236,000 pages - 3.7s (worker #26739)

  • wrote 1,225,000 pages - 2.5s (worker #26749)

  • wrote 1,248,000 pages - 2.5s (worker #26740)

  • wrote 1,237,000 pages - 3.5s (worker #26756)

  • wrote 1,237,000 pages - 2.9s (worker #26739)

  • wrote 1,249,000 pages - 2.8s (worker #26740)

  • wrote 1,226,000 pages - 3.2s (worker #26749)

  • wrote 1,238,000 pages - 3.2s (worker #26756)

  • wrote 1,250,000 pages - 2.8s (worker #26740) page: #1,187,042 - "Peermade"

  • wrote 1,238,000 pages - 3.1s (worker #26739)

  • wrote 1,227,000 pages - 2.7s (worker #26749)

  • wrote 1,251,000 pages - 2.5s (worker #26740)

  • wrote 1,228,000 pages - 2.3s (worker #26749)

  • wrote 1,239,000 pages - 3.6s (worker #26756)

  • wrote 1,251,113 pages - 0.2s (worker #26740)

  • wrote 1,239,000 pages - 2.7s (worker #26739)

    💪 a worker has finished 💪

    • 3 workers still running -
  • wrote 1,229,000 pages - 1.8s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,240,000 pages - 2.1s (worker #26756)

  • wrote 1,240,000 pages - 2.1s (worker #26739)

  • wrote 1,230,000 pages - 1.8s (worker #26749)

  • wrote 1,241,000 pages - 2.4s (worker #26756)

  • wrote 1,241,000 pages - 2.5s (worker #26739)

  • wrote 1,231,000 pages - 2.2s (worker #26749)

  • wrote 1,242,000 pages - 2.9s (worker #26756)

  • wrote 1,242,000 pages - 2.8s (worker #26739)

  • wrote 1,232,000 pages - 2.2s (worker #26749)

  • wrote 1,243,000 pages - 2.4s (worker #26756)

  • wrote 1,243,000 pages - 2.5s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,233,000 pages - 2.9s (worker #26749)

  • wrote 1,244,000 pages - 2.3s (worker #26756)

  • wrote 1,244,000 pages - 2.3s (worker #26739)

  • wrote 1,234,000 pages - 2.6s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,245,000 pages - 2.4s (worker #26756)

  • wrote 1,245,000 pages - 2.8s (worker #26739)

  • wrote 1,235,000 pages - 2.7s (worker #26749)

  • wrote 1,246,000 pages - 2.0s (worker #26756)

  • wrote 1,246,000 pages - 2.1s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,247,000 pages - 2.3s (worker #26756)

  • wrote 1,236,000 pages - 2.9s (worker #26749)

  • wrote 1,247,000 pages - 2.4s (worker #26739)

  • wrote 1,248,000 pages - 2.0s (worker #26756)

  • wrote 1,237,000 pages - 2.6s (worker #26749)

  • wrote 1,248,000 pages - 2.3s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,249,000 pages - 2.7s (worker #26756)

  • wrote 1,238,000 pages - 2.6s (worker #26749)

  • wrote 1,249,000 pages - 2.6s (worker #26739)

  • wrote 1,250,000 pages - 2.3s (worker #26756)

  • wrote 1,239,000 pages - 2.2s (worker #26749)

  • wrote 1,250,000 pages - 2.5s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,240,000 pages - 2.0s (worker #26749)

  • wrote 1,251,000 pages - 2.7s (worker #26756)

  • wrote 1,251,113 pages - 0.2s (worker #26756)

  • wrote 1,251,000 pages - 2.2s (worker #26739)

    💪 a worker has finished 💪

    • 2 workers still running -
  • wrote 1,251,113 pages - 0.2s (worker #26739)

    💪 a worker has finished 💪

    • 1 workers still running -
  • wrote 1,241,000 pages - 2.2s (worker #26749)

  • wrote 1,242,000 pages - 2.3s (worker #26749)

  • wrote 1,243,000 pages - 2.0s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,244,000 pages - 2.0s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,245,000 pages - 2.0s (worker #26749)

  • wrote 1,246,000 pages - 1.9s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,247,000 pages - 1.9s (worker #26749)

  • wrote 1,248,000 pages - 1.8s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,249,000 pages - 2.1s (worker #26749)

  • wrote 1,250,000 pages - 1.9s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"

  • wrote 1,251,000 pages - 2.0s (worker #26749)

  • wrote 1,251,113 pages - 0.2s (worker #26749)

    💪 a worker has finished 💪

    • 0 workers still running -

      👍 closing down.

      -- final count is 1,189,073 pages -- took 111.7 minutes 🎉

spencermountain commented 6 years ago

hey, thanks. That's really helpful with the --max_old_space_size arg.

Yeah, that's for en-wikipedia? Something is off then - the count should be a lot closer to ~4.5m. A friend of mine ran it yesterday and had something similar - i think there are some errors being buried somewhere.

i'll try to pick-through it tomorrow. Think it's probably something about connections to mongo timing out, or something like that.

will ping you when i have something. thanks

e501 commented 6 years ago

Thanks for the feedback ... yeah, the Wikipedia stats report a total of 5,631,118 content pages (https://en.wikipedia.org/wiki/Special:Statistics).

While working through the details of why we are not getting all the pages loaded, I would like to explore how to best leverage the structure you have implemented for the parsed pages. From the pages I have revisited, there seem to be some baseline procedures that could provide more NLP-oriented statistical summaries (e.g. sentence length, number of sentences per section).
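For example, something like the following mongo-shell aggregation could report the average number of sentences per section - a rough sketch only, since I have not yet verified the exact schema or collection name that dumpster-dive writes:

    // assumes documents shaped like { title, sections: [ { title, sentences: [ ... ] } ] }
    // and a collection named 'pages' - both are assumptions on my part
    db.pages.aggregate([
      { $unwind: '$sections' },
      { $project: { sentenceCount: { $size: { $ifNull: ['$sections.sentences', []] } } } },
      { $group: { _id: null, sections: { $sum: 1 }, avgSentencesPerSection: { $avg: '$sentenceCount' } } }
    ])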

Any feedback/thoughts you may have are also of interest.

MTKnife commented 6 years ago

I agree that could be useful, though I'd make it optional if it has any performance impact at all.

MTKnife commented 6 years ago

BTW, I was going to try out the Redis option again, but I don't see the instructions for invoking it in the current readme. How do I do it?

MTKnife commented 6 years ago

It's working so far, but it's still demanding 8 "cores" on a 4-core machine, which makes getting anything else done on that machine very difficult. Is there a way yet to specify the number of cores/threads?

MTKnife commented 6 years ago

Whoops....It's been running for a couple of hours now, but I just realized it hasn't put a thing in the database--nor has it thrown an error.

MTKnife commented 6 years ago

OK, here's what happened in my case:

The app started up 7 workers. 6 of those finished, each having processed close to 800,000 articles--that doesn't quite add up to the expected 5.6 million, but it's in the ballpark. However, the 7th worker, which was lagging the whole time, simply froze a bit short of 500,000 articles, and that was all she wrote. In addition, as I pointed out above, nothing ever got added to the database; in fact, the database was never even created.

spencermountain commented 6 years ago

thanks @MTKnife - it didn't write anything to the database?? the logic around database names and collection names has changed, can you double-check that?

show dbs
use whatever
show collections

yeah, about that trailing worker, that sounds familiar. I'm gonna update the mongo lib and add some logging, so we can catch this bad-boy. I think the connection is probably just timing-out and needs to be restarted

spencermountain commented 6 years ago

@MTKnife did your output look roughly like this? (screenshot)

MTKnife commented 6 years ago

No, actually, that's another thing...I never got the article titles, and I never got anything else after that, either--just the initial (wrong) detection of number of cores (it detects threads, not cores), followed by page number updates. I never got the "a worker has finished" message, despite the fact that 6 workers finished.
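For reference, Node's os.cpus() returns one entry per logical CPU (i.e. per hardware thread), which is why it reports 8 on my 4-core machine:

    // node counts logical CPUs, not physical cores
    const os = require('os')
    console.log('logical CPUs: ' + os.cpus().length)
    // crude guess at physical cores (assumes 2 threads per core, which is not always true)
    console.log('physical cores (guess): ' + Math.max(1, os.cpus().length / 2))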

As for the database, my instance never showed anything but "admin" and "local". I can try it again tomorrow to see if a restart does anything, but the server was responding just fine to commands in the terminal window.

MTKnife commented 6 years ago

It occurs to me that the behavior I saw (workers getting almost but not quite to the expected number of articles, and not seeing any "finished" messages) probably indicates that all 7 workers froze at different points.

spencermountain commented 6 years ago

updated to the latest mongodb drivers and improved the logging, in 3.0.3. you'll maybe need to npm install to try it.

MTKnife commented 6 years ago

I'm about to try it now, as soon as I can get MongoDB upgraded from 3.4 to 3.6.

BTW, yes, I think an npm install is always mandatory on a Windows system, unless you create a symlink from the user directory where npm deposits its copy to the local git repo.

EDIT: Actually, while the symlink ("directory junction" using mklink in Windows) worked before, I can't get it to work now, so yes, an npm install will always be necessary.

MTKnife commented 6 years ago

OK, just tried running it, and wow, that was ugly:

$ dumpster C:/data/Wikipedia/enwiki-20180420-pages-articles.xml
filesize: 66771313462

         ----------
          oh hi 👋

        total file size: 62.2 GB
        8 cpu cores detected.
        - each worker will be given: 7.8 GB -
         ----------

C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62
console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
^

TypeError: Cannot read property 'title' of null
    at donePage (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62:64)
    at parseLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\01-parseLine.js:20:7)
    at LineByLineReader.lr.on (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:77:13)
    at emitOne (events.js:116:13)
    at LineByLineReader.emit (events.js:211:7)
    at LineByLineReader._nextLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:129:8)
    at Immediate._onImmediate (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:134:9)
    at runCallback (timers.js:794:20)
    at tryOnImmediate (timers.js:752:5)
    at processImmediate [as _immediateCallback] (timers.js:729:5)
C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62
console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
^

TypeError: Cannot read property 'title' of null
    at donePage (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62:64)
    at parseLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\01-parseLine.js:20:7)
    at LineByLineReader.lr.on (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:77:13)
    at emitOne (events.js:116:13)
    at LineByLineReader.emit (events.js:211:7)
    at LineByLineReader._nextLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:129:8)
    at Immediate._onImmediate (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:134:9)
    at runCallback (timers.js:794:20)
    at tryOnImmediate (timers.js:752:5)
    at processImmediate [as _immediateCallback] (timers.js:729:5)
--uncaught process error--
{ ProcessTerminatedError: cancel after 0 retries!
    at tasks.filter.forEach.task (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:111:39)
    at Array.forEach (<anonymous>)
    at WorkerNodes.handleWorkerExit (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:110:14)
    at Worker.worker.on.exitCode (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:160:44)
    at emitOne (events.js:116:13)
    at Worker.emit (events.js:211:7)
    at WorkerProcess.Worker.process.once.code (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\worker.js:39:18)
    at Object.onceWrapper (events.js:315:30)
    at emitOne (events.js:116:13)
    at WorkerProcess.emit (events.js:211:7)
  name: 'ProcessTerminatedError',
  message: 'cancel after 0 retries!' }
C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62
console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
^

After that, there's more of the same, but it stops within a few seconds.

Still no database being added in MongoDB, BTW.

EDIT: While the error messages stopped, and the program didn't actually issue any more output, I did still have to ctrl-C to get it to exit.
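A null check around that log line would at least keep the worker alive - a rough sketch only (logSkippedPage is just a hypothetical wrapper for the crashing console.log, since I don't know what a null pageObj is supposed to mean here):

    // sketch of a guard for src/worker/index.js:62 - not the actual fix
    const chalk = require('chalk')
    const logSkippedPage = function(pageObj) {
      if (!pageObj || typeof pageObj.title !== 'string') {
        console.log(chalk.yellow(' -skipping a page the parser returned as null-'))
        return
      }
      console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
    }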

spencermountain commented 6 years ago

hey folks, a few reports of people finally getting to the end without errors on 3.1.0 💃 try it out?

been refactoring a lot of the template work in wtf_wikipedia, so expect some changes with that formatting in the next few releases

e501 commented 6 years ago

Many thanks for the update. I downloaded the most recent dump of enwiki last night and am now running dumpster. A couple of "duplicate entries" scrolled by while watching the console. The following error also scrolled by:

2 +1,000 pages - 282ms - "Yoo Dong-geun"

---Error on "Ikeje Asogwa" TypeError: Cannot read property 'replace' of undefined

As I was typing this, the following message indicated that one of the workers finished:

 current: 3,261,461 pages  - "Swimming at the 2013 World Aquatics Championships – Men's 50 metre breaststroke"

2 +1,000 pages - 422ms - "1987 IIHF Asian Oceanic Junior U18 Championship"

3 +1,000 pages - 364ms - "Bang Son Station"

0 +1,000 pages - 573ms - "WNIC"

2 +1,000 pages - 299ms - "Strophanthus eminii"

2 +1,000 pages - 308ms - "John Cogswell"

2 +1 pages - 3ms - "United States presidential election in Georgia, 1988"

💪  worker #2 has finished 💪
  - 3 workers still running -

0 +1,000 pages - 497ms - "North Zulch, Texas"

1 +1,000 pages - 358ms - "Drakesbad Guest Ranch"

3 +1,000 pages - 378ms - "Gottfried Knoche"

 current: 3,268,998 pages  - "Rosalie Woodruff"     

... the system monitor confirms that only three workers are still active. It seems this may be a premature termination of that worker thread/process.

On a separate yet related topic, I would like to track down more details and learn more about the practical issues/challenges of parsing Wikipedia pages that are based on categories, lists, and navigation templates (https://en.wikipedia.org/wiki/Wikipedia:Categories,_lists,_and_navigation_templates#Navigation_templates). Of particular interest are navbox and sidebar navigation templates (https://en.wikipedia.org/wiki/Wikipedia:Navigation_template).

Looking forward to working with a more complete set of Wikipedia pages.

Later today, I'll be sure to post an update.

Thanks again !

e501 commented 6 years ago

Another error was reported to the console ... the respective message context follows:

 current: 4,745,998 pages  - "Thomas Stelzer"

0 +1,000 pages - 358ms - "Dunback and Makareao Branches"

3 +1,000 pages - 378ms - "Trinidad Morgades Besari"

1 +1,000 pages - 447ms - "Tōnohama Station"

0 +1,000 pages - 410ms - "A'isha bint Talhah"

---Error on "Gustavo Isaza MejΓ­a" TypeError: Cannot read property 'replace' of undefined

Another worker finished ... the respective console messages follow:

 current: 4,989,998 pages  - "Guitar Rock Tour"

0 +1,000 pages - 338ms - "Signaling System No. 5"

3 +1,000 pages - 355ms - "Benchmark School"

1 +1,000 pages - 318ms - "Eucalyptus × tetragona"

0 +1,000 pages - 339ms - "Fartein Valen"

0 +70 pages - 37ms - "Dimensions (Freedom Call album)"

💪  worker #0 has finished 💪
  - 2 workers still running -

3 +1,000 pages - 282ms - "Bounty Hunters 2: Hardball"

1 +1,000 pages - 362ms - "List of Doctor Who music releases"

 current: 4,996,068 pages  - "Malato"     

Note that I opted NOT to skip the disambiguation and redirect pages.

spencermountain commented 6 years ago

hey thanks e501, that's great!

by any chance did you get a line number for those "Ikeje Asogwa" or "Gustavo Isaza Mejía" errors? I'll try to hunt that down today.

Yeah, there can be a great deal of difference between the workers' progress, sometimes 20% or so. Having one worker finish before the others is a drag, and something I'm currently thinking about how to improve. Any ideas on this are welcome. If we're expecting 5m pages or so, it sounds reasonable that one would finish early at 3.2m.

Yeah, the template world in en-wikipedia alone is pretty crazy. I'd like to see a version of dumpster roll-through the navboxes too, but right now it's sticking with pages in the 0 namespace. You can see the parsing logic in wtf_wikipedia coming along here, but it's not very clever or well-done right now.

Parsing the navboxes would be cool, and doable with that code. Dumpster-dive would need to pay attention to namespace 10. It's a cool idea.
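roughly, the change would just be letting another namespace id through when each page's <ns> tag gets checked - a sketch only, not the real dumpster-dive code:

    // mediawiki namespaces to keep: 0 = articles, 10 = templates (where navboxes live)
    const WANTED_NAMESPACES = new Set([0, 10])

    const keepPage = function(pageXml) {
      const m = pageXml.match(/<ns>(\d+)<\/ns>/)
      const ns = m ? parseInt(m[1], 10) : -1
      return WANTED_NAMESPACES.has(ns)
    }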

spencermountain commented 6 years ago

oh hey, just found that error. nevermind.

e501 commented 6 years ago

Wow ... only took a little over two hours to load in the rest of the pages

 current: 5,650,035 pages  - "Charles John Talbot"     

1 +1,000 pages - 338ms - "Lilium maritimum"

1 +1,000 pages - 284ms - "Romanus Nadolney"

1 +856 pages - 231ms - "TGSCOM"

💪  worker #1 has finished 💪
  - 0 workers still running -

  👍  closing down.

 -- final count is 5,652,891 pages --
   took 2.2 hours
          🎉
spencermountain commented 6 years ago

you did en-wikipedia in 2 hours?! how??! What's your computer like?!!

e501 commented 6 years ago

In hindsight, I think I still had the pages from previous runs of earlier versions that terminated early ... I forgot to verify that I had deleted the enwiki database from those runs.

Since I downloaded the most recent version of enwiki, I have deleted the db in mongodb and am re-running the code.

Note that Mongo said the enwiki db was 34.967GB. Also, the following is the console output that provides an estimate of the expected run time:

              oh hi 👋
     size:              62.7 GB
                      4 workers
                   15.7 GB each
     estimate:          4.6 hrs
     ---------------------------
e501 commented 6 years ago

Hmmm ...

The code ran for about the same amount of time ...

  👍  closing down.

 -- final count is 5,652,891 pages --
   took 2.2 hours
          🎉

... but the enwiki db is only 27.090GB

e501 commented 6 years ago

It may be that the recent server upgrade (i7, DDR4 RAM, and a RAID array) helps speed up the process. Currently re-running the dumpster code ... as expected/hoped, the duplicate pages are recognized and no new pages are being entered/added.

On a separate note, for the mongodb entries, I am not sure how to identify which sentences go with which paragraphs within a respective section. Ideally, I would like to be able to make this next level of distinction down to paragraph resolution. Please let me know if there may be some additional metadata for further segmenting the sentences.
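In the meantime, I may try approximating paragraphs myself by re-splitting each section's plain text on blank lines - a rough sketch, assuming I keep the plain text of each section available somewhere:

    // blank line = paragraph break (an approximation of wikitext paragraphs)
    const splitParagraphs = function(sectionText) {
      return sectionText
        .split(/\n\s*\n/)
        .map(p => p.trim())
        .filter(p => p.length > 0)
    }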

Lots of fun !

spencermountain commented 6 years ago

;) yeah! paragraphs are not something we currently support, but should!

yeah, you're the current record-holder for parsing speed. I wanna try scripting ec2 stuff, to make it a one-liner to run this on a big machine. If you have any experience with this, I'd love some help.

MTKnife commented 6 years ago

It's likely gonna be a few days before I get a chance to try it again, but I'm looking forward to it.

e501 commented 6 years ago

A more browser-based (e.g. liquid-computing) approach might be interesting and would enable better utilization of a broad range of personal (and shared) web-enabled devices (e.g. desktops/workstations, laptops, phones) [1]. It might also address some of the data-streaming aspects of the task and enable more of a dataflow-oriented model of computation.

Look forward to your feedback/comments and suggestions.

REFS

[1] Pham, Quoc-Viet, et al. "Decentralized Computation Offloading and Resource Allocation in Heterogeneous Networks with Mobile Edge Computing." arXiv preprint arXiv:1803.00683 (2018). https://arxiv.org/pdf/1803.00683