Closed. e501 closed this issue 6 years ago.
Sample output from the terminal console; the last line is due to program termination (i.e. Ctrl-C):
Error: Cannot find module '../../infobox/infobox'
at Function.Module._resolveFilename (module.js:542:15)
at Function.Module._load (module.js:472:25)
at Module.require (module.js:585:17)
at require (internal/module.js:11:18)
at Object.
hey thanks e501, which node version are you running? I suspect you'll need node 6+; I should mention that in the docs. If that's not the issue, I'll try to reproduce it.
When I checked my mongodb configuration, I decided to upgrade to the default 3.6 configuration for ubuntu 16.04. Thus, I completely uninstalled the older version and did a fresh reinstall of the 3.6 community version of mongodb.
Unfortunately, I reproduced the same errors as before. The following is the output of the log file for the fresh install of mongodb:
2018-04-28T10:29:20.847-0700 I STORAGE [initandlisten] createCollection: admin.system.version with provided UUID: 5db1afbd-0191-49f2-b000-a51d94e85c04
2018-04-28T10:29:21.115-0700 I COMMAND [initandlisten] setting featureCompatibilityVersion to 3.6
2018-04-28T10:29:21.117-0700 I STORAGE [initandlisten] createCollection: local.startup_log with generated UUID: d31624be-d30c-450a-9e2c-9ddb6ea8cabd
2018-04-28T10:29:21.243-0700 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/var/lib/mongodb/diagnostic.data'
2018-04-28T10:29:21.243-0700 I NETWORK [initandlisten] waiting for connections on port 27017
2018-04-28T10:34:21.243-0700 I STORAGE [thread1] createCollection: config.system.sessions with generated UUID: 80bf0be9-7471-4f42-9e78-303babd3b85d
2018-04-28T10:34:21.513-0700 I INDEX [thread1] build index on: config.system.sessions properties: { v: 2, key: { lastUse: 1 }, name: "lsidTTLIndex", ns: "config.system.sessions", expireAfterSeconds: 1800 }
2018-04-28T10:34:21.513-0700 I INDEX [thread1] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2018-04-28T10:34:21.514-0700 I INDEX [thread1] build index done. scanned 0 total records. 0 secs
2018-04-28T10:34:21.515-0700 I COMMAND [thread1] command config.$cmd command: createIndexes { createIndexes: "system.sessions", indexes: [ { key: { lastUse: 1 }, name: "lsidTTLIndex", expireAfterSeconds: 1800 } ], $db: "config" } numYields:0 reslen:98 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_msg 271ms
2018-04-28T10:38:23.220-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54510 #1 (1 connection now open)
2018-04-28T10:38:23.223-0700 I NETWORK [conn1] received client metadata from 127.0.0.1:54510 conn1: { driver: { name: "nodejs", version: "2.2.33" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.9.9-040909-generic" }, platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17" }
2018-04-28T10:38:23.229-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54512 #2 (2 connections now open)
2018-04-28T10:38:23.229-0700 I NETWORK [conn2] received client metadata from 127.0.0.1:54512 conn2: { driver: { name: "nodejs", version: "2.2.33" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.9.9-040909-generic" }, platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17" }
2018-04-28T10:39:38.320-0700 I NETWORK [listener] connection accepted from 127.0.0.1:54516 #3 (3 connections now open)
The node version reported by mongodb (platform: "Node.js v9.0.0, LE, mongodb-core: 2.1.17") agrees with what I checked via the command line. Also, it looks like your code is making the connections with mongodb.
In the error output to the console, there is a peculiar-looking double dot in the following: "Object.Module._extensions..js (module.js:652:10)". I am new to JavaScript, so I am not sure how best to interpret it.
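(For what it's worth, the double dot is not a typo. Node keeps its file loaders in an object keyed by file extension, so that frame is really Module._extensions[".js"], the handler for ".js" files. A small stand-in object, not Node's actual internals, shows why the name renders that way:)

```javascript
// Illustration only: loaders are keyed by extension, and each key itself
// starts with a dot, so a call through _extensions['.js'] shows up in a
// stack trace as "_extensions..js".
const _extensions = {
  '.js': function loadJs() { return 'compiled'; },
  '.json': function loadJson() { return 'parsed'; },
};

console.log(Object.keys(_extensions)); // [ '.js', '.json' ]
console.log(_extensions['.js']()); // 'compiled'
```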
If possible, I would like to explore doing a local build of the source and try that as a baseline that would help me learn how to further explore these sorts of issues.
Many thanks for helping track down why the code does not want to run properly.
From an initial swag at a local build, it looks like there may be an issue with the loader resolving the "infobox" module path for the wtf_wikipedia module/package.
hey, I think I figured out this ../infobox error: the require path was titlecased, but some operating systems (with case-insensitive filesystems) seem to tolerate that.
yeah, mongo 3 is correct. added to the docs.
can you try v3.0.1?
cheers
Very nice!
afwiki executed with no problems and loaded the pages into mongodb.
When I ran dumpster (v3.0.2) for enwiki, the following is the initial listing of console messages (which included a "JavaScript heap out of memory" error):
----------
oh hi
total file size: 62.2 GB
4 cpu cores detected.
- each worker will be given: 15.5 GB -
----------
- wrote 1,000 pages - 14.2s (worker #23291)
page: #952 - "Alexander of Greece (disambiguation)"
page: #952 - "Alexander of Greece (disambiguation)"
<--- Last few GCs --->
[23296:0x3bb31b0] 49449 ms: Mark-sweep 1200.4 (1434.3) -> 1200.3 (1434.3) MB, 314.3 / 0.1 ms allocation failure GC in old space requested
[23296:0x3bb31b0] 49698 ms: Mark-sweep 1200.3 (1434.3) -> 1200.3 (1413.3) MB, 249.1 / 0.0 ms last resort GC in old space requested
[23296:0x3bb31b0] 49966 ms: Mark-sweep 1200.3 (1413.3) -> 1200.3 (1407.3) MB, 267.8 / 0.0 ms last resort GC in old space requested
<--- JS stacktrace --->
==== JS stack trace =========================================
0: ExitFrame [pc: 0xe810118427d]
Security context: 0x30dfecda06a9
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: node::Abort() [/usr/local/bin/node]
2: 0x87b56c [/usr/local/bin/node]
3: v8::Utils::ReportOOMFailure(char const*, bool) [/usr/local/bin/node]
4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/usr/local/bin/node]
5: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/usr/local/bin/node]
6: v8::internal::String::SlowFlatten(v8::internal::Handle
page: #2,826 - "Calvin Coolidge"
... over a million pages and mongo reports "enwiki 9.845GB"
At the console, after reporting that "a worker has finished", the same message keeps getting printed, as shown in the following:
page: #1,188,005 - "Steelyard balance"
wrote 1,251,113 pages - 0.2s (worker #23291)
a worker has finished
page: #1,189,073 - "The History Mix Volume 1"
page: #1,189,073 - "The History Mix Volume 1"
page: #1,189,073 - "The History Mix Volume 1"
[the same line repeated indefinitely until the run was terminated]
I terminated the initial run (as discussed previously) and re-ran the code. Note that Wikipedia reports a total of 5,631,118 content pages (https://en.wikipedia.org/wiki/Special:Statistics).
The code ran to completion with the following console messages:
- wrote 1,251,000 pages - 0.1s (worker #24376)
- wrote 1,251,113 pages - 0.3s (worker #24376)
- wrote 1,251,113 pages - 0.0s (worker #24376)
a worker has finished
- 3 workers still running -
a worker has finished
- 2 workers still running -
- wrote 1,242,000 pages - 4.8s (worker #24364)
- wrote 1,242,000 pages - 0.1s (worker #24364)
page: #1,189,073 - "The History Mix Volume 1"
wrote 1,251,113 pages - 0.0s (worker #24364)
a worker has finished
a worker has finished
page: #1,189,073 - "The History Mix Volume 1"
closing down.
-- final count is 1,189,073 pages -- took 2.6 hours
From some web searches, similar "JavaScript heap out of memory" errors were resolved by increasing --max_old_space_size.
Thus, I dropped/deleted the enwiki database in mongodb and re-ran dumpster with the following command line: node --max_old_space_size=4000000 /usr/local/lib/node_modules/dumpster-dive/bin/dumpster ./enwiki-latest-pages-articles.xml
The code did not report any error messages to the console and finished up with the following:
page: #1,183,402 - "Jonathan Marray"
wrote 1,236,000 pages - 4.1s (worker #26756)
wrote 1,236,000 pages - 3.7s (worker #26739)
wrote 1,225,000 pages - 2.5s (worker #26749)
wrote 1,248,000 pages - 2.5s (worker #26740)
wrote 1,237,000 pages - 3.5s (worker #26756)
wrote 1,237,000 pages - 2.9s (worker #26739)
wrote 1,249,000 pages - 2.8s (worker #26740)
wrote 1,226,000 pages - 3.2s (worker #26749)
wrote 1,238,000 pages - 3.2s (worker #26756)
wrote 1,250,000 pages - 2.8s (worker #26740) page: #1,187,042 - "Peermade"
wrote 1,238,000 pages - 3.1s (worker #26739)
wrote 1,227,000 pages - 2.7s (worker #26749)
wrote 1,251,000 pages - 2.5s (worker #26740)
wrote 1,228,000 pages - 2.3s (worker #26749)
wrote 1,239,000 pages - 3.6s (worker #26756)
wrote 1,251,113 pages - 0.2s (worker #26740)
wrote 1,239,000 pages - 2.7s (worker #26739)
a worker has finished
wrote 1,229,000 pages - 1.8s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,240,000 pages - 2.1s (worker #26756)
wrote 1,240,000 pages - 2.1s (worker #26739)
wrote 1,230,000 pages - 1.8s (worker #26749)
wrote 1,241,000 pages - 2.4s (worker #26756)
wrote 1,241,000 pages - 2.5s (worker #26739)
wrote 1,231,000 pages - 2.2s (worker #26749)
wrote 1,242,000 pages - 2.9s (worker #26756)
wrote 1,242,000 pages - 2.8s (worker #26739)
wrote 1,232,000 pages - 2.2s (worker #26749)
wrote 1,243,000 pages - 2.4s (worker #26756)
wrote 1,243,000 pages - 2.5s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,233,000 pages - 2.9s (worker #26749)
wrote 1,244,000 pages - 2.3s (worker #26756)
wrote 1,244,000 pages - 2.3s (worker #26739)
wrote 1,234,000 pages - 2.6s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,245,000 pages - 2.4s (worker #26756)
wrote 1,245,000 pages - 2.8s (worker #26739)
wrote 1,235,000 pages - 2.7s (worker #26749)
wrote 1,246,000 pages - 2.0s (worker #26756)
wrote 1,246,000 pages - 2.1s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,247,000 pages - 2.3s (worker #26756)
wrote 1,236,000 pages - 2.9s (worker #26749)
wrote 1,247,000 pages - 2.4s (worker #26739)
wrote 1,248,000 pages - 2.0s (worker #26756)
wrote 1,237,000 pages - 2.6s (worker #26749)
wrote 1,248,000 pages - 2.3s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,249,000 pages - 2.7s (worker #26756)
wrote 1,238,000 pages - 2.6s (worker #26749)
wrote 1,249,000 pages - 2.6s (worker #26739)
wrote 1,250,000 pages - 2.3s (worker #26756)
wrote 1,239,000 pages - 2.2s (worker #26749)
wrote 1,250,000 pages - 2.5s (worker #26739) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,240,000 pages - 2.0s (worker #26749)
wrote 1,251,000 pages - 2.7s (worker #26756)
wrote 1,251,113 pages - 0.2s (worker #26756)
wrote 1,251,000 pages - 2.2s (worker #26739)
a worker has finished
wrote 1,251,113 pages - 0.2s (worker #26739)
a worker has finished
wrote 1,241,000 pages - 2.2s (worker #26749)
wrote 1,242,000 pages - 2.3s (worker #26749)
wrote 1,243,000 pages - 2.0s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,244,000 pages - 2.0s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,245,000 pages - 2.0s (worker #26749)
wrote 1,246,000 pages - 1.9s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,247,000 pages - 1.9s (worker #26749)
wrote 1,248,000 pages - 1.8s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,249,000 pages - 2.1s (worker #26749)
wrote 1,250,000 pages - 1.9s (worker #26749) page: #1,189,073 - "The History Mix Volume 1"
wrote 1,251,000 pages - 2.0s (worker #26749)
wrote 1,251,113 pages - 0.2s (worker #26749)
a worker has finished
- 0 workers still running -
closing down.
-- final count is 1,189,073 pages -- took 111.7 minutes
hey, thanks. That's really helpful with the --max_old_space_size arg.
Yeah, that's for en-wikipedia? Something is off then; the count should be a lot closer to ~4.5m. A friend of mine ran it yesterday and had something similar. I think there are some errors being buried somewhere.
i'll try to pick-through it tomorrow. Think it's probably something about connections to mongo timing out, or something like that.
will ping you when i have something. thanks
Thanks for the feedback ... yeah, Wikipedia stats report a total of 5,631,118 content pages (https://en.wikipedia.org/wiki/Special:Statistics).
While working through the details of why we are not getting all the pages loaded, I would like to explore how best to leverage the structure you have implemented for the parsed pages. From the pages I have revisited, there seem to be some baseline procedures/processes that could provide more NLP-oriented statistical summaries (e.g. sentence length, number of sentences per section).
Any feedback/thoughts that you may have, are also of interest.
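A quick sketch of the kind of summary meant above, assuming each stored page has a sections array whose entries carry a sentences list of strings (the field names here are assumptions for illustration, not the actual dumpster-dive schema):

```javascript
// Compute per-section sentence counts and average sentence length (in words)
// for one page document. The document shape below is hypothetical.
function sectionStats(page) {
  const counts = page.sections.map((s) => s.sentences.length);
  const allSentences = page.sections.flatMap((s) => s.sentences);
  const totalWords = allSentences.reduce(
    (sum, text) => sum + text.split(/\s+/).length,
    0
  );
  return {
    sections: page.sections.length,
    sentencesPerSection: counts,
    avgSentenceLength: allSentences.length ? totalWords / allSentences.length : 0,
  };
}

const demo = {
  sections: [
    { sentences: ['One two three.', 'Four five.'] },
    { sentences: ['Six seven eight nine.'] },
  ],
};
console.log(sectionStats(demo));
// { sections: 2, sentencesPerSection: [ 2, 1 ], avgSentenceLength: 3 }
```

Run against a real collection, the same function could be mapped over a mongo cursor, accumulating the totals across pages.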
I agree that could be useful, though I'd make it optional if it has any performance impact at all.
BTW, I was going to try out the Redis option again, but I don't see the instructions for invoking it in the current readme. How do I do it?
It's working so far, but it's still demanding 8 "cores" on a 4-core machine, which makes getting anything else done at that machine very difficult. Is there a way yet to specify the number of cores/threads?
Whoops....It's been running for a couple of hours now, but I just realized it hasn't put a thing in the database--nor has it thrown an error.
OK, here's what happened in my case:
The app started up 7 workers. 6 of those finished, each having processed close to 800,000 articles--that doesn't add up to quite the expected 5.6 million, but it's in the ballpark. However, the 7th worker, which was lagging the whole time, simply froze a bit short of 500,000 articles, and that was all she wrote. In addition, as I pointed out above, nothing ever got added to the database; in fact, the database was never even created.
thanks @MTKnife - it didn't write anything to the database?? the logic around database names and collection names has changed, can you double-check that?
show dbs
use whatever
show collections
yeah, about that trailing worker, that sounds familiar. I'm gonna update the mongo lib and add some logging, so we can catch this bad-boy. I think the connection is probably just timing out and needs to be restarted.
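The restart idea could take a generic shape like this (a sketch only, not the project's code): wrap each write in a retry helper that re-attempts the operation a few times before giving up.

```javascript
// Generic retry helper: re-attempts an async operation (e.g. a mongo insert
// whose connection timed out) up to `retries` times, pausing between tries.
async function withRetry(operation, retries = 3, delayMs = 100) {
  let lastErr;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastErr = err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastErr;
}

// Demo: fails twice, then succeeds on the third attempt.
let calls = 0;
withRetry(async () => {
  calls += 1;
  if (calls < 3) throw new Error('connection timed out');
  return 'inserted';
}).then((result) => console.log(result, 'after', calls, 'attempts'));
```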
@MTKnife did your output look roughly like this?
No, actually, that's another thing... I never got the article titles, and I never got anything else after that, either--just the initial (wrong) detection of the number of cores (it detects threads, not cores), followed by page-number updates. I never got the "a worker has finished" message, despite the fact that 6 workers finished.
As for the database, my instance never showed anything but "admin" and "local". I can try it again tomorrow to see if a restart does anything, but the server was responding just fine to commands in the terminal window.
It occurs to me that the behavior I saw (workers getting almost but not quite to the expected number of articles, and not seeing any "finished" messages) probably indicates that all 7 workers froze at different points.
updated to the latest mongodb drivers and improved the logging in 3.0.3. you may need to npm install again to try it.
I'm about to try it now, as soon as I can get MongoDB upgraded from 3.4 to 3.6.
BTW, yes, I think an npm install is always mandatory on a Windows system, unless you create a symlink from the user directory where npm deposits its copy to the local git repo.
EDIT: Actually, while the symlink (a "directory junction" created with mklink in Windows) worked before, I can't get it to work now, so yes, an npm install will always be necessary.
OK, just tried running it, and wow, that was ugly:
$ dumpster C:/data/Wikipedia/enwiki-20180420-pages-articles.xml
filesize: 66771313462
----------
oh hi
total file size: 62.2 GB
8 cpu cores detected.
- each worker will be given: 7.8 GB -
----------
C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62
console.log(chalk.green(' -skipping page: "' + pageObj.title + '"'))
^
TypeError: Cannot read property 'title' of null
at donePage (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:62:64)
at parseLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\01-parseLine.js:20:7)
at LineByLineReader.lr.on (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\src\worker\index.js:77:13)
at emitOne (events.js:116:13)
at LineByLineReader.emit (events.js:211:7)
at LineByLineReader._nextLine (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:129:8)
at Immediate._onImmediate (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\line-by-line\line-by-line.js:134:9)
at runCallback (timers.js:794:20)
at tryOnImmediate (timers.js:752:5)
at processImmediate [as _immediateCallback] (timers.js:729:5)
--uncaught process error--
{ ProcessTerminatedError: cancel after 0 retries!
at tasks.filter.forEach.task (C:\Users\sorr\AppData\Roaming\npm\node_modules\dumpster-dive\node_modules\worker-nodes\lib\pool.js:111:39)
at Array.forEach (
After that, there's more of the same, but it stops within a few seconds.
Still no database being added in MongoDB, BTW.
EDIT: While the error messages stopped and the program didn't issue any more output, I still had to Ctrl-C to get it to exit.
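For reference, the TypeError in this report comes from dereferencing a null pageObj; a guard of this shape (hypothetical, not the project's actual fix) would skip the bad record instead of killing the worker:

```javascript
// Hypothetical guard around the crash site in src/worker/index.js:
// skip records the parser returned as null rather than reading .title on them.
function donePage(pageObj) {
  if (!pageObj || typeof pageObj.title !== 'string') {
    return 'skipped-null-page'; // parser produced nothing usable for this line
  }
  return 'skipping page: "' + pageObj.title + '"';
}

console.log(donePage(null)); // 'skipped-null-page' -- no TypeError
console.log(donePage({ title: 'Redirect stub' }));
```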
hey folks, a few reports of people finally getting to the end without errors on 3.1.0. try it out?
been refactoring a lot of the template work in wtf_wikipedia, so expect some changes with that formatting in the next few releases
Many thanks for the update. Downloaded the most recent dump of enwiki last night and am now running dumpster. A couple of "duplicate entries" messages scrolled by while I was watching the console. The following error also scrolled by:
---Error on "Ikeje Asogwa" TypeError: Cannot read property 'replace' of undefined
As I was typing this, the following message indicated that one of the workers finished: current: 3,261,461 pages - "Swimming at the 2013 World Aquatics Championships – Men's 50 metre breaststroke"
worker #2 has finished
- 3 workers still running -
current: 3,268,998 pages - "Rosalie Woodruff"
... the system monitor confirms that only three workers are still active. It seems this may be a premature termination of the worker thread/process.
On a separate, yet related topic, I would like to track down more details and learn more about the practical issues/challenges for parsing of Wikipedia pages that are based on categories, lists, and navigation templates (https://en.wikipedia.org/wiki/Wikipedia:Categories,_lists,_and_navigation_templates#Navigation_templates). Of particular interest, are navbox and sidebar navigation-templates (https://en.wikipedia.org/wiki/Wikipedia:Navigation_template).
Looking forward to working with a more complete set of Wikipedia pages.
Later today, I'll be sure to post an update.
Thanks again!
Another error was reported to the console ... the respective message context follows: current: 4,745,998 pages - "Thomas Stelzer"
---Error on "Gustavo Isaza Mejía" TypeError: Cannot read property 'replace' of undefined
Another worker finished ... the respective console messages follow: current: 4,989,998 pages - "Guitar Rock Tour"
worker #0 has finished
- 2 workers still running -
current: 4,996,068 pages - "Malato"
Note that I opted NOT to skip the disambiguation and redirect pages.
hey thanks e501, that's great!
by any chance did you get line numbers for those Ikeje Asogwa or Gustavo Isaza Mejía errors? I'll try to hunt that down today.
Yeah, there is a great deal of difference between the progress of the workers, sometimes 20% or so. Having one worker finish before the others is a drag, and something I'm currently thinking about how to improve; any ideas on this are welcome. If we're expecting 5m pages or so, it sounds reasonable that one will finish early, at 3.2m.
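On evening out the workers, one common shape (illustrative only, not how dumpster-dive currently splits work) is to cut the file into many small byte ranges and let workers pull from a shared queue, so a fast worker can't finish hours ahead of a slow one:

```javascript
// Cut a file of `totalBytes` into fixed-size chunks. Workers would repeatedly
// take the next chunk from the shared queue until it is empty, instead of
// each owning one large fixed slice.
function makeChunks(totalBytes, chunkBytes) {
  const chunks = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    chunks.push({ start, end: Math.min(start + chunkBytes, totalBytes) });
  }
  return chunks;
}

const queue = makeChunks(1000, 300); // 4 chunks: 300 + 300 + 300 + 100 bytes
console.log(queue);
// a worker loop would be: while (queue.length) { process(queue.shift()); }
```

In practice the chunk boundaries would also need to be snapped to page boundaries in the XML, which is the hard part this sketch leaves out.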
Yeah, the template world in en-wikipedia alone is pretty crazy. I'd like to see a version of dumpster roll through the navboxes too, but right now it's sticking with pages in the 0 namespace. You can see the parsing logic in wtf_wikipedia coming along here, but it's not very clever or well-done right now.
Parsing the navboxes would be cool, and doable with that code. Dumpster-dive would need to pay attention to namespace 10. It's a cool idea.
oh hey, just found that error. nevermind.
Wow ... it only took a little over two hours to load in the rest of the pages
current: 5,650,035 pages - "Charles John Talbot"
worker #1 has finished
- 0 workers still running -
closing down.
-- final count is 5,652,891 pages --
took 2.2 hours
you did en-wikipedia in 2 hours?! how??! What's your computer like?!!
In hindsight, I think I still had the pages from previous runs of earlier versions that terminated early ... I forgot to check and verify that I had deleted the enwiki database from those previous runs.
Since I downloaded the most recent version of enwiki, I deleted the db in mongodb and am re-running the code.
oh hi
size: 62.7 GB
4 workers
15.7 GB each
estimate: 4.6 hrs
---------------------------
Hmmm ...
The code ran for about the same amount of time ...
closing down.
-- final count is 5,652,891 pages --
took 2.2 hours
... but the enwiki db is only 27.090GB
It may be that the recent server upgrade to an i7, DDR4 RAM, and a RAID array helps speed up the process. Currently re-running the dumpster code ... as expected/hoped, the duplicate pages are recognized and no new pages are being entered/added.
On a separate note, for the mongodb entries, I am not sure how to identify which sentences go with which paragraphs within a respective section. Ideally, I would like to be able to make this next level of distinction, down to paragraph resolution. Please let me know if there may be some additional metadata for further segmenting the sentences.
Lots of fun!
;) yeah! paragraphs are not something we currently support, but should!
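Until paragraph support exists, a rough workaround (assuming you keep each section's plain text around, which may not match what's actually stored in mongo) is to split on blank lines and attribute sentences to the paragraph containing them:

```javascript
// Rough paragraph segmentation over a section's raw text: split on blank
// lines, trimming any empty fragments. The input shape is an assumption.
function toParagraphs(sectionText) {
  return sectionText
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

const text = 'First paragraph, sentence one. Sentence two.\n\nSecond paragraph.';
console.log(toParagraphs(text));
// [ 'First paragraph, sentence one. Sentence two.', 'Second paragraph.' ]
```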
yeah, you're the current record-holder for parsing speed. I wanna try scripting ec2 stuff, to make it a one-liner to run this on a big machine. If you have any experience with this, I'd love some help.
It's likely gonna be a few days before I get a chance to try it again, but I'm looking forward to it.
More of a browser-based (e.g. liquid-computing) type of approach might be more interesting and enable better utilization of a broad range of personal (and shared) web-enabled computational devices (e.g. desktops/workstations, laptops, phones) [1]. This may also address some of the data-streaming aspects of the task and enable more of a dataflow-oriented model of computation.
Look forward to your feedback/comments and suggestions.
REFS [1] Pham, Quoc-Viet, et al. "Decentralized Computation Offloading and Resource Allocation in Heterogeneous Networks with Mobile Edge Computing." arXiv preprint arXiv:1803.00683 (2018) (https://arxiv.org/pdf/1803.00683)
Glad to see such great progress!
I downloaded the latest English Wikipedia dump (enwiki-latest-pages-articles.xml.bz2), as documented in the readme, and extracted it via the archive manager (OS: Ubuntu 16.04).
Now loading into mongodb via the command line (dumpster ./enwiki-latest-pages-articles.xml).
As the script is running, an error message is repeatedly displayed: "Error: Cannot find module '../../infobox/infobox' ..."
Within mongo shell, the enwiki database is not yet visible.
I need to know whether I should terminate this run (e.g. Ctrl-C) and restart with the infobox option enabled. Ideally, I would like to have the infobox pages with the other articles in mongo.
Greatly appreciate the continued progress with enabling this type of capability!