spencermountain / dumpster-dive

roll a wikipedia dump into mongo

Output #62

Closed SeansRightHere closed 6 years ago

SeansRightHere commented 6 years ago

Does it modify the existing file, or output to a different file?

aymansalama commented 6 years ago

Super cool project, thank you so much. I was working on a similar project for a couple of months. By the way, I'm offering my help if you need an extra pair of hands; I have some experience. Let me know if you need to know more about me.

I got dumpster-dive and it seems to be working on Google Cloud (GCP). Forgive my naive questions below:

  1. Even though dumpster recognizes that I have 24 cores on GCP, it still runs on only 6 cores.
  2. The time estimate doesn't work well; the predicted time is far too short compared to reality.
  3. I was working on enwiki part 1, which is 30K++ pages, but it only uploaded 600++ pages. I'm not sure if there are multiple pages per page; I'm still investigating. Possibly there are many duplicates.
  4. For the whole enwiki, I got tens of errors like the following (the first is shown in full; the rest differ only in the reported size):

====error!=== { MongoError: BSONObj size: 16927305 (0x1024A49) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"
    at /usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:581:63
    at authenticateStragglers (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:504:16)
    at Connection.messageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:540:5)
    at emitMessageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:310:10)
    at Socket. (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:453:17)
    at Socket.emit (events.js:182:13)
    at addChunk (_stream_readable.js:283:12)
    at readableAddChunk (_stream_readable.js:264:11)
    at Socket.Readable.push (_stream_readable.js:219:10)
    at TCP.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
  ok: 0,
  errmsg: 'BSONObj size: 16927305 (0x1024A49) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"',
  code: 10334,
  codeName: 'BSONObjectTooLarge',
  name: 'MongoError',
  [Symbol(mongoErrorContextSymbol)]: {} }

The remaining errors are identical apart from the offending BSONObj sizes: 16891063 (0x101BCB7), 16954395 (0x102B41B), and 16912181 (0x1020F35, repeated several times).

I've attached the current working logs:

dumpster-log-pat1.txt

spencermountain commented 6 years ago

hi @SeansRightHere it does not modify the xml file at all. It outputs to a mongodb database.
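For example, once a run finishes you can check the result from the mongo shell. A quick sketch, assuming the database is named enwiki (whatever name you passed in) and the default pages collection:

```sh
# a quick sanity-check from the mongo shell
mongo enwiki --eval 'db.pages.findOne({ title: "Toronto" })'
```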

hey Ayman! yeah! I need you. I'm a clumsy web-developer and am often in-over-my-head.

1) core-count: yeah, it's getting that from require('os').cpus().length, which I assumed worked, but I haven't tested it on anything fancy.

2) estimate time: this is a pretty silly mbyte-per-minute ballpark number. Maybe we should change this, or do a smarter strategy.

3) missing pages: whoa, that sounds bad. Lemme know if you can trace that down. I don't think there would be duplicate titles in the dump. If there are duplicates, it should yell about it here after mongo tries to write them.

4) errors: yeah, it writes pages in batches to speed things up, but I think we need to turn this down. It defaults to 800 pages at a time now, but happy to change this to whatever; see the sketch below.
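A minimal sketch of turning the batch size down, assuming batch_size is exposed as an option (it's the constant in config.js) and the numbers are just back-of-envelope:

```js
// a sketch, not a tested fix: shrink the write batch so a single
// insert stays under mongo's ~16MB BSON document/message cap
const dumpster = require('dumpster-dive')

dumpster({
  file: '/path/to/enwiki-latest-pages-articles.xml', // illustrative path
  db: 'enwiki',
  // 16MB / 800 pages is only ~21kb per page on average; a batch full
  // of long articles can blow past the cap, so use a smaller batch
  batch_size: 200,
}, () => console.log('done!'))
```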

cheers

spencermountain commented 6 years ago

feel free to jump right in, PRs always welcome, i'm pretty easy!

SeansRightHere commented 6 years ago

I'm not really a coder, sorry. If it goes directly to the DB, is there a way I can specify an output file?


spencermountain commented 6 years ago

@SeansRightHere yeah, you can look around the mongo documentation for that. When you start mongo, you can specify where it saves your database: mongod --dbpath=/whatever/data/path
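If you actually want a file, mongoexport can dump the parsed collection to JSON after a run. A sketch, assuming the database is named enwiki and dumpster-dive's default pages collection:

```sh
# start mongod against a custom data directory
mongod --dbpath=/whatever/data/path

# after dumpster-dive finishes, export the collection to a plain json file
mongoexport --db=enwiki --collection=pages --out=pages.json
```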

aymansalama commented 6 years ago

@spencermountain excellent work, and thank you for your reply. The scripts are still running and the CPUs are still hot; I am running everything on GCP. I will let you know if I get more errors. I have another observation, though I'm not sure yet if it's a valid remark: I tried to run two dumpster commands on two different datasets, and I think that while one is running, the other goes silent and never progresses. Also, when I try to load the wikidatawiki file, it takes too long, and the data loaded into mongo contains a lot of special characters and unicode escapes. Sample file: wikidatawiki-20180901-pages-articles.xml.bz2 (47.1 GB). However, it works well on enwiki-latest-pages-articles.xml, and that data loads properly.

Let me know how I can help; I have good experience in web development. You can email me for details at ay.salama@gmail.com.

spencermountain commented 6 years ago

whoa, wikidata has an xml dump? is it in the same format? I doubt this library would work on a wikidata export, but it would probably see the xml and try to do stuff. what does the wikidata xml look like? can you do a head command on it?
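Something like this should show the top of the file without unpacking the whole 47GB archive:

```sh
# peek at the first 150 lines of the compressed dump
bzcat wikidatawiki-20180901-pages-articles.xml.bz2 | head -150
```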

SeansRightHere commented 6 years ago

It had 20-30 errors on mine, but 6 hours later it said it had completed the 147GB of wiki xml.


aymansalama commented 6 years ago

@spencermountain wikidata makes its dump available in three formats: JSON, RDF, and XML. Ref: https://www.wikidata.org/wiki/Wikidata:Database_download. Dumpster actually works on it very hard; it processes the file and uploads some data, but you can visually see encoding issues in the uploaded data. As far as I understand, it doesn't make sense to use dumpster to load wikidata in XML format, since wikidata already provides its data as JSON, which you can upload directly to mongo. Am I correct?
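For reference, a rough sketch of loading the JSON dump straight into mongo. This assumes the dump's documented layout (one entity per line inside a single giant array) and an illustrative filename:

```sh
# strip the array brackets and trailing commas, then stream into mongo;
# mongoimport reads newline-delimited json from stdin when no --file is given
bzcat wikidata-20180901-all.json.bz2 \
  | sed -e 's/,$//' -e '/^\[$/d' -e '/^\]$/d' \
  | mongoimport --db=wikidata --collection=entities
```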

I was uploading the wikidata files purely for experimental purposes.

Here is the head -150 of the wikidata XML; I have attached the file. Of course, you can download the full dumps from the link above.

head-150-wikidatawiki-20180901-pages-articles1.xml-p1p235321.txt

[Head of wikidatawiki-20180901-pages-articles1.xml-p1p235321: the standard MediaWiki dump siteinfo header (sitename Wikidata, dbname wikidatawiki, base https://www.wikidata.org/wiki/Wikidata:Main_Page, generator MediaWiki 1.32.0-wmf.19, case first-letter, and the namespace list: Media, Special, Talk, User, Wikidata, File, MediaWiki, Template, Help, Category, Property, Query, Lexeme, Module, Translations, Gadget, Gadget definition, Topic, plus their talk namespaces), followed by the first page, Wikidata:Main Page/Content (namespace 4, page id 1, revision 665868355 of 2018-04-15 by Dataeast0000001, comment "removing unwanted space", model wikitext, format text/x-wiki), whose text begins with ordinary wikitext templates.]

spencermountain commented 6 years ago

yes, that's correct. thanks, wikidata is so weird!

spencermountain commented 6 years ago

gonna close this, think problems are covered elsewhere