Closed: SeansRightHere closed this issue 6 years ago
Super cool project, thank you so much. I was working on a similar project for a couple of months. By the way, I'm offering my help if you need someone to get their hands dirty; I have some experience. Let me know if you need to know more about me.
I got dumpster-dive running and it seems to be working on Google Cloud (GCP). Forgive my naive questions below:
====error!===
{ MongoError: BSONObj size: 16927305 (0x1024A49) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"
at /usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:581:63
at authenticateStragglers (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:504:16)
at Connection.messageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:540:5)
at emitMessageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:310:10)
at Socket.
====error!===
{ MongoError: BSONObj size: 16891063 (0x101BCB7) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"
at /usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:581:63
at authenticateStragglers (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:504:16)
at Connection.messageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:540:5)
at emitMessageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:310:10)
at Socket.
====error!===
{ MongoError: BSONObj size: 16954395 (0x102B41B) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"
at /usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:581:63
at authenticateStragglers (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:504:16)
at Connection.messageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:540:5)
at emitMessageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:310:10)
at Socket.
====error!===
{ MongoError: BSONObj size: 16912181 (0x1020F35) is invalid. Size must be between 0 and 16793600(16MB) First element: insert: "pages"
at /usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:581:63
at authenticateStragglers (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:504:16)
at Connection.messageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/pool.js:540:5)
at emitMessageHandler (/usr/lib/node_modules/dumpster-dive/node_modules/mongodb-core/lib/connection/connection.js:310:10)
at Socket.
Attached are the current working logs.
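The errors above mean each write batch serialized to just over MongoDB's hard 16MB BSON limit (16793600 bytes), so mongo rejected the whole insert. A minimal sketch, not dumpster-dive's actual code, of how a writer could cap each insertMany payload by measuring documents with the bson package before sending them (the function and collection names here are illustrative):

// split docs into insertMany batches that stay under mongo's 16MB message cap
// assumes recent `mongodb` and `bson` packages; calculateObjectSize is a
// top-level export in bson 4+
const { calculateObjectSize } = require('bson')

const MAX_BYTES = 15 * 1024 * 1024 // leave headroom under the 16793600-byte limit

async function insertInSafeBatches(collection, docs) {
  let batch = []
  let bytes = 0
  for (const doc of docs) {
    const size = calculateObjectSize(doc) // serialized BSON size of this one doc
    if (batch.length > 0 && bytes + size > MAX_BYTES) {
      await collection.insertMany(batch) // flush before this doc would overflow
      batch = []
      bytes = 0
    }
    batch.push(doc)
    bytes += size
  }
  if (batch.length > 0) {
    await collection.insertMany(batch) // flush the remainder
  }
}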
Does it modify the existing file, or output to a different file?

hi @SeansRightHere it does not modify the xml file at all. It outputs to a mongodb database.
hey Ayman! yeah! I need you. I'm a clumsy web developer and am often in over my head.
1) core-count - yeah, it's getting that from require('os').cpus().length (https://github.com/spencermountain/dumpster-dive/blob/master/src/01-prepwork.js#L4), which I assumed worked but haven't tested on anything fancy.
2) estimate time - this is a pretty silly mbyte-per-minute ballpark number (https://github.com/spencermountain/dumpster-dive/blob/master/src/02-Worker-pool.js#L12). Maybe we should change this, or do a smarter strategy.
3) missing pages - whoa, that sounds bad. Lemme know if you can trace that down. I don't think there would be duplicate titles in the dump. If there are duplicates, it should yell about it here (https://github.com/spencermountain/dumpster-dive/blob/master/src/worker/03-write-db.js#L25) after mongo tries to write them.
4) errors - yeah, it writes pages in batches to speed things up, but I think we need to turn this down. It defaults to 800 pages at a time (https://github.com/spencermountain/dumpster-dive/blob/master/config.js#L3), but happy to change this to whatever - see the sketch after this comment.
cheers
feel free to jump right in, PRs always welcome, I'm pretty easy!
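If that batch size is the knob, here is a minimal sketch of turning it down, assuming the 800-page default in config.js can be overridden through the options object (that override is an assumption worth checking against the README; the file and db names are just examples):

// run dumpster-dive with a smaller write batch
// `batch_size` is assumed to override the default in config.js#L3
const dumpster = require('dumpster-dive')

dumpster({
  file: './enwiki-latest-pages-articles.xml',
  db: 'enwiki',
  batch_size: 200, // down from the 800 default, to stay under the 16MB BSON cap
}, () => console.log('done!'))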
I'm not really a coder, sorry. If it writes directly to the DB, is there a way I can specify an output file?
@SeansRightHere yeah, you can look around the mongo documentation for that. When you start mongo, you can specify where it saves your database:
mongod --dbpath=/whatever/data/path
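If the goal is a file you can read afterwards rather than a data directory, you could also export a collection to JSON once the run finishes, e.g. with mongoexport (the database and collection names below are assumptions based on the logs above):

mongoexport --db enwiki --collection pages --out pages.json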
@spencermountain excellent work, and thank you for your reply. The scripts are still running and the CPUs are still hot. I am running everything on GCP, and I will let you know if I get more errors. I have another observation, though I'm not sure yet if it is a valid remark: I tried to run two dumpster commands for two different datasets, and I think when one is running, the other goes silent and never progresses. Also, when I try to load the wikidatawiki file, it takes too long and the data loaded into mongo has a lot of special characters and unicode escapes. Sample file name: wikidatawiki-20180901-pages-articles.xml.bz2 (47.1 GB). However, it works well on enwiki-latest-pages-articles.xml and that data is loaded properly.
Let me know how I can help. I have good experience in web development. You can email me with the details at ay.salama@gmail.com.
whoa, wikidata has an xml dump? is it in the same format?
I doubt this library would work on a wikidata export, but it would probably see the xml and try to do stuff.
what does the wikidata xml look like? can you do a head command on it?
It had 20-30 errors on mine, but 6 hours later it said it had completed the 147GB of wiki xml.
@spencermountain wikidata provides its dump in three formats: JSON, RDF, and XML. Ref: https://www.wikidata.org/wiki/Wikidata:Database_download. Dumpster actually works on it very hard; it processes it and uploads some data, but you can visually see encoding issues in the uploaded data. As far as I understand, it doesn't make sense to use dumpster to load wikidata in xml format, since wikidata already provides its data in JSON, which you can upload directly to mongo. Am I correct?
I was uploading the wikidata files for experimental purposes.
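A minimal sketch of that direct-JSON route, assuming the usual wikidata JSON dump layout (one entity per line inside one big array, each line ending with a comma) and an already-decompressed file; the file, db, and collection names are illustrative:

// stream the decompressed wikidata JSON dump into mongo, one entity per line
// assumes Node 11+ (for `for await` over readline) and the `mongodb` driver
const fs = require('fs')
const readline = require('readline')
const { MongoClient } = require('mongodb')

async function loadWikidata(path) {
  const client = await MongoClient.connect('mongodb://localhost:27017')
  const coll = client.db('wikidata').collection('entities')
  const rl = readline.createInterface({ input: fs.createReadStream(path) })
  let batch = []
  for await (let line of rl) {
    line = line.trim().replace(/,$/, '') // each entity line ends with a comma
    if (line === '[' || line === ']' || line === '') continue // skip array brackets
    batch.push(JSON.parse(line))
    if (batch.length >= 200) { // small batches, well under the 16MB cap
      await coll.insertMany(batch)
      batch = []
    }
  }
  if (batch.length) await coll.insertMany(batch) // flush the remainder
  await client.close()
}

loadWikidata('./wikidata-all.json')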
Here is the head -150 of the wikidata xml; I have attached the file. Of course, you can download the full files from the link above.
head-150-wikidatawiki-20180901-pages-articles1.xml-p1p235321.txt
yes, that's correct. thanks, wikidata is so weird!
gonna close this, think problems are covered elsewhere