spencermountain / dumpster-dive

roll a wikipedia dump into mongo

missing pages, silent #65

Closed aymansalama closed 5 years ago

aymansalama commented 6 years ago

Dumpster-dive version 3.1.0

sudo nohup dumpster enwiki-latest-pages-articles.xml --batch_size 100 &
sudo nohup dumpster enwiki-latest-pages-articles.xml --html=true --images=false --batch_size 100 &

Results stopped at a page count of 186,441; the script finished with no error.

Dumpster-dive version 3.6.1

dumpster enwiki-latest-pages-articles.xml --html=true --images=false --batch_size 100 --verbose true --workers 20

The script finished, and the page count is 1,274,403, with no error.

The number of pages is supposed to be 5M+.
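
(A quick way to double-check the loaded count straight from the mongo shell. This is just a sketch: the database name is assumed to be enwiki, taken from the dump filename, and it lists every collection with its document count since the collection name can vary between dumpster-dive versions:)

$ mongo enwiki --quiet --eval "db.getCollectionNames().forEach(function(c){ print(c + ': ' + db[c].count()) })"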

spencermountain commented 6 years ago

yeah, this is weird. Is this on a google-cloud env? My guess, Ayman, is that it has something to do with the html output. That may be why others are getting more articles. Have you tried leaving the html param out? Worth trying.
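
i.e. the same run as before with the html/images flags simply dropped, something like:

sudo nohup dumpster enwiki-latest-pages-articles.xml --batch_size 100 &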

new release tomorrow, hopefully

aymansalama commented 6 years ago

@spencermountain yes, it is a google cloud env. I ran this experiment yesterday, and here are the results without html.

Date: 23 Sept 2018

  1. Database used: mongod --fork --syslog --dbpath /data/mongo-3-6-trial5/
  2. Command used: sudo nohup dumpster enwiki-latest-pages-articles.xml --citations=false --workers=10 &
  3. Version: dumpster-dive 3.6.1
  4. Number of uploaded pages: 2,357,382

I have downloaded a new enwiki dump of 20 Sept and am trying again as below.

Trial 6
  1. Date: 24 Sept 2018
  2. Database used: mongod --fork --syslog --dbpath /data/mongo-3-6-trial6/
  3. Command used: dumpster enwiki-20180920-pages-articles.xml --batch_size 100 &
  4. Version: dumpster-dive 3.6.1

I will let you know the results.

aymansalama commented 6 years ago

@spencermountain I have tried again, this time with no parameters except batch size. I didn't use nohup and left the session running for 3 hours. Unfortunately, same result: about 2.5m articles loaded.

I will wait for your new update.

Trial 7
  1. Date: 24 Sept 2018
  2. Database used: mongod --fork --syslog --dbpath /data/mongo-3-6-trial7/
  3. Command used: dumpster enwiki-20180920-pages-articles.xml --batch_size 100 &
  4. Version: dumpster-dive 3.6.1
  5. Number of uploaded pages:

 👍 closing down.

 -- final count is 2,576,233 pages --
   took 3.1 hours
          🎉
spencermountain commented 6 years ago

hey @aymansalama wanna test this on dumpster-dive v4.0.0? a lot has changed, so who knows.

I've also added verbose_skip as an option, so you can log the titles of the pages that are being skipped as redirects or disambig pages.
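
For example, something along these lines should capture the skipped titles to a file for later inspection (tee is just one way to keep the output; the flag syntax mirrors the other boolean options):

dumpster enwiki-latest-pages-articles.xml --verbose_skip true 2>&1 | tee skip.log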

aymansalama commented 6 years ago

@spencermountain I can't thank you enough for your hard work.

I am not sure what I am doing wrong. Here are my exact steps. @spencermountain please let me know if you can help.

- Step one, I download enwiki as below

$sudo nohup wget https://dumps.wikimedia.org/enwiki/20180920/enwiki-20180920-pages-articles.xml.bz2 &

get the size of the downloaded, unzipped file
$du -sh enwiki-20180920-pages-articles.xml
65G     enwiki-20180920-pages-articles.xml
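
(A couple of extra sanity checks that might be worth doing at this step, to rule out a truncated download or a bad decompression before suspecting the parser; the grep takes a while on a 65G file, and the page-tag count is only a rough upper bound on what dumpster can load:)

$ bzip2 -tv enwiki-20180920-pages-articles.xml.bz2     # test the archive's integrity
$ grep -c '<page>' enwiki-20180920-pages-articles.xml  # rough count of <page> entries in the dump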

- Step two, run fresh mongo

$mongod --fork --syslog --dbpath /data/mongospace/mongo-4-trail1/

$mongo --version
MongoDB shell version v4.0.2
git version: fc1573ba18aee42f97a3bb13b67af7d837826b47
OpenSSL version: OpenSSL 1.1.0f  25 May 2017
allocator: tcmalloc
modules: none
build environment:
    distmod: debian92
    distarch: x86_64
    target_arch: x86_64

- Step three, run the dumpster

$ dumpster --version
4.0.0

$sudo nohup dumpster enwiki-20180920-pages-articles.xml --verbose_skip true 2>&1 &

uname -a
Linux wiki-processor-vm-t 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
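
Once the run ends, the final tally can be pulled straight from the log; the 3.6.1 runs above printed a '-- final count is N pages --' line, so assuming v4.0.0 does the same:

$ grep -i 'final count' nohup.out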

@spencermountain here are some numbers about the skip notifications.

Number of notifications with the keyword "skip":

 sudo grep skip nohup.out | wc -l 
5023472

Number of unique notifications with the phrase 'skipping redirect:'

sudo grep 'skipping redirect:' nohup.out | sort | uniq | wc -l 
3990517

Number of unique notifications with the phrase 'skipping disambiguation:'

 sudo grep 'skipping disambiguation:' nohup.out | sort | uniq | wc -l 
137023
spencermountain commented 6 years ago

hey @aymansalama sorry that you're still having trouble. Sounds like we need some better logging to figure out what's going wrong. 3.9m redirects seems like a lot; I'll re-run it on my machine and see how many I get there. Otherwise, what you're doing sounds right. I got 5m+ pages on my local linux box this week, so maybe it's an artifact of the google cloud thing.

you've tried it without nohup, right?

I'll look at adding counts for the skip logic, per worker, so we can get some insight into what's going on there. it may be the environment, some newline encoding, or something about communication between workers on google-cloud. we'll get it.
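
In the meantime, a rough consistency check might narrow it down: the loaded article count plus the skipped redirects and disambiguation pages should come reasonably close to the number of <page> entries in the XML, and a big shortfall would mean pages are being dropped without any skip message. A sketch (the enwiki database and pages collection names are assumptions and may differ on your setup; grep -c counts duplicate lines too, which is fine for a rough check):

PAGES=$(grep -c '<page>' enwiki-20180920-pages-articles.xml)
LOADED=$(mongo enwiki --quiet --eval 'db.pages.count()')
REDIRECTS=$(grep -c 'skipping redirect:' nohup.out)
DISAMBIG=$(grep -c 'skipping disambiguation:' nohup.out)
echo "$PAGES pages in the dump vs $((LOADED + REDIRECTS + DISAMBIG)) accounted for"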

spencermountain commented 6 years ago

hey, I've added per-worker logging information about redirects, disambig pages, and namespace-skipping, every 20 seconds. Please check it out on 4.0.1.

aymansalama commented 6 years ago

@spencermountain thank you so much for the dedication.

Q: you've tried it without nohup, right?
A: Yes, I tried without nohup. I have done 8 trials with different combinations. I will try the new code on another machine and will let you know soon.

spencermountain commented 5 years ago

hey @aymansalama you ever get this to work?

aymansalama commented 5 years ago

@spencermountain unfortunately I am putting progress here on hold for a while, as I am fully engaged in several other tasks. I am planning to be back within two weeks, after my delivery deadlines on the other tasks. Sorry for being slow.

spencermountain commented 5 years ago

no sweat. Closing this issue for now. Good luck!

aymansalama commented 5 years ago

Sure, no worries. Good luck, and thanks for the hard work!