yeah, this is weird. This is on a google-cloud env? my guess, Ayman, is that it has something to do with the html output. That may be why others are getting more articles. Have you tried leaving the html param out? Worth trying.
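For a quick A/B test, something like this should tell us (a sketch only; it assumes the html output is toggled with an `--html` flag, in the same style as the other CLI flags in this thread):

```sh
# run once with html output on and once with the flag left out entirely,
# then compare the final page counts that dumpster reports
dumpster enwiki-latest-pages-articles.xml --html=true --workers=10
dumpster enwiki-latest-pages-articles.xml --workers=10
```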
new release tomorrow, hopefully
@spencermountain yes, it is a google cloud env. I ran this experiment yesterday; here are the results without html.

Trial 5
1. Date: 23 Sept 2018
2. Database used: `mongod --fork --syslog --dbpath /data/mongo-3-6-trial5/`
3. Command used: `sudo nohup dumpster enwiki-latest-pages-articles.xml --citations=false --workers=10 &`
I have downloaded a new enwiki dump from 20 Sept and am trying again, as below.

Trial 6
1. Date: 24 Sept 2018
2. Database used: `mongod --fork --syslog --dbpath /data/mongo-3-6-trial6/`
3. Command used: `dumpster enwiki-20180920-pages-articles.xml --batch_size 100 &`
4. dumpster-dive version: 3.6.1

I will let you know the results.
@spencermountain I have tried again, this time with no parameters except batch size. I didn't use nohup and left the session running for 3 hours. Unfortunately, same result: 2.5m articles loaded.
I will wait for your new update.
Trial 7
1. Date: 24 Sept 2018
2. Database used: `mongod --fork --syslog --dbpath /data/mongo-3-6-trial7/`
3. Command used: `dumpster enwiki-20180920-pages-articles.xml --batch_size 100 &`
4. dumpster-dive version: 3.6.1
5. Number of uploaded pages: `👍 closing down. -- final count is 2,576,233 pages -- took 3.1 hours 🎉`
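(That number can also be double-checked independently of dumpster's own log by counting documents straight from the mongo shell. A sketch, assuming the database is named `enwiki`; substitute whatever name dumpster-dive actually created:)

```sh
# list each collection in the import database with its document count
# (the database name "enwiki" is an assumption; use whatever dumpster-dive created)
mongo enwiki --quiet --eval 'db.getCollectionNames().forEach(function (c) { print(c + ": " + db.getCollection(c).count()); })'
```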
hey @aymansalama wanna test this on dumpster-dive v4.0.0?
a lot has changed, so who knows.
I've also added verbose_skip as an option, so you can log the titles of the pages that are being skipped as redirects or disambig pages.
@spencermountain I can't thank you enough for your hard work.
Regarding the count: still around 2.5m articles. Unfortunately, it seems that I am doing something fundamentally wrong, as I am getting essentially the same count: `👍 closing down. -- final count is 2,496,586 pages -- took 118.1 minutes 🎉`
I am not sure what I am doing wrong. Here are my exact steps. @spencermountain please let me know if you can help.
- Step one, download enwiki as below
`$ sudo nohup wget https://dumps.wikimedia.org/enwiki/20180920/enwiki-20180920-pages-articles.xml.bz2 &`
Get the size of the downloaded, unzipped file:
`$ du -sh enwiki-20180920-pages-articles.xml`
65G enwiki-20180920-pages-articles.xml
- Step two, run a fresh mongo
`$ mongod --fork --syslog --dbpath /data/mongospace/mongo-4-trail1/`
`$ mongo --version`
MongoDB shell version v4.0.2
git version: fc1573ba18aee42f97a3bb13b67af7d837826b47
OpenSSL version: OpenSSL 1.1.0f 25 May 2017
allocator: tcmalloc
modules: none
build environment:
distmod: debian92
distarch: x86_64
target_arch: x86_64
- Step three, run the dumpster
`$ dumpster --version`
4.0.0
`$ sudo nohup dumpster enwiki-20180920-pages-articles.xml --verbose_skip true 2>&1 &`
`$ uname -a`
Linux wiki-processor-vm-t 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
@spencermountain here are some numbers about the skipping notifications.
Number of notifications with the keyword "skip":
`sudo grep skip nohup.out | wc -l`
5023472
Number of unique notifications with 'skipping redirect:':
`sudo grep 'skipping redirect:' nohup.out | sort | uniq | wc -l`
3990517
Number of unique notifications with 'skipping disambiguation:':
`sudo grep 'skipping disambiguation:' nohup.out | sort | uniq | wc -l`
137023
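(For reference, the same breakdown can be produced in a single pass over the log; a sketch, assuming every skip line starts with "skipping <reason>:" as in the greps above:)

```sh
# count how many skip lines of each kind appear in the log
sudo grep -o 'skipping [a-z]*:' nohup.out | sort | uniq -c
```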
hey @aymansalama sorry that you're still having trouble. Sounds like we need some better logging to figure out what's going wrong. 3.9m redirects seems like a lot; I'll re-run it on my machine and see how many I get there. Otherwise, what you're doing sounds right. I got 5m+ pages on my local linux box this week, so maybe it's an artifact of the google-cloud thing.
you've tried it without nohup, right?
I'll look at adding counts for the skip logic, per worker, so we can get some insight into what's going on there. It may be the environment, some newline encoding, or something about communication between workers on google-cloud. We'll get it.
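In the meantime, one thing worth ruling out on your end is a truncated download or unzip, since a cut-off XML would also give a low final count. A rough sanity check, using plain bzip2/grep and nothing dumpster-specific:

```sh
# test the archive's integrity before unzipping (slow on a file this big, but thorough)
bzip2 -tv enwiki-20180920-pages-articles.xml.bz2

# count the <page> elements in the unzipped dump; this total includes redirects,
# templates, etc., so it should be far larger than the 2.5m pages being imported
grep -c '<page>' enwiki-20180920-pages-articles.xml
```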
hey, i've added per-worker logging of redirects, disambig pages, and namespace skips, every 20 seconds.
please check it out on 4.0.1
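If you installed the CLI globally through npm, the upgrade should just be the following (assuming a standard global install; adjust if you set it up differently):

```sh
# pull the new release and confirm the version before re-running the import
sudo npm install -g dumpster-dive@4.0.1
dumpster --version
```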
@spencermountain thank you so much for the dedication.
Q: you've tried it without nohup, right?
A: Yes, I tried without nohup. I have run 8 trials with different combinations. I will try the new code on another machine and will let you know soon.
hey @aymansalama you ever get this to work?
@spencermountain unfortunately I am putting this on hold for a while, as I am fully engaged in several other tasks. I am planning to be back here within two weeks, after my delivery deadlines on the other tasks. Sorry for being slow.
no sweat. closing this issue for now. good luck!
sure, no worries. Good luck and thanks for the hard work
Dumpster-dive version 3.1.0: the run stopped at a page count of 186,441; the script finished with no error.
Dumpster-dive version 3.6.1: the script finished and the page count is 1,274,403, with no error.
The number of pages is supposed to be 5M+.
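(For the expected total, the current article count can be pulled from the MediaWiki statistics endpoint; note it reflects today's wiki, so it will be a bit higher than the 2018-09-20 dump:)

```sh
# fetch English Wikipedia's site statistics, including the current article count
curl 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json'
```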