spencermountain / dumpster-dive

roll a wikipedia dump into mongo

Not all pages are loaded into the afwiki database #89

Closed vyank closed 4 years ago

vyank commented 4 years ago

I ran the code with the command: node ./bin/dumpster.js full_path_to_afwiki-latest-pages-articles.xml

In total, 117,650 pages loaded in a few minutes.

I found that a few pages were not loaded into the database, e.g. Aapstertsteekgras.

I have copied the data into XML files, as below. I could not understand why this page and a few others were not loaded into the database. Not sure if I missed any option when running the command.

Please help if you know the reason. Thank you.

The Aapstertsteekgras XML fragment is attached below:

Aapstertsteekgras.txt

spencermountain commented 4 years ago

hi vyank, thanks for the good issue. I just checked this text on the article parser, and it seems good. I'll download the af-dump and take a look at why the article doesn't load. cheers

vyank commented 4 years ago

Hey spencermountain, thanks for the reply. I dug more into this and what I saw is the error below:

Error: key bls. must not contain '.'
    at serializeInto (\node_modules\bson\lib\bson\parser\serializer.js:913:19)
    at serializeObject (\node_modules\bson\lib\bson\parser\serializer.js:347:18)
    at serializeInto (\node_modules\bson\lib\bson\parser\serializer.js:727:17)
    at serializeObject (\node_modules\bson\lib\bson\parser\serializer.js:347:18)
    at serializeInto (\node_modules\bson\lib\bson\parser\serializer.js:937:17)
    at serializeObject (\node_modules\bson\lib\bson\parser\serializer.js:347:18)
    at serializeInto (\node_modules\bson\lib\bson\parser\serializer.js:727:17)
    at serializeObject (\node_modules\bson\lib\bson\parser\serializer.js:347:18)
    at serializeInto (\node_modules\bson\lib\bson\parser\serializer.js:937:17)
    at serializeObject (\node_modules\bson\lib\bson\parser\serializer.js:347:18)

I printed the titles and dug further into 03-write-db.js, and found that a few batches of pages (batch size of 500) are not written into the database because of the above error. Even though only a few pages had this issue, all pages in those batches are missed.

I guess wtf is returning JSON for a few pages where one of the keys contains a dot (.), and that is where the problem is.
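
For illustration, a minimal sketch of one possible workaround, assuming you rewrite offending keys before handing documents to MongoDB (the helper name sanitizeKeys is hypothetical, not part of dumpster-dive, and the replacement character is an arbitrary choice):

    // Hypothetical helper: recursively rewrite object keys so they contain
    // no '.', which BSON forbids in key names.
    function sanitizeKeys(obj) {
      if (Array.isArray(obj)) {
        return obj.map(sanitizeKeys)
      }
      if (obj !== null && typeof obj === 'object') {
        let out = {}
        Object.keys(obj).forEach((k) => {
          out[k.replace(/\./g, '_')] = sanitizeKeys(obj[k])
        })
        return out
      }
      return obj
    }

    // e.g. sanitizeKeys({ 'bls.': '12-14' }) => { 'bls_': '12-14' }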

Hope this helps. If I find out more, I will update you.

spencermountain commented 4 years ago

ahh thank you. I am happy to fix this, this week.

vyank commented 4 years ago

Instead of insertMany I ran insertOne, and found that the pages below were failing because of a dot in a key:

Amritsar-slagting
Oz (TV-reeks)
Cyrildene
Kousale gelaagde analise
Driehoek-sterrestelsel

An example page is attached: Amritsar-slagting1.txt

You can see that there is a key bls. at the bottom of the file.

Because of this, entire batches were failing. I thought the whole purpose of insertMany with ordered = false was to skip only the failed pages, but it is skipping the whole batch. (Presumably the dot-in-key check happens client-side, during BSON serialization, so the call throws before ordered = false can take effect.)
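
For illustration, a sketch of the per-document fallback described above, assuming a connected collection and an array pages of parsed documents:

    // Sketch: try the fast bulk insert first; if the batch throws (e.g. a
    // BSON serialization error for a dotted key), retry one document at a
    // time so only the bad pages are skipped.
    async function insertBatch(collection, pages) {
      try {
        await collection.insertMany(pages, { ordered: false })
      } catch (e) {
        for (let page of pages) {
          try {
            await collection.insertOne(page)
          } catch (err) {
            console.error('skipped page:', page.title, '-', err.message)
          }
        }
      }
    }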

Hope this helps.

spencermountain commented 4 years ago

gonna try and squeeze this in tomorrow! thanks for your patience

spencermountain commented 4 years ago

hey @vyank this should be fixed now, in 5.4.0. please let me know if you see any other issues. thanks!

vyank commented 4 years ago

Thank you @spencermountain . Appreciate it. I will check it out.