spencermountain / dumpster-dive

roll a wikipedia dump into mongo

allow upserting mongo docs on re-runs #116

Closed einSelbst closed 3 months ago

einSelbst commented 3 months ago

I saw that the Readme says an 'upsert' is used on multiple runs, but when I looked at the code it seemed not: it uses `insertMany`. So I tried to verify by downloading simplewiki and running dumpster-dive on it, which gave me 247004 documents.

On a new run, this time with --plaintext=true, all entries were recognized as duplicates (existing), but not updated. After deleting one entry from the collection and re-running

dumpster simplewiki-20240501-pages-articles.xml --plaintext=true

this time the one entry was recreated with the plaintext field.

So I had a look into mongo and they have 'bulk ops', https://www.mongodb.com/docs/manual/core/bulk-write-operations/

Btw, insertMany uses bulkWrite under the hood, so they say. bulkWrite also supports updates etc., so with some help from chatgpt I updated the code to use bulkWrite.
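For illustration, here is a minimal sketch of what upserting by `_id` through bulkWrite could look like. It is not the code from this PR: `toUpsertOps` is a hypothetical helper and the sample documents are made up. It assumes each parsed page carries its `_id`, and it only builds the operations array, so it runs without a database.

```javascript
// Sketch: turn a batch of parsed pages into upsert operations for bulkWrite.
// `toUpsertOps` is a hypothetical helper, not part of dumpster-dive.
function toUpsertOps(docs) {
  return docs.map((doc) => {
    // Keep _id out of $set: it is matched in the filter, and on insert
    // MongoDB takes it from the equality filter.
    const { _id, ...fields } = doc
    return {
      updateOne: {
        filter: { _id },
        update: { $set: fields },
        upsert: true, // insert when no document with this _id exists
      },
    }
  })
}

// Made-up sample pages, only for demonstration.
const ops = toUpsertOps([
  { _id: 'Toronto', title: 'Toronto', plaintext: 'Toronto is a city.' },
  { _id: 'Montreal', title: 'Montreal' },
])
console.log(JSON.stringify(ops[0]))

// With a connected driver collection, the batch would be sent as:
//   await collection.bulkWrite(ops, { ordered: false })
```

Using `$set` rather than replacing the whole document means a re-run with `--plaintext=true` would add the new field to existing entries and leave their other fields in place.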

In my few local runs it even seems that bulkWrite is faster than insertMany on the initial creation.

[Screenshot 2024-06-02 at 20 48 19]

Btw, there is also a commit which adds a missing 'L' to the docs when using 'lbzip2', which I forgot in the other pr, sorry.

Regarding your question: "Is there a way to add this as a node dependency, so we can add it to the library??"

It seems the only way to do this would be to wrap the lbzip2 lib in a node package. If you think that would be worth the effort I might give it a try.

A few more questions:

What do you think about mentioning https://github.com/huggingface/Mongoku in the docs? It's very convenient to explore the data in the mongodb.

What do you think about mentioning/adding info regarding indexing?

It seems dumpster-dip has the more recent code. Do you envision dumpster-dive working the same way regarding e.g. the way a local wtf can be used? I might port it then.

Anything else you think would be important for this project to do? Not sure how long I'll work with it but until then I'm happy to contribute.

spencermountain commented 3 months ago

hey cool!! thank you! Yeah, I trust you regarding the change to bulkWrite. I will test it this weekend, before doing a release. That's a terrific help.

Yeah, as you mentioned, it should be able to run over an existing table, and update it to the current one, without creating any duplicate rows. I haven't tested this, but assume it was just by the row _id. Happy for any help you can provide on this. Please let me know if you test it on other sizes of wikipedia, or other OSes. Mongo can run out of memory, or be weird sometimes.

Yep - I did some work on a personal project this summer using dumpster-dip, and pushed it ahead of this one. The projects seemed different-enough, although there is some overlap. I plan to maintain them both. dumpster-dip now has a wizard-thing via npx script, and maybe this library should too. Welcome to changes on either, or both.

ya - if it's possible to get lbzip2 safely into npm via node-gyp, or anything else, that would be amazing. I've never done anything like that. It will need to be cross-platform, or have a clean fallback. What a neat idea. Please let me know if I can help, in any way.

cheers!

spencermountain commented 3 months ago

oh, and I didn't know about Mongoku - and I am far from an expert on indexing and the like. Any additions to the docs, or the wiki, are very welcome.

spencermountain commented 2 months ago

didn't get to testing this on the weekend, but plan to pair it with a wtf release, this weekend. cheers