spencermountain / dumpster-dive

roll a wikipedia dump into mongo

Other feeds #84

Open SimonBurfield opened 4 years ago

SimonBurfield commented 4 years ago

Great work fixing this in 5.3.0, mate.

Importing EN now. Do you know of other feeds people use with it?

Have you ever thought about doing something like this with the CommonCrawl?

spencermountain commented 4 years ago

Thanks, ya. A number of people are running this on random external wikis, and given the streaming XML read and the multi-worker setup, I'm sure it could be applied to other datasets.
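For anyone wondering what "streaming XML read plus multiple workers" could look like on another dataset, here is a minimal sketch. It is not dumpster-dive's actual implementation: `splitRecords` and `assignToWorkers` are hypothetical helpers, the `<page>` tag is assumed to carry no attributes, and the "workers" are plain arrays standing in for real processes.

```javascript
// Hypothetical helper (not dumpster-dive's API): pull complete
// <page>...</page> records out of text that arrives in arbitrary chunks,
// without ever holding the whole file in memory.
function* splitRecords(chunks, tag = 'page') {
  const open = `<${tag}>`; // assumes the tag has no attributes
  const close = `</${tag}>`;
  let buffer = '';
  for (const chunk of chunks) {
    buffer += chunk;
    while (true) {
      const start = buffer.indexOf(open);
      if (start === -1) {
        // No record started yet: keep only a tail long enough to
        // hold an opening tag that straddles two chunks.
        buffer = buffer.slice(-(open.length - 1));
        break;
      }
      const end = buffer.indexOf(close, start);
      if (end === -1) {
        // Record is incomplete: keep it and wait for the next chunk.
        buffer = buffer.slice(start);
        break;
      }
      yield buffer.slice(start, end + close.length);
      buffer = buffer.slice(end + close.length);
    }
  }
}

// Hypothetical helper: hand records out round-robin to N queues.
// In a real setup each queue would be a separate worker process.
function assignToWorkers(records, workerCount) {
  const queues = Array.from({ length: workerCount }, () => []);
  let i = 0;
  for (const record of records) {
    queues[i % workerCount].push(record);
    i += 1;
  }
  return queues;
}

// Usage: three chunks that cut through the middle of tags.
const chunks = [
  '<mediawiki><page><title>A</title></pa',
  'ge><page><title>B</title></page><pa',
  'ge><title>C</title></page></mediawiki>',
];
const queues = assignToWorkers(splitRecords(chunks), 2);
console.log(queues.map((q) => q.length)); // [ 2, 1 ]
```

Swapping the `tag` argument is the only change needed to point the splitter at a different record element, which is roughly why the approach should carry over to other XML feeds.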

Haven't looked at CommonCrawl in a long time, glad they're still around. 280TB is too big for my laptop! Happy to help if you're inclined to get it working. Cheers