spencermountain / dumpster-dive

roll a wikipedia dump into mongo

Default options when not using `bin/dumpster.js` #95

Open shreyasminocha opened 3 years ago

shreyasminocha commented 3 years ago
  • disambiguation pages / redirects `--skip_disambig`, `--skip_redirects`: by default, dumpster skips entries in the dump that aren't full-on articles, you can
```js
let obj = {
  file: './path/enwiki-latest-pages-articles.xml.bz2',
  db: 'enwiki',
  skip_redirects: false,
  skip_disambig: false
};
dumpster(obj, () => console.log('done!'));
```

I'm not sure if this is unintentional or if the docs are misleading, but the default options are applied only when invoking the dumpster bin script, and not when it's imported and used in a script like in the example above. So the snippet I quoted is identical to:

```js
let obj = {
  file: './path/enwiki-latest-pages-articles.xml.bz2',
  db: 'enwiki'
};
dumpster(obj, () => console.log('done!'));
```

…and skipping redirects and disambiguation pages requires an explicit `skip_redirects: true, skip_disambig: true`.

I'm guessing this is also true of the other default options.
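Until that's sorted out, one possible workaround is to merge your options over your own copy of the defaults before calling `dumpster`. This is only a rough sketch: `CLI_LIKE_DEFAULTS` and `withDefaults` are made-up names, not part of dumpster-dive, and the only default values taken from this thread are the two `skip_*` flags. Any other defaults would have to be copied from the bin script by hand.

```js
// Hypothetical defaults object, NOT exported by dumpster-dive.
// Only these two values are taken from this thread; anything else
// the bin script sets would need to be added here manually.
const CLI_LIKE_DEFAULTS = {
  skip_redirects: true,
  skip_disambig: true
};

// Hypothetical helper: caller-supplied keys override the defaults.
function withDefaults(options = {}) {
  return { ...CLI_LIKE_DEFAULTS, ...options };
}

const obj = withDefaults({
  file: './path/enwiki-latest-pages-articles.xml.bz2',
  db: 'enwiki'
});

// Then pass the merged object along as in the README example:
// dumpster(obj, () => console.log('done!'));
```

Passing `skip_redirects: false` to `withDefaults` still turns the skip off, since explicit keys win over the defaults.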

spencermountain commented 3 years ago

thanks @shreyasminocha. yeah, you're right - the argv stuff is a mess and should be cleaned up. don't have a free afternoon now, but will mark it as a bug. PRs welcome. cheers