spencermountain / dumpster-dip

parse a wikipedia dump into tiny files
MIT License

Extracting Wikipedia Text #4

Closed · conceptofmind closed this 10 months ago

conceptofmind commented 10 months ago

Hello,

Thank you for your awesome work on dumpster-dive and dumpster-dip.

When running text extraction on enwiki:

import dip from 'dumpster-dip'

const opts = {
  input: "./enwiki-latest-pages-articles.xml",
  // directory for all our new files
  outputDir: './enwiki/',
  // how we should write the results
  outputMode: 'flat',

  // which wikipedia namespaces to handle (null will do all)
  namespace: 0, //(default article namespace)
  // interval to log status
  heartbeat: 5000, // every 5 seconds

  // parse redirects, too
  redirects: false, // (default)
  // parse disambiguation pages, too
  disambiguation: true, // (default)

  // allow a custom wtf_wikipedia parsing library
  libPath: 'wtf_wikipedia', // (default)

  // should we skip this page or return something?
  doPage: function(doc){ return true}, // (default)

  // what to return, for every page
  parse: function(doc){return doc.json()}, // (default)  - avoid using an arrow-function

}

// this promise takes ~4hrs
dip(opts).then(() => {
  console.log('done!')
})

I am getting this error:

Error: ENAMETOOLONG: name too long, open '/wikipedia/enwiki/Protocol_Amending_the_Agreements%2C_Conventions_and_Protocols_on_Narcotic_Drugs_concluded_at_The_Hague_on_23_January_1912%2C_at_Geneva_on_11_February_1925_and_19_February_1925%2C_and_13_July_1931%2C_at_Bangkok_on_27_November_1931_and_at_Geneva_on_26_June_1936.txt'
    at Object.openSync (node:fs:590:3)
    at Object.writeFileSync (node:fs:2202:35)
    at writeFile (file:///home/henry/node_modules/dumpster-dip/src/output/index.js:20:6)
    at output (file:///home/henry/node_modules/dumpster-dip/src/output/index.js:37:5)
    at eachPage (file:///home/henry/node_modules/dumpster-dip/src/worker/index.js:64:5)
    at SundayDriver.each (file:///home/henry/node_modules/dumpster-dip/src/worker/01-reader.js:18:7)
    at SundayDriver.doChunk (/home/henry/node_modules/sunday-driver/src/index.js:38:8)
    at doit (/home/henry/node_modules/sunday-driver/src/index.js:77:10)
    at /home/henry/node_modules/sunday-driver/src/index.js:80:9
    at /home/henry/node_modules/sunday-driver/src/index.js:41:5 {
  errno: -36,
  syscall: 'open',
  code: 'ENAMETOOLONG',
  path: '/wikipedia/enwiki/Protocol_Amending_the_Agreements%2C_Conventions_and_Protocols_on_Narcotic_Drugs_concluded_at_The_Hague_on_23_January_1912%2C_at_Geneva_on_11_February_1925_and_19_February_1925%2C_and_13_July_1931%2C_at_Bangkok_on_27_November_1931_and_at_Geneva_on_26_June_1936.txt'
}

I am wondering if you have any insight into why this occurs.

Also, should doc.json() or just doc.text() be used to extract all of the page text?
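
For reference, a minimal standalone sketch (using wtf_wikipedia directly, outside of dumpster-dip) of the difference: .text() returns a plain-text rendering, while .json() returns the full structured document.

import wtf from 'wtf_wikipedia'

// a tiny sample of wikitext, just to compare the two output shapes
let doc = wtf('[[Toronto]] is the capital of [[Ontario]].')

console.log(doc.text())
// 'Toronto is the capital of Ontario.'  (links rendered as plain text)

console.log(doc.json())
// structured output: sections, sentences, links, etc.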

spencermountain commented 10 months ago

Hahaha, that's awesome. What operating system are you on? They all have different rules about this. I can put in an if statement that skips these long filenames, with some kind of flag in the config.

For now, you can add your own filter, like:

doPage: function(doc){ return doc.title().length<500},

Good catch!
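
If you want the cutoff to track the actual filesystem limit (255 bytes for a single filename on most Linux filesystems), a stricter sketch could measure the encoded filename instead of the raw title length. The filename construction below is an assumption based on the ENAMETOOLONG path above, not documented dumpster-dip behaviour:

doPage: function (doc) {
  // assumption: the output filename is roughly the URL-encoded title plus '.txt',
  // as seen in the error path above; most Linux filesystems cap a filename at 255 bytes
  let fileName = encodeURIComponent(doc.title().replace(/ /g, '_')) + '.txt'
  return Buffer.byteLength(fileName, 'utf8') < 255
},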

spencermountain commented 10 months ago

Yeah, I'll add a default-on skip for anything > 255 characters, via this. Cheers

conceptofmind commented 10 months ago


Hi @spencermountain,

Thank you for the response.

I am using Ubuntu 18 as my operating system and Node 16 (via nvm).

When running the script above on the Wikipedia XML file, extraction starts, but after running for many hours last night it did not finish.

The output looks like this in the terminal:

 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809
 604,692    337,024     75,830     96,733     37,321      2,170      6,970    361,617     52,239     21,972     36,386     38,809

But it seems to freeze after a certain amount of time; around 1.695 million .txt files had been written at that point.

I greatly appreciate all of your help!

spencermountain commented 10 months ago

I reproduced this on an old Linux box last night. Will take a look at what's going on. Thanks

conceptofmind commented 10 months ago

I appreciate the help!

The only other thing to mention: on manual inspection, it seems some edge cases may be slipping through wtf_wikipedia's parsing. For example, this extracted text still contains leftover wikitable markup:

Die 3de Dinastie van Antieke Egipte (ook Dinastie III geskryf) was die eerste dinastie van die Ou Ryk. Dit het geduur van omstreeks 2667 tot 2610 v.C. Dit het gevolg op die onstuimige jare van die 2de Dinastie.

Die 4de, 5de en 6de Dinastie word gewoonlik saam met die 3de Dinastie die Ou Ryk genoem, ook bekend as die tyd van die piramides. Die hoofstad in dié tyd was Memphis.

Die bekendste heersers van die 3de Dinastie was:

! scope=col width="15%" | Farao ! scope=col width="20%" | Troonnaam ! scope=col width="20%" | Bewind ! scope=col width="15%" | Hoofstad ! scope=col width="35%" | Graftombe
 * Djoser
 * Netjerichet
 * 2667-2648 v.C.
 * Memphis
 * Saqqara
 * Djoserty
 * Sechemchet
 * 2648-2640 v.C.
 * Memphis
 * Saqqara
 * Nebka
 * Sanacht
 * 2640-2631 v.C.
 * Memphis
 * Abydos ?
 * Teti
 * Chaba
 * 2631-2628 v.C.
 * Memphis
 * Zawjet el-Arjan
 * Hoeni
 * Qahedjet ?
 * 2628-2610 v.C.
 * Memphis
 * Meidoem ?
 * }
 * Qahedjet ?
 * 2628-2610 v.C.
 * Memphis
 * Meidoem ?
 * }

I just wanted to confirm that this is the right way to extract the text from the pages:

libPath: 'wtf_wikipedia', // (default)

doPage: function(doc){ return true}, // (default)

parse: function(doc){return doc.text()}, 

Thank you,

Enrico

spencermountain commented 10 months ago

Hey Enrico, got a fix for both of these on dev; should have a release in the next few days. Cheers
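
In the meantime, one possible stop-gap (just an illustrative workaround, not a dumpster-dip option) is to post-filter the obvious table-header markup inside the parse callback:

parse: function (doc) {
  // doc.text() gives the plain-text rendering; additionally drop any lines
  // that still look like leftover wikitable header markup, e.g. '! scope=col ...'
  return doc
    .text()
    .split('\n')
    .filter(function (line) { return !/^\s*!\s*scope=/.test(line) })
    .join('\n')
},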

spencermountain commented 10 months ago

Both should be fixed now, in 2.0.0. Cheers!