hahaha, that's awesome. what operating system are you on? they all have different rules about this. i can put an if statement that skips these long filenames, with some kinda flag in the config.
For now, you can add your own filter, like:
doPage: function(doc){ return doc.title().length<500},
Good catch!
yeah, I'll add a default-on skip for anything > 255 characters, via this. cheers
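(In the meantime, a filter along those lines, using the doPage hook shown above, might look like the sketch below; the 255 figure matches the common Linux limit of 255 bytes per filename, so any title at or over that length is skipped. This is an illustration, not the built-in flag.)

doPage: function(doc){
  // skip any page whose title is 255 characters or longer,
  // since it would overflow a typical filesystem filename limit
  return doc.title().length < 255
},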
Hi @spencermountain,
Thank you for the response.
I am using Ubuntu 18 as my operating system, with Node 16 (via nvm).
When running the script above on the Wikipedia XML files, extraction starts, but after running for numerous hours last night it still had not finished.
The output looks like this in the terminal:
604,692 337,024 75,830 96,733 37,321 2,170 6,970 361,617 52,239 21,972 36,386 38,809
604,692 337,024 75,830 96,733 37,321 2,170 6,970 361,617 52,239 21,972 36,386 38,809
604,692 337,024 75,830 96,733 37,321 2,170 6,970 361,617 52,239 21,972 36,386 38,809
(the same counts repeat unchanged on every following line)
It seems to freeze after a certain amount of time. Around 1.695 million txt files were processed before it stalled.
I greatly appreciate all of your help!
i reproduced this on an old linux box last night. Will take a look at what's going on. thanks
I appreciate the help!
The only other thing to mention: upon manual inspection, it seems some edge cases may be slipping through wtf_wikipedia's parsing. For example, here is the extracted text for the Afrikaans page "3de Dinastie van Antieke Egipte" (the 3rd Dynasty of Ancient Egypt); note the raw table markup and stray braces that come through:
Die 3de Dinastie van Antieke Egipte (ook Dinastie III geskryf) was die eerste dinastie van die Ou Ryk. Dit het geduur van omstreeks 2667 tot 2610 v.C. Dit het gevolg op die onstuimige jare van die 2de Dinastie.
Die 4de, 5de en 6de Dinastie word gewoonlik saam met die 3de Dinastie die Ou Ryk genoem, ook bekend as die tyd van die piramides. Die hoofstad in dié tyd was Memphis.
Die bekendste heersers van die 3de Dinastie was:
! scope=col width="15%" | Farao ! scope=col width="20%" | Troonnaam ! scope=col width="20%" | Bewind ! scope=col width="15%" | Hoofstad ! scope=col width="35%" | Graftombe
* Djoser
* Netjerichet
* 2667-2648 v.C.
* Memphis
* Saqqara
* Djoserty
* Sechemchet
* 2648-2640 v.C.
* Memphis
* Saqqara
* Nebka
* Sanacht
* 2640-2631 v.C.
* Memphis
* Abydos ?
* Teti
* Chaba
* 2631-2628 v.C.
* Memphis
* Zawjet el-Arjan
* Hoeni
* Qahedjet ?
* 2628-2610 v.C.
* Memphis
* Meidoem ?
* }
I just wanted to confirm that this is the right way to extract the text from the pages:
libPath: 'wtf_wikipedia', // (default)
doPage: function(doc){ return true}, // (default)
parse: function(doc){return doc.text()},
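(For reference, a complete call wired up with those options might look like the sketch below. The require name and the dip(xmlPath, opts) signature are assumed from the dumpster-dip README, and the dump filename is a placeholder.)

const dip = require('dumpster-dip')

const opts = {
  libPath: 'wtf_wikipedia',                    // parser library (default)
  doPage: function(doc){ return true },        // keep every page (default)
  parse: function(doc){ return doc.text() },   // write plain text for each page
}

// point it at a decompressed wikipedia xml dump
dip('./enwiki-latest-pages-articles.xml', opts).then(function(){
  console.log('done')
})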
Thank you,
Enrico
hey Enrico, got a fix for both of these on dev; should have a release in the next few days. cheers
both should be fixed now, in 2.0.0
cheers!
Hello,
Thank you for your awesome work in relation to dumpster-dive and dumpster-dip.
When running text extraction on enwiki:
I am getting this error:
I am wondering if you have any input on why this would occur?
Also, should doc.json() or only doc.text() be used to extract all of the page text?
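(For reference, the two return different things in wtf_wikipedia: doc.text() gives plain prose with the markup stripped, while doc.json() gives structured data with sections, sentences, links and so on. A minimal sketch; the inline wikitext string is only an illustration.)

const wtf = require('wtf_wikipedia')

const doc = wtf("[[Ubuntu]] is a '''Linux''' distribution.")

// plain prose, wiki markup stripped: roughly 'Ubuntu is a Linux distribution.'
console.log(doc.text())

// structured data: sections, sentences, links, etc.
console.log(doc.json())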