spencermountain / dumpster-dive

roll a wikipedia dump into mongo

FATAL ERROR of memory while index #3

Closed alemol closed 9 years ago

alemol commented 9 years ago

Hello Spencer, I am trying to load the Spanish Wikipedia and I got this error (twice). What can I do to finish the process?

$ node index.js eswiki-latest-pages-articles.xml
Andorra
Argentina
Geografía de Andorra
Demografía de Andorra
Comunicaciones de Andorra
Artes visuales
Agricultura
Astronomía galáctica
ASCII
Arquitectura
Anoeta
Ana María Matute
Agujero negro
Antropología
Anarquía
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Abort trap: 6

Thanks in advance

spencermountain commented 9 years ago

hey Alejandro, thanks for this. I actually think this was fixed yesterday. can you update to v0.1.10 and try it again? this works for me:

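var wtf_wikipedia = require("wtf_wikipedia")

// fetch the Spanish article from the API and print the parsed result
wtf_wikipedia.from_api("Anarquía", "es", function (s) {
  console.log(JSON.stringify(wtf_wikipedia.parse(s), null, 2))
})

alemol commented 9 years ago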

After updating wtf_wikipedia it seems to be working (I'll let you know at the end of the process).

alemol commented 9 years ago

It crashed at this point:

$ node index.js eswiki-latest-pages-articles.xml
...
El deseo (telenovela)

/Users/amolina/onto/node_modules/wtf_wikipedia/src/index.js:115
translations[lang] = s.match(/^\[\[([a-z][a-z]):(.*?)\]\]/i)[2]
^
TypeError: Cannot read property '2' of null
    at /Users/amolina/onto/node_modules/wtf_wikipedia/src/index.js:115:71
    at Array.forEach (native)
    at Object.main [as parse]
    at XmlStream.<anonymous> (/Users/amolina/onto/index.js:28:26)
    at XmlStream.emit (events.js:118:17)
    at fn (/Users/amolina/onto/node_modules/xml-stream/lib/xml-stream.js:132:14)
    at FiniteAutomata.run (/Users/amolina/onto/node_modules/xml-stream/lib/finite-automata.js:32:19)
    at FiniteAutomata.leave (/Users/amolina/onto/node_modules/xml-stream/lib/finite-automata.js:85:7)
    at null.<anonymous> (/Users/amolina/onto/node_modules/xml-stream/lib/xml-stream.js:434:8)
    at emit (events.js:107:17)

I think a fix to the regular expression, or a check on its result, could help.
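For illustration, the TypeError above happens because `String.prototype.match` returns `null` when a string does not match, and the code indexes `[2]` on that result directly. Here is a minimal sketch of a guard, assuming the regex was meant to match interlanguage links like `[[fr:Anarchie]]`. The helper name `parseInterlanguageLink` and the sample inputs are hypothetical, and this is not necessarily how the library itself handles it:

```javascript
// Hypothetical helper, not part of wtf_wikipedia: parse an interlanguage
// link such as "[[fr:Anarchie]]" and return null instead of throwing
// when the string does not have that shape.
function parseInterlanguageLink(s) {
  var m = s.match(/^\[\[([a-z][a-z]):(.*?)\]\]/i);
  if (m === null) {
    return null; // no match: let the caller skip this string
  }
  return { lang: m[1].toLowerCase(), title: m[2] };
}

// Made-up sample inputs: only the first one is a two-letter language link.
var translations = {};
["[[fr:Anarchie]]", "[[Categoría:Telenovelas]]", "not a link"].forEach(function (s) {
  var link = parseInterlanguageLink(s);
  if (link !== null) {
    translations[link.lang] = link.title;
  }
});
console.log(translations);
```

With a check like this, a page containing an unexpected link format would be skipped rather than aborting a multi-hour import.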

spencermountain commented 9 years ago

thanks man, sorry about that. I have never run it through the eswikipedia. fixed this in v0.1.11, please let me know if you run into anything else.

p.s. i also fixed an issue i noticed on that page, where the infobox 'the fisha' was being parsed, but not removed from the text. This may help in other spanish articles too. cheers

alemol commented 9 years ago

I confirm: the new version works for Spanish:

$ node index.js eswiki-latest-pages-articles.xml
...
(like three hours later)
=================done========

Thank you, Spencer.

spencermountain commented 9 years ago

boom!