idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Issue creating corpus - Invalid byte 2 of 4-byte UTF-8 sequence #36

Open ArthurCamara opened 7 years ago

ArthurCamara commented 7 years ago

I'm trying to manually create a corpus using the following command:

java -Xmx10G -Xms10G -cp target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki working/enwiki-latest-pages-articles-multistream.xml.bz2 /mnt/hd0/Arthur/data/en-wiki-latest.lines

resulting in the following error:

[Fatal Error] :965698439:106: Invalid byte 2 of 4-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 965698439; columnNumber: 106; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
    at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanName(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    ... 5 more

I'm using the latest Wikipedia dump, and the sha1sum matches.

Any idea what could be causing this?

jind11 commented 6 years ago

I am having a similar problem.

dav009 commented 6 years ago

@jind just curious, is this with the English Wikipedia?

jind11 commented 6 years ago

yes

tgalery commented 6 years ago

I haven't had much time to dig into this, but here's a question: do you get the same error with older English dumps, or with dumps in other languages?
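
Another check in the same diagnostic spirit is whether the decompressed dump is itself valid UTF-8. That would help separate a damaged dump from a parser-side problem. Below is a minimal Python sketch, not part of wiki2vec. The function name first_invalid_utf8_offset and the chunk size are illustrative assumptions, and the path in the usage comment is simply the one from the command above.

```python
import bz2
import codecs


def first_invalid_utf8_offset(stream, chunk_size=1 << 20):
    """Return the byte offset of the first malformed UTF-8 sequence in a
    binary stream, or None if the whole stream decodes cleanly."""
    # An incremental decoder buffers a multi-byte sequence that is split
    # across two chunks, so chunk boundaries do not cause false alarms.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    offset = 0  # number of bytes already handed to the decoder
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk
        # Bytes held back from the previous chunk (an incomplete sequence).
        pending = len(decoder.getstate()[0])
        try:
            decoder.decode(chunk, final=final)
        except UnicodeDecodeError as err:
            # err.start is relative to (pending bytes + current chunk).
            return offset - pending + err.start
        if final:
            return None
        offset += len(chunk)


# Usage sketch (path taken from the command in this issue):
#
#   with bz2.open("working/enwiki-latest-pages-articles-multistream.xml.bz2", "rb") as f:
#       print(first_invalid_utf8_offset(f))
```

If this returns None for the same file, the bytes in the dump are well-formed and the problem more likely lies in how the Java side decompresses or reads the stream. If it returns an offset, the dump itself contains a malformed sequence at that position in the decompressed data.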

sunan93 commented 6 years ago

I am also having the same issue. If anyone has come across the fix, please share it here.