Open ArthurCamara opened 7 years ago
I am having a similar problem.
@jind just curious: is this with the English Wikipedia?
yes
Haven't had much time to dig into this, but here are a couple of questions: do you get the same error with older English dumps, or with dumps for other languages?
I am also having the same issue. If anyone has come across a fix, please share it here.
I'm trying to manually create a corpus, using the following command:
java -Xmx10G -Xms10G -cp target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki working/enwiki-latest-pages-articles-multistream.xml.bz2 /mnt/hd0/Arthur/data/en-wiki-latest.lines
resulting in the following error:
[Fatal Error] :965698439:106: Invalid byte 2 of 4-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 965698439; columnNumber: 106; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
    at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanName(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    ... 5 more
I'm using the latest Wikipedia dump, and the sha1sum matches.
Any idea what could be causing this?
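One way to narrow this down, independent of wiki2vec, is to check whether the decompressed dump itself contains invalid UTF-8. Below is a rough Python sketch of that check, not part of the project. The function name `first_invalid_utf8` and the chunk size are made up for illustration, and the dump path is whatever you pass on the command line. It streams the `.bz2` file through a strict incremental UTF-8 decoder and reports the byte offset of the first bad sequence, if there is one.

```python
import bz2
import codecs
import sys


def first_invalid_utf8(stream, chunk_size=1 << 20):
    """Return (byte_offset, reason) of the first invalid UTF-8 sequence, or None.

    `stream` is any binary file-like object; the offset is counted in
    decompressed bytes from the start of the stream.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    consumed = 0  # bytes read from the stream before the current chunk
    while True:
        chunk = stream.read(chunk_size)
        # Bytes of an incomplete sequence held over from the previous chunk.
        pending = len(decoder.getstate()[0])
        try:
            # An empty chunk means end of input, so flush with final=True.
            decoder.decode(chunk, final=not chunk)
        except UnicodeDecodeError as err:
            # err.start is relative to (pending bytes + current chunk).
            return consumed - pending + err.start, err.reason
        if not chunk:
            return None
        consumed += len(chunk)


if __name__ == "__main__":
    # Hypothetical usage: python check_utf8.py path/to/dump.xml.bz2
    with bz2.open(sys.argv[1], "rb") as dump:
        print(first_invalid_utf8(dump))
```

If this prints `None`, the bytes in the dump are valid UTF-8, which would point toward how the file is being decompressed or read rather than the dump contents. If it prints an offset, that is the position in the decompressed XML to inspect. Expect a full pass over the English dump to take a long time.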