languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Problems with last nightly – LibreOffice + Wikipedia #6223

Open marcoagpinto opened 2 years ago

marcoagpinto commented 2 years ago

Hello!

When I installed the latest nightly .oxt in LibreOffice 7.3.0.1, LO complained after the restart that a document was missing.

I tried twice, and it happened both times.

I fixed it by removing the document from the LibreOffice panel that shows all recent documents.

Then I tried the Wikipedia tool, and it does not work; it fails with an exception:

C:\Users\marco\Desktop\LanguageTool-wikipedia-20220107-snapshot\LanguageTool-wikipedia-5.7-SNAPSHOT>java -Dfile.encoding=UTF-8 -Xmx4500M -jar languagetool-wikipedia.jar check-data -l pt-PT -r VERB_COMMA_CONJUNCTION -f ptwiki-20121230-pages-articles.xml -f tatoeba-pt.txt --max-sentences 600000 --context-size 100 >before.txt
Exception in thread "main" java.lang.RuntimeException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[15026326,133]
Message: XML document structures must start and end within the same entity.
        at org.languagetool.dev.dumpcheck.WikipediaSentenceSource.hasNext(WikipediaSentenceSource.java:85)
        at org.languagetool.dev.dumpcheck.MixingSentenceSource.hasNext(MixingSentenceSource.java:77)
        at org.languagetool.dev.dumpcheck.SentenceSourceChecker.run(SentenceSourceChecker.java:239)
        at org.languagetool.dev.dumpcheck.SentenceSourceChecker.main(SentenceSourceChecker.java:74)
        at org.languagetool.dev.wikipedia.Main.main(Main.java:45)
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[15026326,133]
Message: XML document structures must start and end within the same entity.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
        at org.languagetool.dev.dumpcheck.WikipediaSentenceSource.handleTextElement(WikipediaSentenceSource.java:148)
        at org.languagetool.dev.dumpcheck.WikipediaSentenceSource.fillSentences(WikipediaSentenceSource.java:131)
        at org.languagetool.dev.dumpcheck.WikipediaSentenceSource.hasNext(WikipediaSentenceSource.java:83)
        ... 4 more
marcoagpinto commented 2 years ago

Ahhhh… the two files I am testing against are the ones provided by Daniel Naber years ago, since they have more than 600 000 sentences.

I now test against them first, and then against the file Daniel provided a few days ago, which has only about 460 000 sentences.
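
The "XML document structures must start and end within the same entity" message in the trace above is what the Java XML parser typically reports when a file ends while elements are still open, so one thing worth ruling out is a truncated or damaged ptwiki dump. Below is a minimal sketch for checking that independently of LanguageTool, assuming Python 3 is available; the helper name `first_xml_error` is made up for illustration, and the sketch only tests well-formedness, not whether the dump content is what the Wikipedia tool expects.

```python
import xml.sax


def first_xml_error(path):
    """Stream through an XML file with a no-op SAX handler.

    Returns None if the file is well-formed, otherwise a
    (line, column, message) tuple for the first fatal parse error.
    """
    try:
        with open(path, "rb") as handle:
            # ContentHandler() ignores all events, so memory use stays flat
            # even for a multi-gigabyte dump.
            xml.sax.parse(handle, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return err.getLineNumber(), err.getColumnNumber(), err.getMessage()
    return None
```

Calling `first_xml_error("ptwiki-20121230-pages-articles.xml")` should return `None` for an intact dump. If it returns a position near the row reported in the Java trace, the file itself is probably the problem rather than the nightly build.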