languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

[LO-add-on] Text analysis very slow and the log file full of errors #9240

Closed marcoagpinto closed 9 months ago

marcoagpinto commented 1 year ago

Heya, @FredKruse

I have been using the Microsoft Word 365 add-on to revise my 633-page thesis.

However, today I opened it with LibreOffice and the latest nightly extension.

It is terribly slow analysing the text, and the log file throws several errors:

CacheIO: CacheCleanUp: Remove Path from CacheMap: /C:/XXXXXXXXXXXXXXXXXXXXXXXX.odt
CacheIO: CacheCleanUp: Remove Path from CacheMap: /C:/XXXXXXXXXXXXXXX.odt
CacheIO: CacheCleanUp: Delete cache file: C:\XXXXXXXXXXXXXXXXXX\cache\LtCache1.lcz
CacheIO: CacheCleanUp: Remove Path from CacheMap: /C:/XXXXXXXXXXX.odt
MultiDocumentsHandler: getNumDoc: Document 2 created; docID = 4
Time to generate cache(5): 699750
java.lang.IndexOutOfBoundsException: Index: 4, Size: 0
    at java.util.ArrayList.rangeCheck(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at org.languagetool.openoffice.DocumentCache.getFlatParagraphNumber(DocumentCache.java:1480)
    at org.languagetool.openoffice.SingleDocument.getNextQueueEntry(SingleDocument.java:764)
    at org.languagetool.openoffice.TextLevelCheckQueue.getNextQueueEntry(TextLevelCheckQueue.java:308)
    at org.languagetool.openoffice.TextLevelCheckQueue$QueueIterator.run(TextLevelCheckQueue.java:545)

Disposing document has no content: Wait for 1000 milliseconds
com.sun.star.uno.RuntimeException: range has no mark (table?)
    at com.sun.star.bridges.jni_uno.JNI_proxy.dispatch_call(Native Method)
    at com.sun.star.bridges.jni_uno.JNI_proxy.invoke(JNI_proxy.java:185)
    at com.sun.proxy.$Proxy34.getPropertyValue(Unknown Source)
    at org.languagetool.openoffice.DocumentCursorTools.getSortedTextId(DocumentCursorTools.java:423)
    at org.languagetool.openoffice.DocumentCursorTools.getAllTextParagraphs(DocumentCursorTools.java:324)
    at org.languagetool.openoffice.DocumentCache.refreshWriterCache(DocumentCache.java:209)
    at org.languagetool.openoffice.DocumentCache.refresh(DocumentCache.java:173)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:277)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:175)
    at org.languagetool.openoffice.MultiDocumentsHandler.getCheckResults(MultiDocumentsHandler.java:242)
    at org.languagetool.openoffice.MultiDocumentsHandler.doProofreading(MultiDocumentsHandler.java:180)
    at org.languagetool.openoffice.Main.doProofreading(Main.java:80)

com.sun.star.uno.RuntimeException: SwXTextCursor: disposed or invalid
    at com.sun.star.bridges.jni_uno.JNI_proxy.dispatch_call(Native Method)
    at com.sun.star.bridges.jni_uno.JNI_proxy.invoke(JNI_proxy.java:185)
    at com.sun.proxy.$Proxy33.gotoNextParagraph(Unknown Source)
    at org.languagetool.openoffice.DocumentCursorTools.getAllTextParagraphs(DocumentCursorTools.java:310)
    at org.languagetool.openoffice.DocumentCache.refreshWriterCache(DocumentCache.java:209)
    at org.languagetool.openoffice.DocumentCache.refresh(DocumentCache.java:173)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:277)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:175)
    at org.languagetool.openoffice.MultiDocumentsHandler.getCheckResults(MultiDocumentsHandler.java:242)
    at org.languagetool.openoffice.MultiDocumentsHandler.doProofreading(MultiDocumentsHandler.java:180)
    at org.languagetool.openoffice.Main.doProofreading(Main.java:80)

com.sun.star.lang.DisposedException: 
    at com.sun.star.bridges.jni_uno.JNI_proxy.dispatch_call(Native Method)
    at com.sun.star.bridges.jni_uno.JNI_proxy.invoke(JNI_proxy.java:185)
    at com.sun.proxy.$Proxy38.getTextTables(Unknown Source)
    at org.languagetool.openoffice.DocumentCursorTools.getIndexAccessOfAllTables(DocumentCursorTools.java:717)
    at org.languagetool.openoffice.DocumentCursorTools.getTextOfAllTables(DocumentCursorTools.java:733)
    at org.languagetool.openoffice.DocumentCache.refreshWriterCache(DocumentCache.java:210)
    at org.languagetool.openoffice.DocumentCache.refresh(DocumentCache.java:173)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:277)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:175)
    at org.languagetool.openoffice.MultiDocumentsHandler.getCheckResults(MultiDocumentsHandler.java:242)
    at org.languagetool.openoffice.MultiDocumentsHandler.doProofreading(MultiDocumentsHandler.java:180)
    at org.languagetool.openoffice.Main.doProofreading(Main.java:80)

com.sun.star.lang.DisposedException: 
    at com.sun.star.bridges.jni_uno.JNI_proxy.dispatch_call(Native Method)
    at com.sun.star.bridges.jni_uno.JNI_proxy.invoke(JNI_proxy.java:185)
    at com.sun.proxy.$Proxy42.getDrawPage(Unknown Source)
    at org.languagetool.openoffice.DocumentCursorTools.getTextOfAllShapes(DocumentCursorTools.java:594)
    at org.languagetool.openoffice.DocumentCache.refreshWriterCache(DocumentCache.java:211)
    at org.languagetool.openoffice.DocumentCache.refresh(DocumentCache.java:173)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:277)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:175)
    at org.languagetool.openoffice.MultiDocumentsHandler.getCheckResults(MultiDocumentsHandler.java:242)
    at org.languagetool.openoffice.MultiDocumentsHandler.doProofreading(MultiDocumentsHandler.java:180)
    at org.languagetool.openoffice.Main.doProofreading(Main.java:80)

com.sun.star.lang.DisposedException: 
    at com.sun.star.bridges.jni_uno.JNI_proxy.dispatch_call(Native Method)
    at com.sun.star.bridges.jni_uno.JNI_proxy.invoke(JNI_proxy.java:185)
    at com.sun.proxy.$Proxy44.getFootnotes(Unknown Source)
    at org.languagetool.openoffice.DocumentCursorTools.getTextOfAllFootnotes(DocumentCursorTools.java:834)
    at org.languagetool.openoffice.DocumentCache.refreshWriterCache(DocumentCache.java:212)
    at org.languagetool.openoffice.DocumentCache.refresh(DocumentCache.java:173)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:277)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:175)
    at org.languagetool.openoffice.MultiDocumentsHandler.getCheckResults(MultiDocumentsHandler.java:242)
    at org.languagetool.openoffice.MultiDocumentsHandler.doProofreading(MultiDocumentsHandler.java:180)
    at org.languagetool.openoffice.Main.doProofreading(Main.java:80)

com.sun.star.lang.DisposedException: 
    at com.sun.star.bridges.jni_uno.JNI_proxy.dispatch_call(Native Method)
    at com.sun.star.bridges.jni_uno.JNI_proxy.invoke(JNI_proxy.java:185)
    at com.sun.proxy.$Proxy45.getEndnotes(Unknown Source)
    at org.languagetool.openoffice.DocumentCursorTools.getTextOfAllEndnotes(DocumentCursorTools.java:892)
    at org.languagetool.openoffice.DocumentCache.refreshWriterCache(DocumentCache.java:213)
    at org.languagetool.openoffice.DocumentCache.refresh(DocumentCache.java:173)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:277)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:175)
    at org.languagetool.openoffice.MultiDocumentsHandler.getCheckResults(MultiDocumentsHandler.java:242)
    at org.languagetool.openoffice.MultiDocumentsHandler.doProofreading(MultiDocumentsHandler.java:180)
    at org.languagetool.openoffice.Main.doProofreading(Main.java:80)

com.sun.star.lang.DisposedException: 
    at com.sun.star.bridges.jni_uno.JNI_proxy.dispatch_call(Native Method)
    at com.sun.star.bridges.jni_uno.JNI_proxy.invoke(JNI_proxy.java:185)
    at com.sun.proxy.$Proxy46.getStyleFamilies(Unknown Source)
    at org.languagetool.openoffice.DocumentCursorTools.getPagePropertySets(DocumentCursorTools.java:946)
    at org.languagetool.openoffice.DocumentCursorTools.getTextOfAllHeadersAndFooters(DocumentCursorTools.java:977)
    at org.languagetool.openoffice.DocumentCache.refreshWriterCache(DocumentCache.java:214)
    at org.languagetool.openoffice.DocumentCache.refresh(DocumentCache.java:173)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:277)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:175)
    at org.languagetool.openoffice.MultiDocumentsHandler.getCheckResults(MultiDocumentsHandler.java:242)
    at org.languagetool.openoffice.MultiDocumentsHandler.doProofreading(MultiDocumentsHandler.java:180)
    at org.languagetool.openoffice.Main.doProofreading(Main.java:80)

WARNING: DocumentCache: refresh: paragraphContainer == null - ParagraphCache not initialised

FredKruse commented 1 year ago

@marcoagpinto: Do you use LibreOffice 7.6? That version has a new feature that should make cache generation easier and faster, not slower. Could you please test whether the same problems occur after you save your document as ODT, close LO, and open the document again? (If they do not, the problem lies with the DOCX format.) Can you reproduce the problem with a simple DOCX document? I may be able to look at the problem tomorrow. After that, I'll be on vacation for two weeks.

marcoagpinto commented 1 year ago

Yes, I have the latest LibreOffice, 7.6.0.3, with Windows 11.

After installing the nightly, I turned on some rules for pt-PT.

The thesis was already in .odt when I sent you the log.

First, I opened the .docx, saved as .odt, closed LibreOffice, deleted the LanguageTool log and opened the .ODT.

After waiting for more than 20 minutes, scrolling worked fine, and then I disabled the rule that complains about paragraph length.

Then I scrolled down, and it was underlining some words in green.

I kept scrolling and clicking on the green lines, until it stopped showing the pop-up menu options (very old bug?).

So I went to the add-on toolbar and clicked the button to refresh the text analysis, and then it became slow as hell again.

So, I closed LibreOffice and pasted the log here.

Does this help?

Thanks!

FredKruse commented 10 months ago

@marcoagpinto In the current developer version (6.4) I have introduced a number of improvements that particularly affect performance. Would you like to test the version with your document and give me feedback?

marcoagpinto commented 10 months ago

@FredKruse

Sure, I will do it tonight.

Last week or so (I can't remember) I unzipped the .oxt and changed all “temp_off” rules to “on” and replaced goal-specific keywords with a space, to use all Portuguese rules, and while scrolling down the thesis I was getting tons of Java errors.

marcoagpinto commented 10 months ago

@FredKruse

I am very sad with the latest nightly.

I downloaded it, installed it, opened a small Portuguese .ODT, and then I went to the folder settings and deleted the cache + config + log, and clicked on the button to default the settings and turned on the multicore use option.

The files appeared again as expected.

I closed the latest version of LibreOffice and also the quickstart icon.

I opened my thesis in .DOCX, set the whole document language to Portuguese and saved as .ODT.

When I opened the thesis, the LanguageTool toolbar was gone, and the log file had errors in it:

LT office integration log from Thu Jan 04 04:20:26 GMT 2024

LanguageTool 6.4-SNAPSHOT (2024-01-03 17:52:33 +0000, 3b06ad5)
OS: Windows 11 10.0 on amd64
LibreOffice 7.6.4.1 (The Document Foundation), en-GB
Java-Version: 1.8.0_391, max. Heap-Space: 7262 MB, LT Heap Space Limit: 6536 MB

MultiDocumentsHandler: getLinguisticServices: linguServices set: is NOT null
CacheIO: getCachePath: cacheFileName == null!
MultiDocumentsHandler: getNumDoc: Document 0 created; docID = 1
SingleDocument: writeCaches: Copy DocumentCache
SingleDocument: writeCaches: Copy ResultCache 0
SingleDocument: writeCaches: Copy ResultCache 1
SingleDocument: writeCaches: Copy ResultCache 2
SingleDocument: writeCaches: Copy ResultCache 3
SingleDocument: writeCaches: Save Caches ...
SingleDocument: writeCaches: Copy DocumentCache
SingleDocument: writeCaches: Copy ResultCache 0
SingleDocument: writeCaches: Copy ResultCache 1
SingleDocument: writeCaches: Copy ResultCache 2
SingleDocument: writeCaches: Copy ResultCache 3
SingleDocument: writeCaches: Save Caches ...
java.lang.NullPointerException
    at org.languagetool.openoffice.FlatParagraphTools.getAllFlatParagraphs(FlatParagraphTools.java:309)
    at org.languagetool.openoffice.DocumentTextCache.refreshWriterCache(DocumentTextCache.java:262)
    at org.languagetool.openoffice.DocumentTextCache.refresh(DocumentTextCache.java:177)
    at org.languagetool.openoffice.DocumentCache.refresh(DocumentCache.java:67)
    at org.languagetool.openoffice.DocumentTextCache.refreshAndCompare(DocumentTextCache.java:2093)
    at org.languagetool.openoffice.CheckRequestAnalysis.handleCacheChanges(CheckRequestAnalysis.java:650)
    at org.languagetool.openoffice.CheckRequestAnalysis.getNumberOfParagraphFromSortedTextId(CheckRequestAnalysis.java:115)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:327)
    at org.languagetool.openoffice.SingleDocument.getCheckResults(SingleDocument.java:185)
    at org.languagetool.openoffice.MultiDocumentsHandler.getCheckResults(MultiDocumentsHandler.java:264)
    at org.languagetool.openoffice.MultiDocumentsHandler.doProofreading(MultiDocumentsHandler.java:202)
    at org.languagetool.openoffice.Main.doProofreading(Main.java:80)

WARNING: DocumentCache: refresh: paragraphContainer == null - ParagraphCache not initialised
XDrawPageSupplier == null
WARNING: DocumentCache: refresh: paragraphContainer == null - ParagraphCache not initialised
CacheIO: CacheCleanUp: Remove Path from CacheMap: /C:/XXXXXXX/LANGUAGETOOL TESTS/Sample doc 20230102.odt
CacheIO: CacheCleanUp: Remove Path from CacheMap: /C:/XXXXXXX/LANGUAGETOOL TESTS/PhD_thesis_marcoagpinto_IST_1Main_V0092unsent.odt
MultiDocumentsHandler: getNumDoc: Document 1 created; docID = 2

Also, could you please add an "Enable all Temp_off rules" button so that I don't need to enable them manually one by one, or unzip the .oxt and replace every "temp_off" with "on"?

Thanks!

marcoagpinto commented 10 months ago

@FredKruse

Ahhh… look what just happened. Last week I thought it was because I had replaced all “temp_off” with “on”, but this time I didn't unzip the .oxt: Screenshot 2024-01-04 043449

FredKruse commented 10 months ago

@marcoagpinto The error looks really serious and seems to be deeply rooted inside the extension. I urgently need the document that is causing this issue. I cannot reproduce the error with my test files. Could you provide it to me? All other errors could be secondary errors.

marcoagpinto commented 10 months ago

@FredKruse

Sent!

Please DON'T share the document with anyone, as it is highly classified.

FredKruse commented 10 months ago

I will only use the document for testing and delete it afterward.

FredKruse commented 10 months ago

@FredKruse

Ahhh… look what just happened, which I last week thought it was because I replaced all “temp_off” with “on” but this time I didn't unzip the oxt: Screenshot 2024-01-04 043449

I found the bug behind this. It is fixed now (in the next nightly).

The whole document was analyzed without problems in my tests, but it took a very long time (20 minutes on my laptop). One reason is that you use many English words in your text, and all of them are marked as spelling errors. The extension has used LT as its spell checker since 6.3. It runs very well if there are a few dozen spelling errors in the text, but hundreds of spelling errors slow the check down considerably.

If you agree, I'll do a few more tests before deleting the document. Maybe I can find a way to speed up the checks.

marcoagpinto commented 10 months ago

@FredKruse

Sure, do as many tests as possible.

Thanks!

😋 😋 😋 😋 😋 😋

FredKruse commented 10 months ago

@marcoagpinto I added an option to the options dialog (General) that enables temporarily disabled rules. It is turned off by default, so you have to switch it on, but the setting will be saved. Please test it with the next nightly.

marcoagpinto commented 10 months ago

@FredKruse

Thank you a lot!

❤️ ❤️ ❤️ ❤️ ❤️ ❤️ ❤️

FredKruse commented 10 months ago

@marcoagpinto The latest nightly also includes changes that speed up the LT checks by about 25%. Unfortunately, I have no ideas for any further changes that would help with the problem.

marcoagpinto commented 10 months ago

@FredKruse

Thank you, Fred. I still haven't tested it because @p-goulart committed changes to multiwords after the release hour, so they will only arrive in today's nightly (and I want to download the latest stand-alone tool + Wikipedia tool with those changes working).

My idea to speed up hundreds of words that appear as typos would be to store them in a dynamic array and check each word of a document there before checking in the Hunspell dictionaries… as the years go by, I get more and more ideas for complex algorithms.

marcoagpinto commented 10 months ago

@FredKruse

EDIT: For example: “Sou perito in fyrmware e cryptoware e fyrmware”.

While checking the document word by word, first check the word in a dynamic typo dictionary for that document.

If “fyrmware” isn't in the Hunspell dictionary, it would be added to a custom typo dictionary.

It would continue word by word, but now checking the typo dictionary first whenever it is not empty.

If a word is in the typos array, it is underlined as a typo, the lookup stops there, and the word is NOT checked against the Hunspell dictionaries.

This should increase speed a lot.
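
The proposal above can be sketched roughly as follows. This is only an illustration of the idea as described in this comment, not how LanguageTool or its spell checker actually works: the class name `TypoCache`, its methods, and the `dictionaryContains` predicate (a stand-in for the Hunspell lookup) are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical per-document cache of words already known to be misspelled. */
class TypoCache {
    // Words that have already failed the dictionary lookup in this document.
    private final Set<String> knownTypos = new HashSet<>();
    // Stand-in for the (assumed slow) dictionary lookup; not a real LanguageTool API.
    private final Predicate<String> dictionaryContains;
    // Counts how often the dictionary was actually consulted.
    private int dictionaryLookups = 0;

    TypoCache(Predicate<String> dictionaryContains) {
        this.dictionaryContains = dictionaryContains;
    }

    /** Returns true if the word should be underlined as a typo. */
    boolean isTypo(String word) {
        if (knownTypos.contains(word)) {
            return true; // seen before: skip the dictionary entirely
        }
        dictionaryLookups++;
        if (dictionaryContains.test(word)) {
            return false; // correctly spelled according to the dictionary
        }
        knownTypos.add(word); // remember the typo for later occurrences
        return true;
    }

    int dictionaryLookups() {
        return dictionaryLookups;
    }
}
```

With the example sentence and a toy dictionary containing only “Sou”, “perito” and “e”, the second “fyrmware” is flagged from the cache without touching the dictionary. Correctly spelled words are still looked up every time; only repeated typos are short-circuited.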

p-goulart commented 10 months ago

@marcoagpinto the improvements to the dictionary include some language-independent elements. Hopefully everything is reviewed and merged on Monday, but it could also take a little while longer. Those changes are not live.

As for your idea of keeping a separate 'typo' dictionary, I'm not that convinced there would be a considerable performance boost. It is true that fetching suggestions is the more time-consuming part of the process, so any kind of bootstrapping we can add would be great. Note, though, that Morfologik already has an .info file that we can use to prioritise specific matches over others, and it already contains a rule for y -> i. In my experience it has sped up the spellchecking process somewhat for the most frequent typos.
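
One simple form such bootstrapping could take is memoising suggestions per misspelled word, so that a typo repeated hundreds of times only pays for the expensive suggestion search once. A minimal sketch, assuming a hypothetical `SuggestionCache` class and a `slowSuggester` function standing in for the real suggestion lookup; neither is part of LanguageTool or Morfologik:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical memoising wrapper around an expensive suggestion lookup. */
class SuggestionCache {
    // Suggestions already computed, keyed by the misspelled word.
    private final Map<String, List<String>> cache = new HashMap<>();
    // Stand-in for the real (assumed slow) suggestion search.
    private final Function<String, List<String>> slowSuggester;
    // Counts how often the slow search actually ran.
    private int slowCalls = 0;

    SuggestionCache(Function<String, List<String>> slowSuggester) {
        this.slowSuggester = slowSuggester;
    }

    /** Returns cached suggestions, computing them only on the first request. */
    List<String> suggestionsFor(String misspelled) {
        return cache.computeIfAbsent(misspelled, word -> {
            slowCalls++;
            return slowSuggester.apply(word);
        });
    }

    int slowCalls() {
        return slowCalls;
    }
}
```

Whether this helps in practice depends on how often the same misspelling recurs in a document and on how much of the check time is really spent on suggestions.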

marcoagpinto commented 10 months ago

@FredKruse

Heya!

Fantastic work!

It took around 2 minutes 22 seconds to check the whole thesis on my 9th-generation i7 laptop.

LibreOffice crashed when I created a blank document, but that may be a LibreOffice bug (the GUI turned black).

Thank you!

FredKruse commented 9 months ago

@marcoagpinto Can we close this issue?