henhenfauzi / text-mining

Automatically exported from code.google.com/p/text-mining

StringIndexOutOfBoundsException in Word97TextExtractor #1

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run the attached unit test against the attached Word 97 document.
2. Observe that the test fails with the following stack trace:
java.lang.StringIndexOutOfBoundsException: String index out of range: 3480
    at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:879)
    at java.lang.StringBuffer.substring(StringBuffer.java:416)
    at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:138)
    at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
    at com.textmining.test.TextMiningTest.testWord97(TextMiningTest.java:14)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
    at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:38)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
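
For context, the top two frames show the exception coming from StringBuffer.substring, which rejects an end index larger than the buffer length. The following is a minimal sketch of that behaviour, independent of TextMining; the class name, helper method, and sample text are made up for illustration and are not taken from the attached test.

```java
public class SubstringOutOfRange {
    // Hypothetical helper: reports whether substring(start, end) on the
    // buffer throws StringIndexOutOfBoundsException.
    static boolean throwsOutOfBounds(StringBuffer buf, int start, int end) {
        try {
            buf.substring(start, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        StringBuffer allTxt = new StringBuffer("hello"); // length 5
        // end == length is a valid range
        System.out.println(throwsOutOfBounds(allTxt, 0, 5));
        // end > length is rejected, as in the trace above
        System.out.println(throwsOutOfBounds(allTxt, 0, 6));
    }
}
```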

What is the expected output? What do you see instead?
The unit test should pass, with the text extracted successfully. Instead, it 
fails with the StringIndexOutOfBoundsException shown above.

What version of the product are you using? On what operating system?
TextMining 1.0.
Microsoft Windows XP Professional 2002 Service Pack 2

Please provide any additional information below.

Original issue reported on code.google.com by dgoldenb...@yahoo.com on 22 Apr 2008 at 4:01

GoogleCodeExporter commented 9 years ago
I hit this issue too. I worked around it by adding sanity checks on the 
substring bounds.
See the code below; the lines I added are marked with comments.

    for (int x = 0; x < textRuns.size(); x++)
    {
      CHPX chpx = (CHPX)textRuns.get(x);
      if (!isDeleted(chpx.getGrpprl()))
      {
        /*
         * Begin sanity checks
         *      1. If end > length, force end == length
         *      2. If start < 0 or start > end (after clamping), skip the run
         */
        int start = chpx.getStart();
        int end = Math.min(chpx.getEnd(), allTxt.length());
        if (start < 0 || start > end) continue;
        /*
         * End sanity checks
         */
        String str = allTxt.substring(start, end);
        scrubber.append(stringWriter, str);
      }
    }

Original comment by mikebel...@gmail.com on 16 Jun 2009 at 5:27
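
The bounds checks in the workaround above can be tried in isolation. The following is a minimal sketch, assuming only that a text run is described by a start and an end offset into the extracted text; SafeRun and safeSubstring are hypothetical names, not part of TextMining or POI, and returning an empty string stands in for the `continue` in the loop.

```java
public class SafeRun {
    // Hypothetical helper: returns text[start, end) with end clamped to the
    // text length, or "" when the run cannot be used (negative start, or
    // start past the clamped end).
    static String safeSubstring(CharSequence text, int start, int end) {
        int clampedEnd = Math.min(end, text.length());
        if (start < 0 || start > clampedEnd) {
            return "";
        }
        return text.subSequence(start, clampedEnd).toString();
    }

    public static void main(String[] args) {
        StringBuffer allTxt = new StringBuffer("hello world");
        // End offset far past the buffer, like the 3480 in the report:
        // the run is truncated instead of throwing.
        System.out.println(safeSubstring(allTxt, 6, 3480));
        // Start past the end of the buffer: the run is skipped.
        System.out.println(safeSubstring(allTxt, 50, 60).isEmpty());
    }
}
```

Clamping hides the symptom rather than explaining why the run offsets exceed the text length, so it is best treated as a workaround until the cause in the document parsing is understood.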