google-code-export / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

ContentFusion can change the order of document text #61

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Process a document with the ContentFusion class. The text of the document 
can end up out of order if blocks are merged in more than one iteration of the 
do-while loop.
2. Whenever two TextBlocks are merged, the outer loop runs again and rescans the 
entire document for further merges. However, the prevBlock variable is not 
reset to the first block of the document; it still refers to the block it held 
at the end of the previous pass (typically the last block of the document). 
As a result, block(s) near the beginning of the document can be merged into a 
block at the end of the document.

What is the expected output? What do you see instead?
Instead of staying in place, blocks from the beginning of the document are 
appended to blocks at the end of the document. These blocks should either not 
be merged at all or be merged onto the beginning of later blocks.

What version of the product are you using? On what operating system?
most recent from repository

Please provide any additional information below.
My recommendation would be to move the prevBlock initialization inside the 
do-while loop, so that it is reset to the first block at the start of every pass.

From:
TextBlock prevBlock = textBlocks.get(0);

boolean changes = false;
do {
    changes = false;
    for (ListIterator<TextBlock> it = textBlocks.listIterator(1); it.hasNext();) {

To:
boolean changes = false;
do {
    changes = false;
    TextBlock prevBlock = textBlocks.get(0);
    for (ListIterator<TextBlock> it = textBlocks.listIterator(1); it.hasNext();) {
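
To make the reordering concrete, here is a minimal, self-contained sketch of the loop described above. It does not use boilerpipe itself: the Block class, its content and mergeable flags, and the merge condition are hypothetical stand-ins for TextBlock and the real ContentFusion checks, chosen only so the effect of a stale prevBlock is visible on a five-block list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;

public class FusionOrderDemo {

    // Hypothetical stand-in for boilerpipe's TextBlock (not the real class).
    static final class Block {
        String text;
        final boolean content;   // assumed analogue of "previous block is content"
        final boolean mergeable; // assumed analogue of the checks on the current block

        Block(String text, boolean content, boolean mergeable) {
            this.text = text;
            this.content = content;
            this.mergeable = mergeable;
        }
    }

    // Runs the fusion loop. With resetPrevEachPass == false, prev is initialized
    // once before the do-while (the "From" variant); with true, it is reset to
    // the first block at the start of every pass (the "To" variant).
    static List<String> fuse(List<Block> blocks, boolean resetPrevEachPass) {
        Block prev = blocks.get(0);
        boolean changes;
        do {
            changes = false;
            if (resetPrevEachPass) {
                prev = blocks.get(0);
            }
            for (ListIterator<Block> it = blocks.listIterator(1); it.hasNext();) {
                Block tb = it.next();
                if (prev.content && tb.mergeable) {
                    prev.text += tb.text; // simplified merge: append text to prev
                    it.remove();
                    changes = true;
                } else {
                    prev = tb;
                }
            }
        } while (changes);

        List<String> out = new ArrayList<>();
        for (Block b : blocks) {
            out.add(b.text);
        }
        return out;
    }

    // Sample document: B is mergeable but follows a non-content block, so it
    // should stay where it is; D should be merged into C.
    static List<Block> sample() {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("A", false, false));
        blocks.add(new Block("B", false, true));
        blocks.add(new Block("C", true, false));
        blocks.add(new Block("D", false, true));
        blocks.add(new Block("E", true, false));
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(fuse(sample(), false)); // prints [A, CD, EB]
        System.out.println(fuse(sample(), true));  // prints [A, B, CD, E]
    }
}
```

In the first call, the first pass merges D into C and ends with prev pointing at E. The second pass then starts at index 1 with that stale prev, so B is appended to E and the document order changes. With the reset, the second pass starts from A again, finds nothing to merge, and the order is preserved.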

Original issue reported on code.google.com by aricbosc...@gmail.com on 6 Mar 2013 at 2:49