HTML Diff produces an OOM Exception

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Run a diff using HTMLDiffer on the attached files.
2. HTML diff will produce an OOM Exception
3. Runtime Exception is thrown

What is the expected output? What do you see instead?
An accurate diff report is expected, or at least something to preempt the
OOM Exception.

What version of the product are you using? On what operating system?
Version 1.1, Linux Redhat 5 64 bit, JDK 6, 2GB heap size.

Please provide any additional information below.
We've got a rich text editor that allows users to paste HTML directly into
an HTML tab.  The user in this case pasted HTML into the rich text tab
instead of the HTML tab, which allows direct HTML input.  As a result,
the HTML is escaped in one version (file2.txt) and not escaped in the
next (file1.txt).  The text is the same in both versions, but the diff
contains too many differing elements and fails with the following
exception:

Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.eclipse.compare.rangedifferencer.OldDifferencer.findDifferences(Unknown Source)
    at org.eclipse.compare.rangedifferencer.RangeDifferencer.findDifferences(Unknown Source)
    at org.eclipse.compare.rangedifferencer.RangeDifferencer.findDifferences(Unknown Source)
    at org.outerj.daisy.diff.html.HTMLDiffer.diff(Unknown Source)

Original issue reported on code.google.com by mccullough.todd on 17 Mar 2010 at 4:14

GoogleCodeExporter commented 8 years ago

Adding the correct files.

Original comment by mccullough.todd on 17 Mar 2010 at 5:10

Attachments:

GoogleCodeExporter commented 8 years ago

Confirmed on DaisyDiff 1.0 and 1.1. While the input files are themselves big,
it could also be a memory leak. Note, however, that the stack trace points to
Eclipse code and not to Daisy Diff. I do not know the Eclipse differ well
enough to look into this (if indeed this is the problem).

Original comment by kkape...@gmail.com on 17 Mar 2010 at 5:14

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Diffing is quadratic in the size of the documents. In DaisyDiff, the size is
the number of words. The escaped document looks to have very many words, and
it doesn't surprise me that it's intractable to diff these documents.
Of course, this doesn't prove that there isn't a memory leak.

Original comment by guy...@gmail.com on 17 Mar 2010 at 5:50
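
To put the quadratic claim above in numbers, here is a rough sketch. It assumes a textbook longest-common-subsequence table with one 4-byte int per pair of words; the Eclipse differencer may allocate memory differently, and `DiffCostEstimate` and its methods are hypothetical names, not part of DaisyDiff or Eclipse.

```java
// Hypothetical helper for illustration only; not part of DaisyDiff or Eclipse.
final class DiffCostEstimate {
    private DiffCostEstimate() {}

    /**
     * Bytes needed for an (leftWords + 1) x (rightWords + 1) table of 4-byte ints,
     * the size of a textbook LCS dynamic-programming table (an assumption).
     * Uses Math.multiplyExact (Java 8+) so an overflow fails loudly.
     */
    static long lcsTableBytes(long leftWords, long rightWords) {
        if (leftWords < 0 || rightWords < 0) {
            throw new IllegalArgumentException("word counts must be non-negative");
        }
        long cells = Math.multiplyExact(leftWords + 1, rightWords + 1);
        return Math.multiplyExact(cells, 4L);
    }

    /** True if the assumed table alone would fit in the given heap size. */
    static boolean fitsInHeap(long leftWords, long rightWords, long heapBytes) {
        return lcsTableBytes(leftWords, rightWords) <= heapBytes;
    }
}
```

Under that assumption, two documents of 30,000 words each (a made-up figure, not measured from the attached files) would need roughly 3.6 GB for the table alone, which exceeds the 2 GB heap from the report.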

GoogleCodeExporter commented 8 years ago

Thanks for the insight.  In the short term, I've been trying to find a way to
opt out of running a diff altogether if certain conditions are met.  Figuring
out the conditions is the hard part.  It doesn't surprise me that this fails
either, but the cause doesn't seem to be the size of the HTML at all, just the
number of differing elements.  "Normal" diffs between versions of the HTML
that don't contain the HTML escape characters work nicely, regardless of the
size of the doc.

Original comment by mccullough.todd on 17 Mar 2010 at 6:18
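
One possible shape for the opt-out described above is a cheap pre-check that counts words appearing in only one of the two documents and skips the diff when that count passes a limit. This is only a sketch: `DiffGuard`, its methods, and the threshold are hypothetical, whitespace splitting is a crude stand-in for however DaisyDiff tokenizes words, and counting distinct unshared words is a rough proxy for the number of differences, not a measure the differ itself uses.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-check for illustration only; not part of DaisyDiff.
final class DiffGuard {
    private DiffGuard() {}

    /** Number of distinct whitespace-separated words found in exactly one input. */
    static int countUnsharedWords(String left, String right) {
        Set<String> leftWords = words(left);
        Set<String> rightWords = words(right);

        // Words present in both documents.
        Set<String> shared = new HashSet<>(leftWords);
        shared.retainAll(rightWords);

        // Union of both word sets, minus the shared ones.
        Set<String> unshared = new HashSet<>(leftWords);
        unshared.addAll(rightWords);
        unshared.removeAll(shared);
        return unshared.size();
    }

    /** True if the caller should skip the diff; the limit must be tuned by the caller. */
    static boolean shouldSkipDiff(String left, String right, int maxUnsharedWords) {
        return countUnsharedWords(left, right) > maxUnsharedWords;
    }

    private static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

With an escaped and an unescaped copy of the same markup, nearly every tag-bearing word lands in the unshared set, so a check like this would flag the pair before the differ runs. A suitable limit would have to be found by experiment.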

seanshou / daisydiff

HTML Diff produces an OOM Exception #21