liufengyun / hashdiff

Hashdiff is a Ruby library to compute the smallest difference between two hashes
MIT License

HashDiff never finishes when comparing two arrays #49

Open nbarrientos opened 5 years ago

nbarrientos commented 5 years ago

Hi,

When debugging an issue with octocatalog-diff we narrowed the problem down to HashDiff being unable to compare a couple of arrays when LCS is used. Here's a reproducer:

require 'hashdiff'

a = { x: [] }
17000.times do
  a[:x].push((0...11).map { ('a'..'z').to_a[rand(26)] }.push('.example.org').join)
end

b = { x: a[:x] }

puts "Without LCS"
diff = HashDiff.diff(a, b, :use_lcs => false)
puts diff
puts "Done!"

puts "With LCS"
diff = HashDiff.diff(a, b)
puts diff
puts "This should never be printed"

When comparing those two arrays using LCS (which is enabled by default), the application starts eating up memory very quickly at 100% CPU usage until, in our case, the process is killed by the kernel's OOM killer, as the test machine does not have any swap:

$ ruby reproducer_synth.rb 
Without LCS
Done!
With LCS
...
Killed
kernel: Out of memory: Kill process 18783 (ruby) score 526 or sacrifice child
kernel: Killed process 18783 (ruby) total-vm:1192540kB, anon-rss:1013324kB, file-rss:0kB, shmem-rss:0kB

Yes, that's >1GiB!

# ruby --version
ruby 2.0.0p648 (2015-12-16) [x86_64-linux]
# gem list hashdiff

*** LOCAL GEMS ***

hashdiff (0.3.8)
# 

I understand that LCS is O(n²), but I'm not sure this is the expected behaviour, as the array size is moderate.

liufengyun commented 5 years ago

Thanks a lot for reporting the issue @nbarrientos. The memory behaviour is to some extent explainable, as 17000 * 17000 * 4.0 / 1024 / 1024 / 1024 ≈ 1.08 GB. A note should be added to the README that the library is not suitable for comparing arrays larger than 10000 elements.
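The arithmetic above can be sketched directly. This assumes, as in the estimate, roughly 4 bytes per cell of the n × n LCS table; `lcs_table_gib` is an illustrative helper, not part of hashdiff, and real per-entry overhead in Ruby is larger, so it is a lower bound:

```ruby
# Back-of-the-envelope size of a full n x n LCS dynamic-programming
# table, assuming ~4 bytes per cell (a lower bound for Ruby objects).
def lcs_table_gib(n, bytes_per_cell = 4.0)
  n * n * bytes_per_cell / 1024 / 1024 / 1024
end

puts format('%.2f GiB', lcs_table_gib(17_000)) # prints 1.08 GiB
```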

As far as I know, in bioinformatics there are a lot of clever algorithms to avoid such explosion problems, but I am unaware of simple algorithms that do not resort to external storage.
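For the *length* of the LCS alone, the classic two-row dynamic-programming trick needs only O(min(m, n)) space; recovering the subsequence itself (which a diff needs) takes more machinery, e.g. Hirschberg's divide-and-conquer. A minimal length-only sketch, independent of hashdiff:

```ruby
# Computes only the LCS length, keeping two DP rows instead of the
# full m x n table: still O(m*n) time, but O(n) space.
def lcs_length(a, b)
  prev = Array.new(b.size + 1, 0)
  curr = Array.new(b.size + 1, 0)
  a.each do |x|
    b.each_with_index do |y, j|
      curr[j + 1] = x == y ? prev[j] + 1 : [prev[j + 1], curr[j]].max
    end
    prev, curr = curr, prev
  end
  prev[b.size]
end

lcs_length(%w[a b c d], %w[b d]) # => 2
```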

nbarrientos commented 5 years ago

Hi @liufengyun!

Thanks for replying.

Sorry, when I read the library's documentation I did indeed see that the complexity of the algorithm was O(n²), but I naively assumed that it was the time complexity. I should have read further about the space complexity :smile:

For the fun of it, I allowed a problem of input size 10000 (diffing two arrays of 10000 1-byte elements) to consume as much memory as it needed until it finished. Here's the memory footprint of the process:

[image: memory footprint of the process, input size 10000]

Same plot but increasing the size to 15000:

[image: memory footprint of the process, input size 15000]

I agree that adding a note clarifying this would be a good idea, especially since LCS is enabled by default.
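Until such a note exists, callers can guard against large inputs themselves. A sketch of one possible wrapper; `too_big_for_lcs?`, `safe_diff`, and the 10000-element threshold (taken from the comment above) are hypothetical, not part of the hashdiff API:

```ruby
# Hypothetical guard (not part of hashdiff): fall back to the linear
# comparison when a top-level array is too big for the O(n^2) LCS
# table to fit comfortably in memory.
MAX_LCS_ELEMENTS = 10_000

def too_big_for_lcs?(hash)
  hash.is_a?(Hash) &&
    hash.values.any? { |v| v.is_a?(Array) && v.size > MAX_LCS_ELEMENTS }
end

def safe_diff(a, b)
  require 'hashdiff'
  HashDiff.diff(a, b, :use_lcs => ![a, b].any? { |h| too_big_for_lcs?(h) })
end
```

This only inspects top-level values; nested large arrays would need a recursive check.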

Have a good day! :wave:

liufengyun commented 5 years ago

@nbarrientos Cool benchmarks 👍 Thanks a lot, and have a good day!