rufuspollock-okfn / reconcile-csv

A simple OpenRefine reconciliation service that runs on top of a CSV file
BSD 2-Clause "Simplified" License

Refine hangs up on Reconcile (Question) #14

Open DavidRLowe opened 9 years ago

DavidRLowe commented 9 years ago

I've successfully initiated Reconcile CSV, but Refine has been telling me it's at 0% for about three hours now. I'm checking about 180,000 names against about 220,000. Should I assume that Refine has fatally hung up, or is it just going to take a day or more to work through the data? Thanks, & I'm excited to get this thing working! david

DavidRLowe commented 9 years ago

Eh, 20 hours later, it's at 4%... Will try chopping it down into smaller bites. But it's working, that's good.

mihi-tr commented 9 years ago

This will take a long time. If you can move to a faster machine (or one with more cores) you should see a speed-up.
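Scoring is embarrassingly parallel across the queries Refine sends, which is why extra cores translate into a near-linear speed-up. A generic Clojure illustration of that point (not the project's actual code):

```clojure
;; Each query can be scored against the candidate rows independently,
;; so pmap spreads the fuzzy comparisons across the available cores.
(defn score-all [score-fn queries rows]
  (pmap (fn [q] [q (map #(score-fn q %) rows)]) queries))
```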

cyniphile commented 9 years ago

Just curious, but the time complexity should be O(N^2), right? Each element has some constant-time operation run against each other element. But it seems like it's actually something slower, given how reconcile-csv responds to various sizes of input. I'd like to make a runtime estimate for a very large dataset. Any advice here?
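For a rough estimate, one approach is to time a small sample and scale quadratically. A back-of-the-envelope sketch (the numbers are made up purely for illustration):

```clojure
;; Extrapolate a full O(N*M) run from a timed sample. Illustrative
;; only; real runtimes also depend on memory pressure and core count.
(defn estimate-seconds
  "Scale sample-seconds (measured on n1 query rows against m1
  candidate rows) up to n2 x m2 comparisons."
  [sample-seconds n1 m1 n2 m2]
  (* sample-seconds (/ (* n2 m2) (double (* n1 m1)))))

;; hypothetical: if 1,000 names against 220,000 took 400 seconds,
;; 180,000 names would take about 180x that:
(estimate-seconds 400 1000 220000 180000 220000)
;;=> 72000.0  (20 hours)
```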

mihi-tr commented 9 years ago

Good question. I do think it should be O(N^2) - if you're just comparing one column. Actually it could be a bit faster due to memoization.

It could very well be that the memoization is actually killing you on larger datasets - you'll probably run into swapping, which then brings you to a whole different level of slowness.

180,000 lines could very well put you in that territory.
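To illustrate the trade-off (a sketch with a placeholder scorer, not the actual reconcile-csv code):

```clojure
;; clojure.core/memoize caches one entry per distinct argument list.
;; With ~180k x 220k distinct (query, candidate) pairs the cache grows
;; without bound, driving GC pressure and eventually swapping.
(def fuzzy-score
  (memoize
   (fn [a b]
     ;; placeholder scorer standing in for the real fuzzy match
     (/ 1.0 (inc (Math/abs (- (count a) (count b))))))))
```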

If you want to dabble in the code, I'm sure there are optimizations to be made. For example, right now it simply maps scores over the whole array of 180,000 lines, passing each of them along in a large memory structure, before sorting and taking only the top 5 or so. Depending on the complexity of Clojure's sort-by, it could be more efficient to sort during the map and keep the array small - just conj and re-sort whenever there is a higher score, so much smaller scored arrays get passed along. This could lead to a speed-up. (It's done here: https://github.com/okfn/reconcile-csv/blob/master/src/reconcile_csv/core.clj#L112.) A sketch of that idea follows below.
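A minimal sketch of that top-k idea (a hypothetical helper, not the code in core.clj):

```clojure
;; Keep only the k best-scoring rows while scanning the data, rather
;; than scoring everything and sorting the full result afterwards.
(defn top-k-matches
  "Return the k highest-scoring rows for query. score-fn is assumed
  to return a number; rows is any seq of row maps."
  [score-fn query rows k]
  (reduce (fn [best row]
            (let [scored (assoc row :score (score-fn query row))]
              (->> (conj best scored)
                   (sort-by :score >)   ; collection never exceeds k+1 items
                   (take k)
                   vec)))
          []
          rows))
```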

mihi-tr commented 9 years ago

Thinking quite a bit about this: the sort only happens once, so even if it is slower, the performance gain from working around it will not be much. I do think the performance loss is in the many thousands of fuzzy comparisons. I've started to work on fuzzy-string to make it more idiomatic and maybe even faster.

Updated the dependencies to a newer version in my fork. That might speed things up a bit.

Otherwise, lots of lines will always take lots of time. As said: try to increase the number of cores (up to 10, since Refine sends requests out in bundles of 10).

mihi-tr commented 9 years ago

Ok. The sort-map thing didn't work out - it was significantly slower than the approach I used.

Switched to the new version of fuzzy-string and got a 4x speedup. The fuzzy matching was very inefficient...

I'm just preparing version 0.1.2 of reconcile-csv - if you are having performance issues, switch to it.

cyniphile commented 9 years ago

Awesome, I'll give it a shot. And thanks for looking into it. A very useful tool.


DavidRLowe commented 9 years ago

Great! I'm giving 0.1.2 a try now (and only throwing 1,000 names at a time at my master list of 180,000).

tfmorris commented 9 years ago

O(N^2) doesn't scale (obviously). In the clustering algorithms we use blocking, but for a reconciliation service a more standard approach would be to index the data in some way. SimHash might be one algorithm to look at to help prune the set that you need to do the O(N^2) comparison on. Another thing to look at would be early-exit options for your scoring: if you only want the max/min score, then as soon as a partial computation can no longer beat the current best, you can abandon it.
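For instance, an edit-distance computation can bail out as soon as no cell in the current DP row is within the best bound seen so far. A sketch of that early exit (not code from reconcile-csv):

```clojure
;; Levenshtein distance with an early exit: row minima in this DP
;; never decrease, so once every value in the current row exceeds
;; `limit` the final distance must too, and we abandon the comparison.
(defn bounded-levenshtein
  [^String a ^String b limit]
  (let [n (count b)]
    (loop [i 0
           row (vec (range (inc n)))]       ; distances for a's first i chars
      (if (= i (count a))
        (peek row)
        (let [next-row
              (reduce (fn [acc j]
                        (let [cost (if (= (.charAt a i) (.charAt b j)) 0 1)]
                          (conj acc (min (inc (row (inc j)))   ; deletion
                                         (inc (peek acc))      ; insertion
                                         (+ (row j) cost)))))  ; substitution
                      [(inc i)]
                      (range n))]
          (if (> (apply min next-row) limit)
            Double/POSITIVE_INFINITY         ; early exit: hopeless match
            (recur (inc i) next-row)))))))
```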

mihi-tr commented 9 years ago

@tfmorris thanks for the pointer to simhash.

Also, exiting early on a maximum score is a great pointer. However, a lot of values will never hit the maximum, so the effect might not be worth the overhead... I'll test this.

mihi-tr commented 9 years ago

@tfmorris toyed with simhash yesterday. It is very appealing but patented. Do you know whether there is another similar algorithm that's free?

tfmorris commented 9 years ago

There are a number of open source implementations of SimHash, and I've never heard of Google enforcing the patent, but if you want to look at alternatives, a predecessor/competitor to it is MinHash, although it's likely to be patented as well. There are some other alternatives listed here: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
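To make the blocking idea concrete, a hypothetical MinHash-style sketch (not part of reconcile-csv): key each name by the minimum of a seeded hash over its character 3-grams, then run the full fuzzy comparison only within buckets.

```clojure
;; MinHash-style blocking: rows whose names share the same minimum
;; hashed 3-gram land in one bucket, so the expensive fuzzy scoring
;; runs on same-bucket pairs instead of all N x M combinations.
(defn char-3grams [s]
  (let [padded (str "  " s "  ")]
    (map #(subs padded % (+ % 3))
         (range (- (count padded) 2)))))

(defn minhash-key
  "One band of a MinHash signature: the minimum of a seeded hash over
  the string's 3-grams. Several seeds/bands improve recall."
  [s seed]
  (apply min (map #(hash [seed %]) (char-3grams s))))

(defn block-rows
  "Group rows into candidate buckets by the MinHash of (name-key row)."
  [rows name-key seed]
  (group-by #(minhash-key (name-key %) seed) rows))
```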

thulasirangan commented 8 years ago

Hi, I am able to connect to the service, but it keeps working and working and no results come back. [screenshot] Am I missing something?