mediastandardstrust / superfastmatch

A tool for bulk text comparison and analysis
http://superfastmatch.org

500 on /association/ #2

Open dvogel opened 12 years ago

dvogel commented 12 years ago

```
* About to connect() to localhost port 8080 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /association/ HTTP/1.1
> User-Agent: curl/7.21.0 (x86_64-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.15 libssh2/1.2.6
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Server: KyotoTycoon/0.9.52
< Date: Fri, 03 Feb 2012 16:20:15 GMT
< Content-Length: 0
< X-Response-Time: 0.0000 secs
< Content-Type: text/html
<
* Connection #0 to host localhost left intact
* Closing connection #0
```

This occurs both with and without data (in other words, before and after running with -reset).

donovanhide commented 12 years ago

Hi Drew,

Yep, this isn't implemented yet; only POST is currently supported:

https://github.com/mediastandardstrust/superfastmatch/blob/master/src/worker.cc#L280-292

I haven't really come up with a good answer as to what precisely should be returned by a GET yet. Any suggestions?

Cheers, Donny.

dvogel commented 12 years ago

I submitted the issue since the /association/ GET was in the documentation but I'm not sure it needs to exist. I'm having trouble thinking of a use case for it.

donovanhide commented 12 years ago

Hi again!

Well, it might be nice to browse associations in some order, such as by longest fragment, but this could be facilitated by metadata on a document. Another idea is to have a Graphviz or d3.js JSON output of the top N associations. Not sure really, but it is definitely the interesting data!
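
As a rough illustration of the "top N associations" idea, here is a minimal Python sketch that builds a d3.js-style node-link structure. The input shape (a list of `(doc_a, doc_b, fragment_length)` tuples) and the function name `top_associations_graph` are assumptions made for this example; superfastmatch does not expose associations in this form.

```python
import json


def top_associations_graph(associations, n):
    """Build a node-link dict from the n associations with the longest fragments.

    `associations` is a hypothetical list of (doc_a, doc_b, fragment_length)
    tuples; this layout is assumed for illustration only.
    """
    # Keep the n associations with the longest shared fragments.
    top = sorted(associations, key=lambda a: a[2], reverse=True)[:n]

    # Give each document an index in order of first appearance.
    index = {}
    for doc_a, doc_b, _ in top:
        for doc in (doc_a, doc_b):
            index.setdefault(doc, len(index))

    nodes = [{"id": doc} for doc in index]
    links = [
        {"source": index[a], "target": index[b], "value": length}
        for a, b, length in top
    ]
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    sample = [("doc1", "doc2", 120), ("doc2", "doc3", 45), ("doc1", "doc3", 300)]
    print(json.dumps(top_associations_graph(sample, 2), indent=2))
```

The `nodes`/`links` layout follows the convention commonly used by d3.js force-layout examples; a Graphviz output could be generated from the same `top` list instead.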

I keep meaning to start the documentation panel. Ideally with some kind of interactivity so you can explore the API. Do you have any recommendations for CSS presentation of a REST API?

Cheers, Donny.

vvisigoth commented 11 years ago

Hey y'all. I was about to ask about /associations/. What's the status of this feature? I was looking for a good way to find clusters of similar documents (our data has lots of duplicates, near matches, etc.). I was hoping /associations/ would do this, but alas...

Great, great work by the way!

donovanhide commented 11 years ago

Hi!

I'll point you in the direction of the Go version of superfastmatch which will see some updates related to associations soon:

https://github.com/donovanhide/superfastmatch-go

Be warned, it's still alpha quality! Out of interest, what corpus are you working with?

Cheers, Donovan.

vvisigoth commented 11 years ago

Awesome! I'll look at that. Do you think that /associations/ would help with this clustering, or am I misunderstanding what associations are?

Right now, I'm just using SuperFastMatch to look at text reuse in online journalism (usually from AP/Reuters). I have about 20,000 articles over a week. I wrote my own program to do this (shingling, minhashing and LSH), but yours is so much faster, and you already have the fragment extraction, which will help on the front end. Brilliant.
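
For readers unfamiliar with the shingling and minhashing mentioned above, here is a minimal Python sketch of the general technique. It is not the program described in this comment, and it is not how superfastmatch works internally; the function names, the shingle width `k`, and the number of hash functions are illustrative assumptions. The LSH banding step is left out to keep the example short.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """Hash a shingle to a 64-bit integer; each seed acts as a separate hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    if not shingle_set:
        raise ValueError("cannot compute a signature for an empty shingle set")
    return [
        min(_seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank"
    b = "the quick brown fox jumps over the lazy dog near the old mill"
    sig_a = minhash_signature(shingles(a))
    sig_b = minhash_signature(shingles(b))
    print(estimated_jaccard(sig_a, sig_b))
```

The estimate only says how much two documents overlap; it does not identify the shared fragments themselves, which is the part superfastmatch's fragment extraction provides.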

dvogel commented 11 years ago

@sunlightlabs uses this version of superfastmatch to do clustering for detecting duplicate political ads for our adhawk app. Since the associations aren't directly accessible, he iterated over every document in superfastmatch and saved a list of the nearest neighbors (using combined length of all fragments shared as the distance). If you're using python, our python-superfastmatch library has an iterator class to make iterating over the documents easy.

Key to this approach is first running the batch association task. Once that task is complete, retrieving a document returns a list of associated documents. That database has fewer than 10k documents, though, so iteration is fairly quick. For our Churnalism app we have over a million documents and iteration is definitely slow (~10 minutes on a beefy machine). I'd be interested to see how it scales up to your corpus size.
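
The nearest-neighbor approach described above can be sketched in Python as follows. This is a minimal illustration, not the actual adhawk code and not the python-superfastmatch API: the input mapping `fragments_by_doc` (document id to a list of `(other_doc_id, fragment_length)` pairs) is a hypothetical shape standing in for whatever the iteration over documents yields. The `clusters` helper, which groups documents by connected components above a threshold, is one possible follow-up step and is also an assumption.

```python
from collections import defaultdict


def nearest_neighbors(fragments_by_doc):
    """Rank each document's neighbors by combined shared-fragment length.

    `fragments_by_doc` is a hypothetical mapping:
        doc_id -> list of (other_doc_id, fragment_length)
    A larger combined length means the two documents are closer.
    """
    neighbors = {}
    for doc_id, fragments in fragments_by_doc.items():
        totals = defaultdict(int)
        for other_id, length in fragments:
            totals[other_id] += length
        # Longest combined overlap first; ties broken by id for determinism.
        neighbors[doc_id] = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return neighbors


def clusters(neighbors, min_length):
    """Group documents whose combined fragment length reaches min_length.

    Uses a small union-find structure to compute connected components.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for doc_id, ranked in neighbors.items():
        find(doc_id)  # register documents that have no strong neighbors
        for other_id, total in ranked:
            if total >= min_length:
                parent[find(doc_id)] = find(other_id)

    groups = defaultdict(set)
    for doc_id in list(parent):
        groups[find(doc_id)].add(doc_id)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    data = {
        "a": [("b", 100), ("b", 50), ("c", 10)],
        "b": [("a", 150)],
        "c": [("a", 10)],
        "d": [],
    }
    ranked = nearest_neighbors(data)
    print(ranked["a"])
    print(clusters(ranked, min_length=100))
```

Because every document is visited once and its fragment list is summed, the cost grows with the number of documents, which matches the observation that iteration becomes slow on a corpus of over a million documents.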

vvisigoth commented 11 years ago

@dvogel Thanks a lot for the reply! I'll take a look at the python library. Very cool.