Open dvogel opened 12 years ago
Hi Drew,
Yep, this isn't implemented yet; only POST is currently supported:
https://github.com/mediastandardstrust/superfastmatch/blob/master/src/worker.cc#L280-292
I haven't really come up with a good answer as to what precisely should be returned by a GET yet. Any suggestions?
Cheers, Donny.
I submitted the issue since the /association/ GET was in the documentation but I'm not sure it needs to exist. I'm having trouble thinking of a use case for it.
Hi again!
Well, it might be nice to browse associations in some order, such as by longest fragment, but this could be facilitated by metadata on a document. Another idea is to have Graphviz or d3.js JSON output of the top N associations. Not sure really, but it is definitely the interesting data!
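As a rough illustration of the "top N associations" idea, here is a minimal Python sketch that turns a list of associations into a nodes/links structure of the kind d3.js force layouts commonly consume. The input format (tuples of two document ids and a fragment length) and the function name `top_associations_graph` are assumptions made for this example, not anything superfastmatch returns today.

```python
import json


def top_associations_graph(associations, n):
    """Build a {"nodes": [...], "links": [...]} dict from the n longest associations.

    `associations` is a hypothetical list of (doc_a, doc_b, fragment_length) tuples.
    """
    # Keep only the n associations with the longest shared fragment.
    top = sorted(associations, key=lambda a: a[2], reverse=True)[:n]

    # Collect the documents that appear in those associations, in first-seen order.
    ids = []
    for src, dst, _ in top:
        for doc in (src, dst):
            if doc not in ids:
                ids.append(doc)
    index = {doc: i for i, doc in enumerate(ids)}

    return {
        "nodes": [{"id": doc} for doc in ids],
        "links": [
            {"source": index[src], "target": index[dst], "value": length}
            for src, dst, length in top
        ],
    }


if __name__ == "__main__":
    sample = [("a", "b", 120), ("b", "c", 40), ("a", "c", 300)]
    print(json.dumps(top_associations_graph(sample, 2), indent=2))
```

Links refer to nodes by position in the `nodes` list, so the output can be serialised with `json.dumps` and handed to a front end without further lookups.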
I keep meaning to start the documentation panel. Ideally with some kind of interactivity so you can explore the API. Do you have any recommendations for CSS presentation of a REST API?
Cheers, Donny.
Hey y'all. I was about to ask about /associations/. What's the status of this feature? I was looking for a good way to find clusters of similar documents (our data has lots of duplicates, near matches, etc.). I was hoping /associations/ would do this, but alas...
Great, great work by the way!
Hi!
I'll point you in the direction of the Go version of superfastmatch which will see some updates related to associations soon:
https://github.com/donovanhide/superfastmatch-go
Be warned, it's still alpha quality! Out of interest, what corpus are you working with?
Cheers, Donovan.
Awesome! I'll look at that. Do you think that /associations/ would help this clustering or am I misunderstanding what associations are?
Right now, I'm just using SuperFastMatch to look at text reuse in online journalism (usually from AP/Reuters). I have about 20,000 articles over a week. I wrote my own program to do this (shingling, minhashing and LSH), but yours is so much faster, and you already have the fragment extraction, which will help on the front end. Brilliant.
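For readers unfamiliar with the shingling and minhashing approach mentioned above, here is a minimal Python sketch of the general technique. It is not the program described in this comment and not how superfastmatch works; the function names, the word-level shingle size and the number of hash functions are all assumptions chosen for illustration. The LSH banding step is left out to keep it short.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-word shingles in `text` (the whole text if it is shorter than k words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """Compute a MinHash signature: for each seeded hash, keep the minimum value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt per seed gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of signature positions that agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    wire = "the minister said on tuesday that the new policy would take effect next month"
    reuse = "the minister said on tuesday that the new policy would take effect in june"
    other = "local football club wins the regional cup after a dramatic penalty shootout"
    sig = {name: minhash_signature(shingles(t)) for name, t in
           [("wire", wire), ("reuse", reuse), ("other", other)]}
    print(estimated_jaccard(sig["wire"], sig["reuse"]))
    print(estimated_jaccard(sig["wire"], sig["other"]))
```

The estimate only says how similar two documents are overall; it does not recover the shared fragments themselves, which is the part superfastmatch already provides.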
@sunlightlabs uses this version of superfastmatch to do clustering for detecting duplicate political ads for our adhawk app. Since the associations aren't directly accessible, he iterated over every document in superfastmatch and saved a list of the nearest neighbors (using the combined length of all shared fragments as the distance). If you're using Python, our python-superfastmatch library has an iterator class to make iterating over the documents easy.
Key to this approach is first running the batch association task. Once that task is complete, retrieving a document returns a list of associated documents. The adhawk database has fewer than 10k documents, though, so iteration is fairly quick. For our Churnalism app we have over a million documents and iteration is definitely slow (~10 minutes on a beefy machine). I'd be interested to see how it scales up to your corpus size.
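Once the nearest-neighbor lists have been saved, grouping them into clusters could look something like the Python sketch below. It deliberately does not call superfastmatch or python-superfastmatch: the `neighbors` dict (document id mapped to a list of `(other_id, combined_fragment_length)` pairs), the `min_shared_length` threshold and the function name are assumptions for illustration, and connected components via union-find is just one simple way to cluster, not necessarily what adhawk does.

```python
def cluster_documents(neighbors, min_shared_length):
    """Group documents into clusters by linking any pair whose shared length meets the threshold.

    `neighbors` is a hypothetical dict: doc_id -> [(other_id, combined_fragment_length), ...].
    Returns a list of sorted clusters, largest first.
    """
    parent = {doc: doc for doc in neighbors}

    def find(doc):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for doc, related in neighbors.items():
        for other, shared_length in related:
            if shared_length >= min_shared_length:
                # Merge the two clusters by pointing one root at the other.
                parent[find(other)] = find(doc)

    clusters = {}
    for doc in list(parent):
        clusters.setdefault(find(doc), set()).add(doc)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


if __name__ == "__main__":
    saved = {
        "ad1": [("ad2", 500), ("ad3", 20)],
        "ad2": [("ad1", 500)],
        "ad3": [],
        "ad4": [("ad5", 900)],
    }
    print(cluster_documents(saved, min_shared_length=100))
```

The threshold matters a lot here: set it too low and short boilerplate fragments will chain unrelated documents into one large cluster.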
@dvogel Thanks a lot for the reply! I'll take a look at the python library. Very cool.