hypothesis / h

Annotate with anyone, anywhere.
https://hypothes.is/
BSD 2-Clause "Simplified" License

Streams + Search #719

Closed dwhly closed 10 years ago

dwhly commented 11 years ago

We need to bring faceted search to streams. And potentially invoke specific streams (user, tab, etc) as faceted searches of the firehose.

As part of this, we need to add a new facet, possibly "url" or "site" which would allow targeting a certain page as part of a search of a stream.

Potentially with wildcards? E.g. nytimes.com/*

Search of a stream should be able to be invoked via parameters in the url, which then would populate as facets in the search bar of the resulting stream.

A short spec will follow.

Some discussion here: https://groups.google.com/forum/#!topic/hypothesis-forum/wbViiMXR8Kg

dwhly commented 11 years ago

Spec: https://docs.google.com/a/hypothes.is/document/d/1zW7jKC8pRuBSDmDK-pZeUZi0e-7T9NrwErthYKpJb0I/edit

gergely-ujvari commented 11 years ago

I'll use this url as a demo for the feature: http://23.21.26.107:8080/streamsearch/ It is not yet finished but you can always watch the progress.

Tasks to be done:

dwhly commented 11 years ago

Gergely: Can you change the failed match text to this: "No results were found. Listening for new matches..."

gergely-ujvari commented 11 years ago

@dwhly: Of course. Will be in the next demo update.

gergely-ujvari commented 11 years ago

I have a real tough issue here.

If we want real search, we should be able to search for any substring in some search facets (e.g. annotation text). This works great for the live update, but not so well for past data.

The problem is that ElasticSearch cannot search for arbitrary substrings unless the right index analyzer was set up beforehand. Originally we used the default analyzer for the 'text' field, which tokenized the text on whitespace, so you could only search for whole words.

So if you had a text like this: "This is my own rocking annotation text!" then searching for 'rocking' would return the annotation as a hit, but searching for 'his', 'kin' or 'tation' would not, because those are string fragments, not whole words.

What I've done now is set up an nGram(1,10) analyzer for our 'text' field (this is the version on the current demo server). It generates every possible substring token of length 1 to 10 from each saved annotation text, so word fragments now return results too.
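To make the difference concrete, here is a minimal pure-Python sketch of the two tokenization strategies described above. It is a simplified model, not ElasticSearch's actual analyzers: the helper names `whitespace_tokens` and `ngram_tokens` are made up for illustration, and lowercasing and punctuation stripping are assumptions.

```python
def whitespace_tokens(text):
    """Rough model of a whole-word analyzer: split on whitespace, lowercase."""
    words = (w.strip("!?.,;:\"'").lower() for w in text.split())
    return {w for w in words if w}


def ngram_tokens(text, min_gram=1, max_gram=10):
    """Rough model of nGram(min_gram, max_gram): every substring in that size range."""
    text = text.lower()
    tokens = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            tokens.add(text[start:start + size])
    return tokens


text = "This is my own rocking annotation text!"
words = whitespace_tokens(text)
grams = ngram_tokens(text)

for query in ("rocking", "his", "kin", "tation"):
    print(query, query in words, query in grams)

# The price of fragment matching: far more tokens to index per annotation.
print(len(words), len(grams))
```

In this model the number of generated substrings grows roughly as ten per character of text, which is in line with the slow writes on long annotations described next.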

But the problem is that saving a longer annotation takes forever, and our system throws the following error when it tries:

2013-09-13 15:03:26 [4096] [CRITICAL] WORKER TIMEOUT (pid:4109)
2013-09-13 15:03:26,207 CRITI [gunicorn.error] WORKER TIMEOUT (pid:4109)
2013-09-13 15:03:26 [4096] [CRITICAL] WORKER TIMEOUT (pid:4109)
2013-09-13 15:03:26,214 CRITI [gunicorn.error] WORKER TIMEOUT (pid:4109)

And maybe we'd even need a larger nGram. Of course, once the annotation is saved, ES returns results quite quickly.

@tilgovi: Where is this worker timeout set?

So does anybody know a better analyzer for this kind of problem? I feel that returning no results for word fragments is not a good way to go, but ES really overcomplicates this. (This would not even be an issue in any other db.)

gergely-ujvari commented 11 years ago

I forgot the link. Here is our nGram analyzer: https://github.com/hypothesis/annotator-store/blob/719-streamsearch/annotator/annotation.py#L93

Another question: Do we know how to rebuild an index without losing the already stored data in ES?

dwhly commented 11 years ago

I'll just point out (the obvious) that a key need for substrings is on the URI matching. To be able to support matching nytimes.com as a substring of www.nytimes.com/2013/08/09/science/internet-study-finds-the-persuasive-power-of-like.html

csillag commented 11 years ago

What does ES give us that SQL does not? We used to have an experimental postgres back end....

tilgovi commented 11 years ago

On Sep 13, 2013 6:23 AM, "gergely-ujvari" notifications@github.com wrote:

I've forgotten the links: Here is our nGram analyzer: https://github.com/hypothesis/annotator-store/blob/719-streamsearch/annotator/annotation.py#L93

Another question: Do we know how to rebuild an index without losing the already stored data in ES?

Yes. But we can't do it without down time yet. There's an open issue from me on annotator store for this.

tilgovi commented 11 years ago

On Sep 13, 2013 9:41 AM, "Dan Whaley" notifications@github.com wrote:

I'll just point out (the obvious) that a key need for substrings is on the URI matching. To be able to support matching nytimes.com as a substring of www.nytimes.com/2013/08/09/science/internet-study-finds-the-persuasive-power-of-like.html

Right. I would go as far as to question whether we need fragments at all for other fields. We might also get away with just stemming ("run" as a token for "running").

dwhly commented 11 years ago

I agree that arbitrary substrings for other fields are probably not a near term need. Most of the time, whole words are probably fine, or normal stem support as you mention.

For URIs, completely arbitrary substrings are probably also not needed. I would imagine most URI searches would be either domain, domain+path or the fully specified page. Paths would probably only break at discrete subtrees, i.e. at the '/'. Domains would usually be the whole host.domain.tld, or domain.tld, but in every case probably broken at '.'.

Does this make it any easier?
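As an illustration of the URI matching described above (domains broken at '.', paths broken at '/'), here is a small Python sketch. The helper `uri_tokens` is hypothetical and only models which search strings would match. It is not an ElasticSearch analyzer.

```python
def uri_tokens(uri):
    """Hypothetical helper: tokens for domain suffixes and path prefixes of a URI."""
    # Drop the scheme, if any.
    if "://" in uri:
        uri = uri.split("://", 1)[1]
    host, _, path = uri.partition("/")

    tokens = []
    # Domain suffixes, broken at '.': www.nytimes.com, nytimes.com
    labels = host.split(".")
    for i in range(len(labels) - 1):
        tokens.append(".".join(labels[i:]))

    # Path prefixes, broken at '/': host/2013, host/2013/08, ...
    prefix = host
    for segment in (s for s in path.split("/") if s):
        prefix = prefix + "/" + segment
        tokens.append(prefix)
    return tokens


uri = "www.nytimes.com/2013/08/09/science/internet-study-finds-the-persuasive-power-of-like.html"
for token in uri_tokens(uri):
    print(token)
```

With this scheme a search for nytimes.com or www.nytimes.com/2013/08 matches the page, while an arbitrary fragment such as "ytimes" does not. Each URI yields only a handful of tokens rather than one per substring.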

tilgovi commented 11 years ago

On Sep 13, 2013 9:56 AM, "Kristof Csillag" notifications@github.com wrote:

What does ES give us that SQL does not? We used to have an experimental postgres back end....

Schemaless: New fields can be added without expensive schema alterations.

JSON: The documents round trip as JSON out of the box, which is very easy for us to consume.

Analyzers: all the advanced analyzers we're discussing here

Sharding and high availability with no additional operational tooling or application code

With postgres we'd have to experiment with using the relatively recently added JSON type if we wanted to keep the schema flexibility, maybe materializing views extracted from these. Otherwise we need to be strict about every field we expect.

IIRC the full text search for PG is much less powerful, although there is a very efficient prefix (trie) index format.

On the other hand, relational modeling is pretty cool for some things, but I don't see our data as highly relational anyway. We have thread hierarchies, ACLs, users, and document metadata. But I think most of this actually makes sense denormalized to me, especially in some ambitious federation stories where, for example, the user appearing in the ACL is otherwise unknown to us.

I see little advantage and many questions from switching. That is why I did not push for Gergely to continue working on that experiment.

tilgovi commented 11 years ago

There is a path hierarchy tokenizer in ElasticSearch exactly for this.

On Sep 13, 2013 10:33 AM, "Dan Whaley" notifications@github.com wrote:

I agree that arbitrary substrings for other fields are probably not a near term need. Most of the time, whole words are probably fine, or normal stem support as you mention.

For URIs, completely arbitrary substrings are probably also not needed. I would imagine most URI searches would be either domain, domain+path or the fully specified page. Paths would probably only break at discrete subtrees, i.e. at the '/'. Domains would usually be the whole host.domain.tld, or domain.tld, but in every case probably broken at '.'.

Does this make it any easier?


tilgovi commented 11 years ago

On Sep 13, 2013 6:21 AM, "gergely-ujvari" notifications@github.com wrote:

and maybe we'd even need a larger nGram. Of course, once saved the ES gives back the results quite quickly.

@tilgovi: Where is this worker timeout is set?

Gunicorn command line or gunicorn.conf.py.

But we should not change it. 30s to write is unacceptable. nGram(1,10) is far too ambitious.
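For reference, a sketch of where that setting lives in a gunicorn.conf.py. The value shown is Gunicorn's default of 30 seconds. The same setting is available on the command line as --timeout. This only illustrates the location of the knob, not a suggestion to raise it.

```python
# gunicorn.conf.py (sketch; the file is plain Python)

# Seconds a worker may stay silent before the master logs
# "WORKER TIMEOUT" and restarts it. 30 is Gunicorn's default.
timeout = 30
```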

csillag commented 11 years ago

CouchDB back end? It has most of the benefits listed for ES.

On 2013.09.13. 19:36, "Randall Leeds" notifications@github.com wrote:

There is a path hierarchy tokenizer in elastic search exactly for this. On Sep 13, 2013 10:33 AM, "Dan Whaley" notifications@github.com wrote:

I agree that arbitrary substrings for other fields are probably not a near term need. Most of the time, whole words are probably fine, or normal stem support as you mention.

For URIs, completely arbitrary substrings are probably also not needed. I would imagine most URI searches would be either domain, domain+path or the fully specified page. Paths would probably only break at discrete subtrees, i.e. at the '/'. Domains would usually be the whole host.domain.tld, or domain.tld, but in every case probably broken at '.'.

Does this make it any easier?


tilgovi commented 11 years ago

On Sep 13, 2013 11:18 AM, "Kristof Csillag" notifications@github.com wrote:

CouchDB back end? It has most of the benefits listed for ES.

Some time in the next year or so it will have the built in clustering from BigCouch but it doesn't yet.

It also doesn't have built in search at all, although Cloudant provides it via CouchDB-Lucene and there's a really good ES river plugin.

IMO CouchDB is not an adequate replacement, but it is sane to consider primary storage in Couch and index-only search in ES. The Couch layer provides some nice, incremental firehose features. The most promising thing about that to me has always been the possibility of truly offline annotation with PouchDB. Most of what we need can be done with ephemeral pubsub, such as we could get from redis. It's also possible that Kafka+ElasticSearch is a better fit. Or others.

I'm in no rush to move off ElasticSearch as I don't see any immediate benefit.

Meanwhile, others are already working on annouch ( https://github.com/vmx/annouch) so if anyone wants to lend a hand there we can see what happens as that matures.

My suspicion is that it could be a great way to set up a self contained annotation server for the indie web dev but our needs are likely to stay ahead of it. Once you need more than just a couch the benefit of the couchapp, self-containment, is diminished.

gergely-ujvari commented 11 years ago

@tilgovi : I see the open issue for reindexing, but does that mean that with our current mapping we cannot do this? (Because currently we don't have any aliases.)

And yes, if the timeout is already set to 30s we shouldn't increase it any further. Thanks for the analyzer, I'll test it.

gergely-ujvari commented 11 years ago

@dwhly: Yes, it absolutely makes it easier. I only want to add that being able to search for whole words only is far from intuitive (IMHO). I ran into this issue while debugging, by searching for simple word fragments that were meaningful enough to me (e.g. if I want to search for the word "beautiful", maybe just typing "beau" is enough, and it is less typing).

So maybe we should warn users that they can only search for whole words, not fragments or phrases. I'll run a test with the edgeNGram analyzer (http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenizer/), which would allow us to match the beginnings/endings of words. It is definitely faster than the normal nGram.
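A minimal Python sketch of the edge n-gram idea, for comparison with the full nGram. It is a simplified model with made-up helper names: it assumes the text is split into words first and that only the front edge of each word is indexed.

```python
def edge_ngrams(word, min_gram=1, max_gram=10):
    """Prefixes of `word` with lengths min_gram..max_gram (front edge only)."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


def edge_ngram_tokens(text, min_gram=1, max_gram=10):
    """Simplified model: split on whitespace, then index each word's prefixes."""
    tokens = set()
    for word in text.lower().split():
        tokens.update(edge_ngrams(word.strip("!?.,;:\"'"), min_gram, max_gram))
    return tokens


tokens = edge_ngram_tokens("What a beautiful annotation")
print("beau" in tokens)   # a prefix of "beautiful"
print("tiful" in tokens)  # not a prefix of any word in the text
```

Each word contributes at most max_gram tokens here, instead of roughly max_gram tokens per character, which is why indexing should be much cheaper than with the full nGram.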

gergely-ujvari commented 11 years ago

It seems there is something wrong with the path_tokenizer. I tried to use it, but the index building just froze during boot. After upgrading ES (from 0.90.2 to 0.90.3) and pyes (from 0.19.1 to 0.20.1), it throws the following error:

  File "/home/ujvari/git12/h/lib/python2.7/site-packages/pyes/decorators.py", line 46, in __inner
    return fun(*args, **kwargs)
  File "/home/ujvari/git12/h/lib/python2.7/site-packages/pyes/es.py", line 817, in put_mapping
    return self.indices.put_mapping(doc_type=doc_type, mapping=mapping, indices=indices)
  File "/home/ujvari/git12/h/lib/python2.7/site-packages/pyes/managers.py", line 416, in put_mapping
    return self.conn._send_request('PUT', path, mapping)
  File "/home/ujvari/git12/h/lib/python2.7/site-packages/pyes/es.py", line 407, in _send_request
    raise_if_error(response.status, decoded)
  File "/home/ujvari/git12/h/lib/python2.7/site-packages/pyes/convert_errors.py", line 93, in raise_if_error
    raise exceptions.ElasticSearchException(error, status, result, request)
pyes.exceptions.ElasticSearchException: ClassNotFoundException[org.elasticsearch.index.analysis.pathhierarchy.PathHierarchyTokenFilterFactory];

gergely-ujvari commented 11 years ago

I've opened an issue upstream: https://github.com/elasticsearch/elasticsearch/issues/3695

gergely-ujvari commented 11 years ago

I've updated the test server to use edgeNGram for query and text, and it seems to be working fine even with quite long annotations. Searching text and quotes works like a charm. I think after a bit of testing we can keep edgeNGram.

gergely-ujvari commented 11 years ago

I've added the "load more past data" feature, triggered by scrolling down. (To test it, we either need a lot of annotations or have to set the limit low.)

This way only the uri tokenizer problem remains, plus a big test of this branch.

tilgovi commented 11 years ago

On Sep 14, 2013 1:23 AM, "gergely-ujvari" notifications@github.com wrote:

@tilgovi : I see the open issue for reindexing, but does that mean that with our current mapping we cannot do this? (Because currently we don't have any aliases.)

Not without downtime.
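For context, a sketch of the alias-based approach the question alludes to: the application talks to an alias, a new index is built with the new mapping, and the alias is switched in one request to ElasticSearch's _aliases endpoint. The index and alias names below are hypothetical, and this only builds the request body. It does not copy any documents.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints `alias` in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
body = alias_swap_body("annotator", "annotator_v1", "annotator_v2")
print(json.dumps(body, indent=2))
```

Without an alias in front of the index, clients address the index name directly, so replacing the index means a window in which it is unavailable.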

gergely-ujvari commented 11 years ago

I've made the redirections and the old streamer cleanup. So this is the current endpoint: http://23.21.26.107:8080/stream/

user-streams are redirected from /u/(user) to /stream/#?user=(user)
tag-streams are redirected from /t/(tag) to /stream/#?tags=(tag)
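The redirect rules above can be sketched as a small Python 3 function. The name `stream_redirect` and the regular expressions are assumptions for illustration. The actual routing code in h is not shown in this thread.

```python
import re
from urllib.parse import quote

# Legacy path pattern -> facet name used in the stream URL.
LEGACY_ROUTES = [
    (re.compile(r"^/u/([^/]+)/?$"), "user"),
    (re.compile(r"^/t/([^/]+)/?$"), "tags"),
]


def stream_redirect(path):
    """Return the /stream/ URL for a legacy user or tag path, or None."""
    for pattern, facet in LEGACY_ROUTES:
        match = pattern.match(path)
        if match:
            value = quote(match.group(1), safe="")
            return "/stream/#?{}={}".format(facet, value)
    return None


print(stream_redirect("/u/alice"))
print(stream_redirect("/t/science"))
```

Because the facets live in the URL fragment, the stream page itself has to read them on the client side and populate the search bar.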

gergely-ujvari commented 10 years ago

I think we are pretty much done with this, so closing.