Closed SCdF closed 6 years ago
2.0 compatibility is being worked on, but a new release hasn't occurred yet: https://github.com/rnewson/couchdb-lucene/commit/d26bdf04d88e58745f21fd2cbdf44ac5293417ee
Good catch. We should consider migrating our couchdb-lucene code to mango which is effectively the same but natively supported.
@garethbowen really? Does that also apply to our free text search views as well? I didn't think Mango was this magical…
@SCdF Yes. I think mango will be a huge win for our freetext views which are quite crude at the moment. I believe mango is backed by lucene, but the real advantage to us is the pouchdb-find plugin which means we can use the same syntax client side and offline. NB: I haven't actually tried any of this yet...
Looking into how mango works, I'm not convinced we can use it for free text search. The only tool I see for finding things inside a string is regex, and regex is not allowed in indexes, and indexes are required for performance on large data sets.
I looked at both the pouchdb-find documentation and couchdb's find documentation.
This might be helpful: https://cloudant.com/blog/enable-full-text-search-in-apache-couchdb/
They talk about "full text search" and how they use lucene and so on, so I still feel like this should be possible...
Right, so that doesn't seem to be with Mango exactly, it's more JS. It's not completely clear how much of Cloudant made it into CouchDB 2.0. I'll give a crack at creating one of these indexes in a ddoc and see what happens…
OK so yeah I don't think that is in CouchDB 2.0, at least not by default. I followed the query examples in here: https://cloudant.com/for-developers/search/ . First by copying the first full ddoc they show and then using their url structure talked about later to query it, and got a 404.
Yeah I think you need to compile lucene compatibility in.
@garethbowen right, but if you need to compile lucene compatibility in, that means a) you need lucene installed on the server (+Java etc) and b) it won't work with PouchDB.
So I guess the question is, do we:
couchdb-lucene
, and make sure it still workscouchdb-lucene
and hand-compile CouchDB 2.0 for support it@garethbowen so I can see usages of _fti
(what we're binding couchdb-lucene to) in medic-api (and sentinel for some kind of validation?), I don't actually see it in medic-webapp at all, either directly or using the medic-api api. I know you said it was still used, for analytics or something like that right?
Let's talk monday.
It's used for ANC analytics which is online only and queries API controllers which do the FTI query. I'm sure we can replace it with queries to postgres if we want to make couch2pg a requirement. Or we can look at replacing it with nools based targets so it can work offline.
I'm not sure about that sentinel validation but it would be great to rewrite it using views.
There's also https://github.com/nolanlawson/pouchdb-quick-search for full text indexing on client side only. We could provide some wrapper service which detects whether you're querying couch or pouch and does a lucene or pouchdb-quick-search query respectively. It's a little messy but it gives us a way to get rid of our ugly and expensive *_by_freetext
views.
Things to do:
api/v1/fti
and how much pain we'd be in if we just dropped support for thatexport-data.js
(exposed by api/v1/export/..
) also exposes lucene queries as an api. Work out if that was a nice to have or we actually use it.+1 for nools-based targets replacing ANC analytics
+1 for nools-based targets replacing ANC analytics
We are doing a review of the current ANC widgets to see which ones would be needed. Some of the existing ones are no longer needed since we can make use of Tasks to highlight pregnant women that have missed their appointments. cc @diannakane
@abbyad / @diannakane it would also be good to know, since lucene only works online, which could be replaced by Klipfolio dashboards. Either directly in klipfolio or perhaps with some kind of iframe-style arrangement.
@abbyad we might still need some of those for projects that are purely SMS-based, no? Or are the tasks only for the nurses? Seems like a big change from the existing workflow - perhaps I am misunderstanding though as I haven't see what you guys have worked on so far.
So there is quite a bit of lucene going on in the API, and we should be solving things one step at a time. First we'll take Manhattan, then we'll get rid of lucene.
Looking into getting couchdb-lucene working on 2.0, I got it working (in the sense that it didn't crash, I didn't test it extensively) on 1.6. The main problem with 2.0 is that the _fti
endpoint is not exposed on 5984
-- the clustered endpoint-- but on 5986
, the single node endpoint.
We can:
_config
was). Note I've already checked _node/couchdb@localhost/_fti
and that doesn't work.I've got lucene working on medic-api and presumably medic-sentinel, though it's not actually clear how I'd trigger that code:
The grossness is that to do it we attach to the secret single node port, 5986
.
Note as well that people would probably currently need a version of couchdb-lucene that is not yet released. I tested it on the master branch of couchdb-lucene.
I guess it works. I'm as uncomfortable as you are with this. It's probably a blocker to actually going with couchdb2.
OK so shall we hold off merging this into the couchdb2.0 branch until:
mango can use map/reduce and search indexes (via clouseau/dreyfus), iirc there's an 'explain' endpoint within it that will tell you which indexes it is using for your find queries to it.
for couchdb-lucene, I think the only 2.0 issue is that it's not possible to add the _fti
endpoint, since couchdb 2.0 does not support that via the .ini files any longer. The replacement mechanism (couch_epi) requires adding erlang code. dreyfus does this (obviously) and constitutes an example to follow for couchdb-lucene.
@rnewson thanks for the comment!
As I understand it we'd need to compile in clouseau / dreyfus support, correct? I've been using the MacOS binary CouchDB provides, and AFAICT the only kind of indexes I can create are Mango queries themselves.
Ideally we'd like to remove Lucene completely (bye bye JVM) and be able to support full text search offline via Pouch. One thing at a time.
That's right, see;
https://cloudant.com/blog/enable-full-text-search-in-apache-couchdb/
and
https://cloudant.com/blog/open-sourcing-cloudant-search/
I hear you on the JVM. As far as I know there's nothing to touch Lucene in this space, though.
B.
On 2 Dec 2016, at 13:42, Stefan du Fresne notifications@github.com wrote:
@rnewson thanks for the comment!
As I understand it we'd need to compile in clouseau / dreyfus support, correct? I've been using the MacOS binary CouchDB provides, and AFAICT the only kind of indexes I can create are Mango queries themselves.
Ideally we'd like to remove Lucene completely (bye bye JVM) and be able to support full text search offline via Pouch. One thing at a time.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
I've picked this back up again. I'm trying to run the couchdb-lucene in brew (1.1.0), which hasn't had any explicit CouchDB 2.0 support, against our code.
I… think it works?
When I have it set up to point to 5986
(the CouchDB2.0 local port) and try to curl http://admin:pass@localhost:5988/api/active-pregnancies
(the easiest way I know to hit our lucene code) I get a bunch of stuff in the couchdb-lucene logs that looks like this:
2017-01-17 16:17:00,747 WARN [medic] 765E9F95-D966-CD48-990E-9E49F8BD2D49 caused TypeError: Expected argument of type object, but instead had type undefined (unnamed script#46)
Which I mean, I wouldn't be surprised if our schema wasn't particularly well formed, and more importantly above and below all of that I get:
2017-01-17 16:16:54,849 INFO [Config] Index output goes to: /usr/local/Cellar/couchdb-lucene/1.1.0/libexec/indexes
2017-01-17 16:16:54,880 INFO [Main] Accepting connections with SelectChannelConnector@localhost:5985
2017-01-17 16:17:00,320 INFO [medic] Indexing from update_seq start
...
2017-01-17 16:17:15,725 INFO [medic] View[name=_design/medic/data_records, digest=1tduwa0l1rm0s1dj74eujl2zc] now at update_seq 37500
2017-01-17 16:17:16,113 INFO [medic] View[name=_design/medic/contacts, digest=ap7af7l8s3i6x31sjic6fmq1o] now at update_seq 37500
2017-01-17 16:17:31,930 INFO [medic] View[name=_design/medic/data_records, digest=1tduwa0l1rm0s1dj74eujl2zc] now at update_seq 40960
2017-01-17 16:17:32,043 INFO [medic] View[name=_design/medic/contacts, digest=ap7af7l8s3i6x31sjic6fmq1o] now at update_seq 40960
Which sure sounds like it's correctly attaching to the _changes
feed and doing what it should be.
Now the first time you curl active-pregnancies medic-api returns a 500:
::1 - admin [17/Jan/2017:03:16:59 +0000] "GET /api/active-pregnancies HTTP/1.1" - - "-" "curl/7.51.0"
Server error:
{ code: 500 }
But that's not massively helpful, and there is no indication it's couchdb-lucene's fault. It may be that couchdb-lucene returns those warnings, or it's timing out because it has to first build the index, or something like that.
If I run it again it "successfully" returns 0 active pregnancies without error. Scare quotes because, well, this stuff is still based on 0.4 schemas and what-not I imagine, so I'm not even really sure how it's supposed to reply. There may be other issues (like it looking for the incorrect style of data_record), but I'm not convinced it's because of CouchDB 2.0.
I'm going to try to understand how it works out what a pregnancy is and if it's active, so I can try to get it into a state where it should (and then hopefully does) return useful data.
couchdb 2.0's local port 5986 is not what you want, database you make there will be unclustered. couchdb-lucene's master branch is now updated to be compatible with couchdb 2.0. I should really cut a point release...
B.
On 17 Jan 2017, at 03:28, Stefan du Fresne notifications@github.com wrote:
I've picked this back up again. I'm trying to run the couchdb-lucene in brew (1.1.0), which hasn't had any explicit CouchDB 2.0 support, against our code.
I… think it works?
When I have it set up to point to 5986 (the CouchDB2.0 local port) and try to curl http://admin:pass@localhost:5988/api/active-pregnancies (the easiest way I know to hit our lucene code) I get a bunch of stuff in the couchdb-lucene logs that looks like this:
2017-01-17 16:17:00,747 WARN [medic] 765E9F95-D966-CD48-990E-9E49F8BD2D49 caused TypeError: Expected argument of type object, but instead had type undefined (unnamed script#46)
Which I mean, I wouldn't be surprised if our schema wasn't particularly well formed, and more importantly above and below all of that I get:
2017-01-17 16:16:54,849 INFO [Config] Index output goes to: /usr/local/Cellar/couchdb-lucene/1.1.0/libexec/indexes 2017-01-17 16:16:54,880 INFO [Main] Accepting connections with SelectChannelConnector@localhost:5985 2017-01-17 16:17:00,320 INFO [medic] Indexing from update_seq start ... 2017-01-17 16:17:15,725 INFO [medic] View[name=_design/medic/data_records, digest=1tduwa0l1rm0s1dj74eujl2zc] now at update_seq 37500 2017-01-17 16:17:16,113 INFO [medic] View[name=_design/medic/contacts, digest=ap7af7l8s3i6x31sjic6fmq1o] now at update_seq 37500 2017-01-17 16:17:31,930 INFO [medic] View[name=_design/medic/data_records, digest=1tduwa0l1rm0s1dj74eujl2zc] now at update_seq 40960 2017-01-17 16:17:32,043 INFO [medic] View[name=_design/medic/contacts, digest=ap7af7l8s3i6x31sjic6fmq1o] now at update_seq 40960
Which sure sounds like it's correctly attaching to the _changes feed and doing what it should be.
Now the first time you curl active-pregnancies medic-api returns a 500:
::1 - admin [17/Jan/2017:03:16:59 +0000] "GET /api/active-pregnancies HTTP/1.1" - - "-" "curl/7.51.0" Server error: { code: 500 }
But that's not massively helpful, and there is no indication it's couchdb-lucene's fault. It may be that couchdb-lucene returns those warnings, or it's timing out because it has to first build the index, or something like that.
If I run it again it "successfully" returns 0 active pregnancies without error. Scare quotes because, well, this stuff is still based on 0.4 schemas and what-not I imagine, so I'm not even really sure how it's supposed to reply. There may be other issues (like it looking for the incorrect style of data_record), but I'm not convinced it's because of CouchDB 2.0.
I'm going to try to understand how it works out what a pregnancy is and if it's active, so I can try to get it into a state where it should (and then hopefully does) return useful data.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
@rnewson cheers for the reply.
To make sure I understand what you're saying: if I run couchdb-lucene from master I can, with the standard CouchDB2 package (no recompilation), have couchdb-lucene.ini
's [local]
url pointing to the cluster port? e.g.
[local]
url = http://admin:pass@localhost:5984
Or whatever? I had got the impression somehow that we'd need to recompile CouchDB2 to support that kind of thing, but I may have gotten confused!
Yes that's right. The two programs communicate solely via http so there's no recompile step.
Sent from my iPhone
On 18 Jan 2017, at 00:54, Stefan du Fresne notifications@github.com wrote:
@rnewson cheers for the reply.
To make sure I understand what you're saying: if I run couchdb-lucene from master I can, with the standard CouchDB2 package (no recompilation), have couchdb-lucene.ini's [local] url pointing to the cluster port? e.g.
[local] url = http://admin:pass@localhost:5984 Or whatever? I had got the impression somehow that we'd need to recompile CouchDB2 to support that kind of thing, but I may have gotten confused!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
You know, looking at this again, I'm not actually sure why we use the proxy url, considering the only user of of couchdb-lucene is local to lucene anyway (api). We do not have use cases where someone wants to directly run lucene queries from a client / browser / other server.
I'm going to change this so we just use lucene's port and then I think we're done.
PRs:
@garethbowen plz review. Note: this still requires using the master build of couchdb-lucene. I'll create a ticket for that in the repo. @browndav is it an imposition to build from master in medic-os? How are you actually pulling in couchdb-lucene right now?
Merged into couchdb 2.0 branch
do we need to AT this @SCdF ?
@ngaruko yeah I think so. Specifically, a deployment of our app using CouchDB 2.0 should work properly with how we use Lucene (we use it to generate certain export stats and the like). However, this is a thing we can do in general when we're ready to support CouchDB 2.0 in production.
http://admin:pass@localhost:5988/api/active-pregnancies
Is an example of one of our endpoints that requires Lucene. In terms of ATing that this works you'd have to talk to someone other than me sorry. Our use of Lucene is not something we deal with actively and is before my time. Marc or Sharon may know more.
@ngaruko assigning to you for AT.
I am getting 'Ok' response and no errors when I run curl -i -H "Accept: application/json" -H "Co application/json" -X GET http://admin:pass@localhost:5988/api/active-pregnancies
.
Will close that for now.
Hey @ngaruko, this is roughly the same level of AT I did when I was getting the connection up and running. See my comment here: https://github.com/medic/medic-webapp/issues/2935#issuecomment-273013508
My hope is that this would be AT'd in more depth, to prove that the feature actually works (ie it actually returned active pregnancies, for example), as opposed to just showing that the lucene integration throws no errors.
Is it possible for you to do that? You may need to talk with people to learn how this feature works.
Do we still have to worry about this now that lucene is on the 'death row'?
IMO we are going to be releasing 3.0 before we have removed Lucene completely, so we need to support it.
However, @garethbowen may have some alternative facts to support his alternative opinion ;-)
Removing the rest of lucene is scheduled for 3.0.0 currently. It may not make the cut so it's probably worth testing properly.
LGTM
Work out if couchdb-lucene works on CouchDB 2.0.