medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0
469 stars 217 forks source link

CouchDB2.0 and Lucene, does that all still work? #2935

Closed SCdF closed 6 years ago

SCdF commented 8 years ago

Work out if couchdb-lucene works on CouchDB 2.0.

SCdF commented 8 years ago

2.0 compatibility is being worked on, but a new release hasn't occurred yet: https://github.com/rnewson/couchdb-lucene/commit/d26bdf04d88e58745f21fd2cbdf44ac5293417ee

garethbowen commented 8 years ago

Good catch. We should consider migrating our couchdb-lucene code to mango which is effectively the same but natively supported.

SCdF commented 8 years ago

@garethbowen really? Does that also apply to our free text search views as well? I didn't think Mango was this magical…

garethbowen commented 8 years ago

@SCdF Yes. I think mango will be a huge win for our freetext views which are quite crude at the moment. I believe mango is backed by lucene, but the real advantage to us is the pouchdb-find plugin which means we can use the same syntax client side and offline. NB: I haven't actually tried any of this yet...

SCdF commented 7 years ago

Looking into how mango works, I'm not convinced we can use it for free text search. The only tool I see for finding things inside a string is regex, and regex is not allowed in indexes, and indexes are required for performance on large data sets.

I looked at both the pouchdb-find documentation and couchdb's find documentation.

garethbowen commented 7 years ago

This might be helpful: https://cloudant.com/blog/enable-full-text-search-in-apache-couchdb/

They talk about "full text search" and how they use lucene and so on, so I still feel like this should be possible...

SCdF commented 7 years ago

Right, so that doesn't seem to be with Mango exactly, it's more JS. It's not completely clear how much of Cloudant made it into CouchDB 2.0. I'll give a crack at creating one of these indexes in a ddoc and see what happens…

SCdF commented 7 years ago

OK so yeah I don't think that is in CouchDB 2.0, at least not by default. I followed the query examples in here: https://cloudant.com/for-developers/search/ . First by copying the first full ddoc they show and then using their url structure talked about later to query it, and got a 404.

garethbowen commented 7 years ago

Yeah I think you need to compile lucene compatibility in.

SCdF commented 7 years ago

@garethbowen right, but if you need to compile lucene compatibility in, that means a) you need lucene installed on the server (+Java etc) and b) it won't work with PouchDB.

So I guess the question is, do we:

SCdF commented 7 years ago

@garethbowen so I can see usages of _fti (what we're binding couchdb-lucene to) in medic-api (and sentinel for some kind of validation?), I don't actually see it in medic-webapp at all, either directly or using the medic-api api. I know you said it was still used, for analytics or something like that right?

Let's talk monday.

garethbowen commented 7 years ago

It's used for ANC analytics which is online only and queries API controllers which do the FTI query. I'm sure we can replace it with queries to postgres if we want to make couch2pg a requirement. Or we can look at replacing it with nools based targets so it can work offline.

I'm not sure about that sentinel validation but it would be great to rewrite it using views.

There's also https://github.com/nolanlawson/pouchdb-quick-search for full text indexing on client side only. We could provide some wrapper service which detects whether you're querying couch or pouch and does a lucene or pouchdb-quick-search query respectively. It's a little messy but it gives us a way to get rid of our ugly and expensive *_by_freetext views.

SCdF commented 7 years ago

Things to do:

SCdF commented 7 years ago
sglangevin commented 7 years ago

+1 for nools-based targets replacing ANC analytics

abbyad commented 7 years ago

+1 for nools-based targets replacing ANC analytics

We are doing a review of the current ANC widgets to see which ones would be needed. Some of the existing ones are no longer needed since we can make use of Tasks to highlight pregnant women that have missed their appointments. cc @diannakane

SCdF commented 7 years ago

@abbyad / @diannakane it would also be good to know, since lucene only works online, which could be replaced by Klipfolio dashboards. Either directly in klipfolio or perhaps with some kind of iframe-style arrangement.

sglangevin commented 7 years ago

@abbyad we might still need some of those for projects that are purely SMS-based, no? Or are the tasks only for the nurses? Seems like a big change from the existing workflow - perhaps I am misunderstanding though as I haven't see what you guys have worked on so far.

SCdF commented 7 years ago

So there is quite a bit of lucene going on in the API, and we should be solving things one step at a time. First we'll take Manhattan, then we'll get rid of lucene.

Looking into getting couchdb-lucene working on 2.0, I got it working (in the sense that it didn't crash, I didn't test it extensively) on 1.6. The main problem with 2.0 is that the _fti endpoint is not exposed on 5984-- the clustered endpoint-- but on 5986, the single node endpoint.

We can:

SCdF commented 7 years ago

Raised https://github.com/rnewson/couchdb-lucene/issues/239

SCdF commented 7 years ago

I've got lucene working on medic-api and presumably medic-sentinel, though it's not actually clear how I'd trigger that code:

The grossness is that to do it we attach to the secret single node port, 5986.

SCdF commented 7 years ago

Note as well that people would probably currently need a version of couchdb-lucene that is not yet released. I tested it on the master branch of couchdb-lucene.

garethbowen commented 7 years ago

I guess it works. I'm as uncomfortable as you are with this. It's probably a blocker to actually going with couchdb2.

SCdF commented 7 years ago

OK so shall we hold off merging this into the couchdb2.0 branch until:

rnewson commented 7 years ago

mango can use map/reduce and search indexes (via clouseau/dreyfus), iirc there's an 'explain' endpoint within it that will tell you which indexes it is using for your find queries to it.

for couchdb-lucene, I think the only 2.0 issue is that it's not possible to add the _fti endpoint, since couchdb 2.0 does not support that via the .ini files any longer. The replacement mechanism (couch_epi) requires adding erlang code. dreyfus does this (obviously) and constitutes an example to follow for couchdb-lucene.

SCdF commented 7 years ago

@rnewson thanks for the comment!

As I understand it we'd need to compile in clouseau / dreyfus support, correct? I've been using the MacOS binary CouchDB provides, and AFAICT the only kind of indexes I can create are Mango queries themselves.

Ideally we'd like to remove Lucene completely (bye bye JVM) and be able to support full text search offline via Pouch. One thing at a time.

rnewson commented 7 years ago

That's right, see;

https://cloudant.com/blog/enable-full-text-search-in-apache-couchdb/

and

https://cloudant.com/blog/open-sourcing-cloudant-search/

I hear you on the JVM. As far as I know there's nothing to touch Lucene in this space, though.

B.

On 2 Dec 2016, at 13:42, Stefan du Fresne notifications@github.com wrote:

@rnewson thanks for the comment!

As I understand it we'd need to compile in clouseau / dreyfus support, correct? I've been using the MacOS binary CouchDB provides, and AFAICT the only kind of indexes I can create are Mango queries themselves.

Ideally we'd like to remove Lucene completely (bye bye JVM) and be able to support full text search offline via Pouch. One thing at a time.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

SCdF commented 7 years ago

I've picked this back up again. I'm trying to run the couchdb-lucene in brew (1.1.0), which hasn't had any explicit CouchDB 2.0 support, against our code.

I… think it works?

When I have it set up to point to 5986 (the CouchDB2.0 local port) and try to curl http://admin:pass@localhost:5988/api/active-pregnancies (the easiest way I know to hit our lucene code) I get a bunch of stuff in the couchdb-lucene logs that looks like this:

2017-01-17 16:17:00,747 WARN [medic] 765E9F95-D966-CD48-990E-9E49F8BD2D49 caused TypeError: Expected argument of type object, but instead had type undefined (unnamed script#46)

Which I mean, I wouldn't be surprised if our schema wasn't particularly well formed, and more importantly above and below all of that I get:

2017-01-17 16:16:54,849 INFO [Config] Index output goes to: /usr/local/Cellar/couchdb-lucene/1.1.0/libexec/indexes
2017-01-17 16:16:54,880 INFO [Main] Accepting connections with SelectChannelConnector@localhost:5985
2017-01-17 16:17:00,320 INFO [medic] Indexing from update_seq start
...
2017-01-17 16:17:15,725 INFO [medic] View[name=_design/medic/data_records, digest=1tduwa0l1rm0s1dj74eujl2zc] now at update_seq 37500
2017-01-17 16:17:16,113 INFO [medic] View[name=_design/medic/contacts, digest=ap7af7l8s3i6x31sjic6fmq1o] now at update_seq 37500
2017-01-17 16:17:31,930 INFO [medic] View[name=_design/medic/data_records, digest=1tduwa0l1rm0s1dj74eujl2zc] now at update_seq 40960
2017-01-17 16:17:32,043 INFO [medic] View[name=_design/medic/contacts, digest=ap7af7l8s3i6x31sjic6fmq1o] now at update_seq 40960

Which sure sounds like it's correctly attaching to the _changes feed and doing what it should be.

Now the first time you curl active-pregnancies medic-api returns a 500:

::1 - admin [17/Jan/2017:03:16:59 +0000] "GET /api/active-pregnancies HTTP/1.1" - - "-" "curl/7.51.0"
Server error:
 { code: 500 }

But that's not massively helpful, and there is no indication it's couchdb-lucene's fault. It may be that couchdb-lucene returns those warnings, or it's timing out because it has to first build the index, or something like that.

If I run it again it "successfully" returns 0 active pregnancies without error. Scare quotes because, well, this stuff is still based on 0.4 schemas and what-not I imagine, so I'm not even really sure how it's supposed to reply. There may be other issues (like it looking for the incorrect style of data_record), but I'm not convinced it's because of CouchDB 2.0.

I'm going to try to understand how it works out what a pregnancy is and if it's active, so I can try to get it into a state where it should (and then hopefully does) return useful data.

rnewson commented 7 years ago

couchdb 2.0's local port 5986 is not what you want, database you make there will be unclustered. couchdb-lucene's master branch is now updated to be compatible with couchdb 2.0. I should really cut a point release...

B.

On 17 Jan 2017, at 03:28, Stefan du Fresne notifications@github.com wrote:

I've picked this back up again. I'm trying to run the couchdb-lucene in brew (1.1.0), which hasn't had any explicit CouchDB 2.0 support, against our code.

I… think it works?

When I have it set up to point to 5986 (the CouchDB2.0 local port) and try to curl http://admin:pass@localhost:5988/api/active-pregnancies (the easiest way I know to hit our lucene code) I get a bunch of stuff in the couchdb-lucene logs that looks like this:

2017-01-17 16:17:00,747 WARN [medic] 765E9F95-D966-CD48-990E-9E49F8BD2D49 caused TypeError: Expected argument of type object, but instead had type undefined (unnamed script#46)

Which I mean, I wouldn't be surprised if our schema wasn't particularly well formed, and more importantly above and below all of that I get:

2017-01-17 16:16:54,849 INFO [Config] Index output goes to: /usr/local/Cellar/couchdb-lucene/1.1.0/libexec/indexes 2017-01-17 16:16:54,880 INFO [Main] Accepting connections with SelectChannelConnector@localhost:5985 2017-01-17 16:17:00,320 INFO [medic] Indexing from update_seq start ... 2017-01-17 16:17:15,725 INFO [medic] View[name=_design/medic/data_records, digest=1tduwa0l1rm0s1dj74eujl2zc] now at update_seq 37500 2017-01-17 16:17:16,113 INFO [medic] View[name=_design/medic/contacts, digest=ap7af7l8s3i6x31sjic6fmq1o] now at update_seq 37500 2017-01-17 16:17:31,930 INFO [medic] View[name=_design/medic/data_records, digest=1tduwa0l1rm0s1dj74eujl2zc] now at update_seq 40960 2017-01-17 16:17:32,043 INFO [medic] View[name=_design/medic/contacts, digest=ap7af7l8s3i6x31sjic6fmq1o] now at update_seq 40960

Which sure sounds like it's correctly attaching to the _changes feed and doing what it should be.

Now the first time you curl active-pregnancies medic-api returns a 500:

::1 - admin [17/Jan/2017:03:16:59 +0000] "GET /api/active-pregnancies HTTP/1.1" - - "-" "curl/7.51.0" Server error: { code: 500 }

But that's not massively helpful, and there is no indication it's couchdb-lucene's fault. It may be that couchdb-lucene returns those warnings, or it's timing out because it has to first build the index, or something like that.

If I run it again it "successfully" returns 0 active pregnancies without error. Scare quotes because, well, this stuff is still based on 0.4 schemas and what-not I imagine, so I'm not even really sure how it's supposed to reply. There may be other issues (like it looking for the incorrect style of data_record), but I'm not convinced it's because of CouchDB 2.0.

I'm going to try to understand how it works out what a pregnancy is and if it's active, so I can try to get it into a state where it should (and then hopefully does) return useful data.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

SCdF commented 7 years ago

@rnewson cheers for the reply.

To make sure I understand what you're saying: if I run couchdb-lucene from master I can, with the standard CouchDB2 package (no recompilation), have couchdb-lucene.ini's [local] url pointing to the cluster port? e.g.

[local]
url = http://admin:pass@localhost:5984

Or whatever? I had got the impression somehow that we'd need to recompile CouchDB2 to support that kind of thing, but I may have gotten confused!

rnewson commented 7 years ago

Yes that's right. The two programs communicate solely via http so there's no recompile step.

Sent from my iPhone

On 18 Jan 2017, at 00:54, Stefan du Fresne notifications@github.com wrote:

@rnewson cheers for the reply.

To make sure I understand what you're saying: if I run couchdb-lucene from master I can, with the standard CouchDB2 package (no recompilation), have couchdb-lucene.ini's [local] url pointing to the cluster port? e.g.

[local] url = http://admin:pass@localhost:5984 Or whatever? I had got the impression somehow that we'd need to recompile CouchDB2 to support that kind of thing, but I may have gotten confused!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

SCdF commented 7 years ago

You know, looking at this again, I'm not actually sure why we use the proxy url, considering the only user of of couchdb-lucene is local to lucene anyway (api). We do not have use cases where someone wants to directly run lucene queries from a client / browser / other server.

I'm going to change this so we just use lucene's port and then I think we're done.

SCdF commented 7 years ago

PRs:

@garethbowen plz review. Note: this still requires using the master build of couchdb-lucene. I'll create a ticket for that in the repo. @browndav is it an imposition to build from master in medic-os? How are you actually pulling in couchdb-lucene right now?

SCdF commented 7 years ago

https://github.com/rnewson/couchdb-lucene/issues/243

SCdF commented 7 years ago

Merged into couchdb 2.0 branch

ngaruko commented 7 years ago

do we need to AT this @SCdF ?

SCdF commented 7 years ago

@ngaruko yeah I think so. Specifically, a deployment of our app using CouchDB 2.0 should work properly with how we use Lucene (we use it to generate certain export stats and the like). However, this is a thing we can do in general when we're ready to support CouchDB 2.0 in production.

http://admin:pass@localhost:5988/api/active-pregnancies

Is an example of one of our endpoints that requires Lucene. In terms of ATing that this works you'd have to talk to someone other than me sorry. Our use of Lucene is not something we deal with actively and is before my time. Marc or Sharon may know more.

sglangevin commented 7 years ago

@ngaruko assigning to you for AT.

ngaruko commented 7 years ago

I am getting 'Ok' response and no errors when I run curl -i -H "Accept: application/json" -H "Co application/json" -X GET http://admin:pass@localhost:5988/api/active-pregnancies. Will close that for now.

SCdF commented 7 years ago

Hey @ngaruko, this is roughly the same level of AT I did when I was getting the connection up and running. See my comment here: https://github.com/medic/medic-webapp/issues/2935#issuecomment-273013508

My hope is that this would be AT'd in more depth, to prove that the feature actually works (ie it actually returned active pregnancies, for example), as opposed to just showing that the lucene integration throws no errors.

Is it possible for you to do that? You may need to talk with people to learn how this feature works.

ngaruko commented 6 years ago

Do we still have to worry about this now that lucene is on the 'death row'?

SCdF commented 6 years ago

IMO we are going to be releasing 3.0 before we have removed Lucene completely, so we need to support it.

However, @garethbowen may have some alternative facts to support his alternative opinion ;-)

garethbowen commented 6 years ago

Removing the rest of lucene is scheduled for 3.0.0 currently. It may not make the cut so it's probably worth testing properly.

ngaruko commented 6 years ago

LGTM