Research freetext view use to inform hosting TCO reduction

mrjones-plip commented 1 month ago

Describe the issue it's been decided that storage is the highest cost for hosting total cost of ownership (TCO). It's suspected that CHT's use of CouchDB freetext views cause a large amount of disk space. But how much? How might the size of the view be reduced? Where does the UI surface the use of these views?

Describe the improvement you'd like research and document use of freetext views so we know how to safely reduce the disk use caused by using them

Describe alternatives you've considered na

Next Steps:

[x] Check Couch/HAProxy logs for instances of the views being used
[ ] community research
[ ] Remove views and run e2e tests and see what happens
[ ] Add logging for the view and manually poke around the app

Open Questions:

Where are the views being used programmatically?
What fields are being searched for?
What are users actually searching for?

CouchDB views for freetext searching

These get queried by the @medic/search shared lib:

code that generates queries against the 2 contact-related views: shared-libs/search/src/generate-search-requests.js
code that generates queries against the report-related view: shared-libs/search/src/generate-search-requests.js

These generated queries are then executed one level higher in shared-libs/search/src/search.jsThis is then imported and ran by:

the contact exporter in the API: api/src/services/export/contact-mapper.js
the report exporter in the API: api/src/services/export/report-mapper.js
the search service in the admin app: admin/src/js/services/search.js
- which is used by this other service to filter through contact types and places types when editing a user in admin/src/js/controllers/edit-user.js
the search service in the webapp: webapp/src/ts/services/search.service.ts

logs are VERY informative, even showing the keyword freetext !here's couch logs when searching for bbbbbbbbbbbbbbbbbbbbbbbbbb:

[notice] 2024-07-31T14:12:35.171290Z couchdb@127.0.0.1 <0.16948.0> 8d6a7d7e03 192-168-68-235.local-ip.medicmobile.org 172.20.0.7 medic GET /medic/_design/medic-client/_view/contacts_by_freetext?startkey=%5B%22bbbbbbbbbbbbbbbbbbbbbbbbbb%22%5D&endkey=%5B%22bbbbbbbbbbbbbbbbbbbbbbbbbb%EF%BF%B0%22%5D 200 ok 9

As well, it shows up in HA proxy for online users when searching for aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:

<150>Jul 31 14:11:16 haproxy[12]: 172.20.0.7,couchdb,200,5,0,0,GET,/medic/_design/medic-client/_view/contacts_by_freetext?startkey=%5B%22aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%22%5D&endkey=%5B%22aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%EF%BF%B0%22%5D,-,medic,'-',276,5,-,'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:129.0) Gecko/20100101 Firefox/129.0'

garethbowen commented 1 month ago

An additional "next step" that I think would be informative is taking a representative database and deleting the freetext views from the ddoc, calling view_cleanup, and then comparing the on disk size of the db. If it's not significant then we're barking up the wrong btree.

m5r commented 1 month ago

I've done some research over the last few days with a relatively large dataset to give us an idea of how much disk space we could save for large instances. I've generated 5.94M contacts and 840k reports with the test data generator that amounts to a 5.1GB medic database and an 18GB medic-client. For comparisons, one of our largest deployments has 29.3M documents in the medic database that amounts to 80GB and a 187GB medic-client.

NB: when measuring the medic-client ddoc, I'm looking at the data size on disk that CouchDB returns when querying GET /medic/_design/medic-client/_info.

mrjones sez:

You can see check this amount with curl and jq. I loaded the data up using the comment below and confirm (but I get 15GB instead of 18!?)
curl -qs http://admin:groot@localhost:5984/medic/_design/medic-client/_info | jq '(.view_index.sizes.file/1000/1000/10|round/100)'
15.19

In this PR I managed to shrink medic-client from 18GB to 10.2GB so roughly -43%. With similar proportions, that would save about 80GB of storage on that large deployment I mentioned earlier. This is very much napkin math to get an idea of what to expect but it doesn't look as big of a shrink as I initially thought :sweat_smile: In my opinion, it's still a win but before moving forward we should see how these views are queried in production and understand the intention behind those queries to figure out what are the expected results. With that piece of info we can better adjust what the view indexes and make sure we reduce the chance of potentially shipping a breaking change in the search behavior.

NB2: The PR doesn't pass CI because some integration and e2e tests are failing. They're few but they rely on the fulltext views to index every field of every documents, they query arbitrary fields and expect the values on those fields to get picked up by the index and have those documents turn up in the search. But basic searches work as expected and their respective tests pass :green_circle:

Out of curiosity I deleted the 3 _fulltext views and measured again, this time medic-client went down to 6.2GB. So those 3 fulltext views account for roughly 2/3rd of the size of medic-client. Triggering the view cleanup didn't make any visible difference.

What to do next:

look at production logs to see how it's used to adjust the view indexing with users' usage and expectations
decide if it's the right thing to pursue considering the amount of disk space estimated to save
reproduce the experiment on production data by:
- cloning a large production instance
- measuring disk space (on the machine with df for example, and via couch APIs)
- applying desired changes to the _freetext views
- :exclamation::exclamation: measuring how long it takes for couch to re-index the views
- and finally measuring how much disk space we saved
look for other approaches to reduce hosting costs:
- reduce the size of other ddocs, for reference, on that large deployment I mentioned earlier, medic-scripts represents 21.9GB, medic-conflicts 32.7GB, medic-admin 6.3GB, and medic (the ddoc) 34.7GB
- find a more storage-effective way to do fulltext search (I'm not very convinced we can achieve something meaningful here, client-side search fulltext search in general is not easy)
- find any other low hanging fruit to reduce RAM/CPU requirements (upgrading couch, optimizing a costly algorithm, that kind of thing...)

mrjones-plip commented 1 month ago

If anyone needs to reproduce testing on a large data set, here's the steps from @m5r :

Download the archive (private google drive link)

compose.yml file:

services:
  couchdb:
    image: public.ecr.aws/medic/cht-couchdb:4.9.0
    volumes:
      - ${COUCHDB_DATA:-./srv}:/opt/couchdb/data
      - cht-credentials:/opt/couchdb/etc/local.d/
    environment:
      - "COUCHDB_USER=${COUCHDB_USER:-admin}"
      - "COUCHDB_PASSWORD=${COUCHDB_PASSWORD:?COUCHDB_PASSWORD must be set}"
      - "COUCHDB_SECRET=${COUCHDB_SECRET}"
      - "COUCHDB_UUID=${COUCHDB_UUID}"
      - "SVC_NAME=${SVC_NAME:-couchdb}"
      - "COUCHDB_LOG_LEVEL=${COUCHDB_LOG_LEVEL:-error}"
    restart: always
    logging:
      driver: "local"
      options:
        max-size: "${LOG_MAX_SIZE:-50m}"
        max-file: "${LOG_MAX_FILES:-20}"
    networks:
      cht-net:

volumes:
  cht-credentials:

networks:
  cht-net:
    name: ${CHT_NETWORK:-cht-net}

and the couchdb-override.yml

services:
    couchdb:
        ports:
          - "5984:5984"
          - "5986:5986"

run it with:

COUCHDB_USER=admin COUCHDB_PASSWORD=groot docker compose -f ./compose.yml -f ./couchdb-override.yml up -d

garethbowen commented 1 month ago

@m5r Fantastic data, thanks!

I had a quick look at the PR and I think we can reduce disk space a little further by not emitting fields with <key>:<value>. That feature is poorly documented and I doubt anyone uses it in production. If that's the case it would be good to check if that makes a measurable additional improvement to disk usage.

measuring how long it takes for couch to re-index the views

This will also be important to keep an eye on - thanks for calling it out. I know the PR is in very draft state, but there are many improvements we can make to it to improve compile time with this new approach, for example, not iterating over all fields, but only checking those that are explicitly to be indexed. So all that is to say, it might be better to measure this once we've decided this is the right approach and have made the other improvements to the view.

It would also be good to check the total indexing time compared to the current freetext views because new documents will have to be indexed either way, and if this one is significantly faster then that will be another upside to consider.

jkuester commented 3 weeks ago

Out of an abundance of caution, I have taken the time to drill down a bit further into the source code that is using shared-libs/search and can confirm that there does not seem to be any programatic usage of the freetext searching (this is very much what we expected). The freetext search queries always originate from the user entering the query text either into a select2 search box or into the search box on the reports/contacts tabs.

the contact exporter in the API: api/src/services/export/contact-mapper.js
- contacts search
- Used in services/export-data.js - exportStream
  - Used in conrollers/export-data.js - get
    - Used in the /api/v2/export/:type REST endpoint to accept freetext search queries.
      - Used in webapp export.service:
        
        Contact export in contacts.component - uses freetext search to be able to export whatever the user had searched for.
        
        Reports export in reports.component - uses freetext search to be able to export whatever the user had searched for.
the report exporter in the API: api/src/services/export/report-mapper.js
- reports search
- Ditto for the above contact-mapper usage
the search service in the admin app: admin/src/js/services/search.js
- which is used by this other service to filter through contact types and places types when editing a user in admin/src/js/controllers/edit-user.js
  - contacts search - select2 service uses freetext to find people/places
the search service in the webapp: webapp/src/ts/services/search.service.ts
- webapp/src/ts/modules/reports/reports.component.ts
  - reports search - freetext to filter context based on what is entered into the search bar
  - Interestingly can load the search query from this.route.snapshot.queryParams.query
- webapp/src/ts/modules/contacts/contacts.component.ts
  - contacts search - freetext to filter context based on what is entered into search bar
- webapp/src/ts/services/contact-view-model-generator.service.ts
  - reports search = based on subjectIds, no freetext search
- webapp/src/ts/services/form.service.ts
  - reports search - based on subjectIds, no freetext search
- webapp/src/ts/services/select2-search.service.ts
  - contacts search - uses freetext to find people/places
- webapp/src/ts/services/target-aggregates.service.ts
  - contacts - based on types, no freetext search

jkuester commented 3 weeks ago

Based on the above findings that we do not have any programatic requirements around which fields are available in the freetext search, I would like to start a more detailed design discussion regarding exactly which fields we should index. @garethbowen I would appreciate your feedback on this proposal!

We should only index a field if:

It contains data that a user could be expected to know. (e.g. do not need index fields that contain UUID values, or other codified data that would not be visible to a user)
It is not overly generic such that there would be no value in searching. (e.g. it would be unhelpful to search for a contact by sex or for a report by form)
It is a "known" field that is likely to be used in most CHT deployments. (Not a custom field that is specific to a certain project.)

Given this criteria, here are the fields that I propose indexing:

Contact docs

patient_id
place_id
case_id
name
~~phone~~
~~date_of_birth~~
external_id? - No special CHT logic attached to this field, but maybe its usage is common enough in the ecosystem to make it useful to index.
house_number? - Used in Nepal by users to for an address

Report docs

fields.patient_id
fields.place_id
fields.case_id
fields.patient_name

Additionally, I propose that we index a new ~~freetext_search_index~~ search_keywords field on both contacts and reports. This field would allow partners to be able to include data from other fields into the search index. As long as everything is space-separated, they should be able to combine data from multiple actual fields into freetext_search_index.

Also, for completeness I just want to include the further recommendations Gareth made above. When we re-work the freetext view code, we should:

Avoid emitting the <key>:<value> entries
- Except for case_id:${id} - this we need to keep for the special reports-page linked search functionality.
Directly check for the intended fields we want to index on instead of looping through all of Object.keys.

mrjones-plip commented 3 weeks ago

Are all the *_id fields listed above UUIDs such that we're expecting folks to search for them or are they shorter IDs (eg 3-8 digits)? If the former, seems like we shouldn't index, yah?

jkuester commented 3 weeks ago

Those are all "shortcodes" and not UUIDs.

mrjones-plip commented 3 weeks ago

Perfect! thanks for the confirmation.

garethbowen commented 3 weeks ago

there does not seem to be any programatic usage of the freetext searching

I don't think this is correct - there is a very special use case which searches by case_id by creating a link on the field value: https://github.com/medic/cht-core/blob/34dd0303c28230c2db271e5aaec7c7780c0655d2/webapp/src/ts/services/format-data-record.service.ts#L106

I don't know if this is even used, so we could investigate to see if we can drop it, but this is a big job. Alternatively keep case_id:<shortcode> as a special case in freetext search, which if it isn't used, won't take up any disk space. Or we could rework it to be a special kind of filter so it uses a custom view rather than the generic freetext view.

I propose indexing...

I'm not sure about indexing contact phone number. I assume we'd index the standardised format, but nobody is going to type that in, so we'll need to also standardise the input, which won't work if anyone uses space separation. At this point I would leave it out, and we can add it back in if someone misses it.

Likewise I don't understand indexing the contact dob. It's skipped in the current implementation but I don't know how you would format it and how users would work out how to type it in the correct format. Also would it need to be exact or would we fuzzy match, looking for someone born in "september 2023". Can you give some examples of how this would work, what use case this would serve, and how users would discover it?

I propose that we index a new freetext_search_index field on both contacts and reports.

This is a nice idea, and means there's some extensibility for use cases we haven't thought of. It feels like a different issue, but if we do it in two releases it'll mean two expensive view reindexes so I appreciate you bringing it up now - maybe it's cleanest as two commits in the same release. How would app developers populate this field? Would it actually be fields.freetext_search_index so it's built up by enketo? Or would there be some custom code to hoist the value up to the root? Also may I recommend it's called something more intuitive, like keywords or search_keywords? I think it warrants discussion with app devs - can you raise a new issue for this to get the conversation going? Then we can decide whether there's enough alignment to get it added in the same release.

jkuester commented 3 weeks ago

there is a very special use case which searches by case_id by creating a link on the field value

:sweat: Thanks for pointing this out. I was even originally looking for the case_id usage, but just could not find the connection back to the search service. (Turns out the code path is: format-data-record.service > report-view-model-generator.service > reports-content.component HTML > reports-content.component TS. It was the re-direction through the HTML code to call the search function that I missed. :facepalm: ).

IMHO, the "proper" way to handle this functionality would be to rework it as a filter with a more dedicated view. However, I doubt we want to invest that kind of effort into this functionality right now. The most straightforward thing would be to just leave a special case in our freetext view code for case_id:${id}. The only reason I can think of to not do that would be that if we ever want to fix this up in the future, it would mean re-building the freetext views.... I guess the current approach is already not working well with the redesigned search: https://github.com/medic/cht-core/issues/8419

On a related note, @garethbowen do you know what is up with the search on the reports tab via URL query param functionality? Diana seemed to think it was related to the case_id functionality, but it does not seem like that is how it works anymore (given the code flow I mapped above). Not sure if there is something else I am missing here... :worried:

I'm not sure about indexing contact phone number

I agree with the formatting concerns here and do believe that the "freetext" search is not the best way to implement a "find-contact-by-phone" feature. That being said, my thinking was that the user would likely be able to see the formatted version of the phone number in the contact-summary. So, they would at least have a chance of knowing what to search for. I am fine not including phone in the index, particularly since if anyone misses it, it will give us an opportunity to perhaps give a proper solution for that workflow.

I don't understand indexing the contact dob. It's skipped in the current implementation

:thinking: I don't think this is accurate. If you are thinking of this check for fields that end with date in the view map code, then this actually does not apply to the date_of_birth field.... :sweat_smile: (And anecdotally I tried it out and was able to find contacts by searching for the date of birth.)

but I don't know how you would format it and how users would work out how to type it in the correct format. Also would it need to be exact or would we fuzzy match, looking for someone born in "september 2023". Can you give some examples of how this would work, what use case this would serve, and how users would discover it?

As with the phone number, I 100% agree that the formatting here is the main issue. At the same time, I guess it is possible to display that DoB to the user in the standard yyyy-mm-dd format in the contact summary and then the user could search based on what they see. As with the phone, this is bad UX and I am find not included the DoB in the index (and if anyone asks for it, we can implement a proper search). A purely generic freetext search like this is just not the best way to provide this functionality.

Mostly, my consideration around all these index fields was simply for the kinds of things that folks usually use as reference for finding people. Name/phone/DoB are def the most common that I know of. After reading what you wrote and thinking about this more, I think our generic contact search is probably not super useful for anything but searching by name and shortcode. More custom handling (and UI) would be needed to make searching by phone/DoB really useful or intuitive...I n the meantime, if folks really want to be able to search by phone/DoB, they can always just add that data to their search_keywords field!

search_keywords? I think it warrants discussion with app devs - can you raise a new issue for this to get the conversation going?

Issue logged: https://github.com/medic/cht-core/issues/9378. Will raise a discussion on Slack. :+1:

garethbowen commented 3 weeks ago

On a related note, @garethbowen do you know what is up with the search on the reports tab via URL query param functionality?

Commented in Slack but repeating here for posterity and visibility. I've traced it back to this commit (which is exactly 10 years old this Sunday :cake: ). This was added for this issue. This was done so you could link to the reports tab and show all reports about a given patient.

I don't understand indexing the contact dob. It's skipped in the current implementation

I don't think this is accurate

I was thinking of the type check but realising now that dob will either be a number or a string - thanks for refreshing my memory.

I think our generic contact search is probably not super useful for anything but searching by name and shortcode

Yes this is basically where I've landed too.

m5r commented 1 week ago

Rebasing the PR against 4.10.x and applying the changes mentioned in this thread brought the size of medic-client down to 9.8GB from 10.2GB :eyes:

mrjones-plip commented 1 week ago

can't wait for a passing build so that we can try this branch on the copy of production data!

in the meanwhile to facilitate local testing on dev instances, I've added everyone here's SSH key to the test box and I'm creating a massive tar.gz of production data. USE EXTREME CAUTION when copying this locally - it has full PII and PHI. You may not copy it to your local machine unless you have full disk encryption and a strong password on your device - full stop!

To download the 33GB file:

scp ubuntu@ec2-18-133-29-68.eu-west-2.compute.amazonaws.com:/opt/couchdb/moh-zanz-prod-data/storage/medic-core/prod.moh.zanz.couchdb.tar.gz .

Anyone who can SSH in to watchdog can copy this file. Here's the last ~50 chars of the SSH keys on it:

ssh ubuntu@ec2-18-133-29-68.eu-west-2.compute.amazonaws.com "egrep -o '.{50}$' .ssh/authorized_keys|cut -f1 -d'@'"
iEyXlktTxE0synSdvWE4CNOJPE mrjones-nubuntu-aws-pem
HDPBOLsajCEAxP/Cn+uGQwPdBz89q5D3 mrjones
rnP39+7Y3CrMOco8BFCVQitOLwnwFEgFOi mrjones
34cPwUIFXapstbch8XfLn3rfR mrjones
ELr321BrFqeGY1sBGOc+VUiDjcBwj//Q== jkuester
B/J7wvY/SGXqwHVXxWuLtVcj5RWq6ugXYCuCoAZ m5r
NeATBNbTkgE+zEHBoHyVorqJkNoPfMnWwZr eljhkrr
DOLkpvVM7ToIsHSXqKlmipkMC+ih6zwOPQ== yrimal
FuzFoexe0xSwHwdYwgAR6HXjw7DkgHR6jV++ yrimal
nA8Gt3c5K0gIsEAPMo2N+V81w7SZheSSVggr sugat
gbwQXdI1ZqFczd1NfIPOk1Qz0rf4tqqkrb/e sugat
uQLQUAp5NtQlce1jbGehAL3bd8rxqChyGrab sugat
pmEYTImgLznmwCOyp5iz3icMAha/CG/I3TiX sugat
8aZt9qIvh/Q0llDGZMw8yM3ZZWUczxlVKNVN hareet
Wuayc+etBS2QATSf4oVlMX5CytWEPxojMDai7 henok
FuzFoexe0xSwHwdYwgAR6HXjw7DkgHR6jV++ yuvraj

mrjones-plip commented 1 week ago

cc @m5r @jkuester and @sugat009 :point_up:

m5r commented 3 days ago

Thanks for the prod data and the words of caution @mrjones-plip

I pulled it down, made CouchDB index it and warm up the views, and then pushed the new freetext search views. The medic-client ddoc went down from ~21.5GB to ~8.4GB (and ~14.9GB to ~5.2GB size on disk) :eyes:

medic / cht-core

Research freetext view use to inform hosting TCO reduction #9293

Contact docs

Report docs