Move core search over to ElasticSearch on K8s

ryanhugh commented 6 years ago

This is a big change, but it will let us add a bunch of features we've been hoping for, such as:

Spellcheck (#9)
Filtering (#1)

Will probably help with #21 too.

It will also fix a bug with acronyms because it allows us to apply different tokenization settings to each field, instead of just one setting to the entire index.

Also, this will allow us to put a lot more data into the search index than we currently search over. Right now the production server (1 GB RAM) is very close to running out of memory with only 5 semesters' worth of data (3 of which are summers). We can store a lot more data in Amazon.

We will have to combine all of the data into one CloudSearch index and then just filter over it in every search.

ryanhugh commented 5 years ago

Actually, I'm now thinking it would be better to move to Elasticsearch on K8s (or EC2). Let's research this some more before we start investing time moving over to one or the other.

dajinchu commented 5 years ago

Preliminary version completed: https://github.com/sandboxneu/searchneu/tree/elasticsearch

Steps to set up:

Run elasticsearch 7 locally. I have it running in Docker.
Change /backend/elastic.js to your actual elasticsearch url. This is probably http://localhost:9200, but it's currently something else due to my docker-toolbox setup.
Run yarn scrape AND yarn scrape_classes. This is because the class scraper doesn't reindex in DEV mode. I think there is probably a better pipeline for indexing stuff.

Missing Feature Parity

I have not added support for searching by: description, acronym, primaryRole, or primaryDepartment. I also do not strip middle names from "Jon S. Doe", so if you search "S", Jon will show up. It is trivial to bring this functionality back, I just haven't done it yet.

Notes On How It Works

If I understand the current code correctly, only searchable fields are stored in the elasticlunr.js index, and the actual documents are stored separately in a map, to be retrieved at query time. With Elasticsearch, I am putting everything into the Elasticsearch index, and telling Elasticsearch not to index (most of) the fields that we aren't querying on.
Check out /backend/scrapers/esMapping.json
With Elasticsearch, it is now fairly trivial to add autocomplete predictions, "did you mean?," and all kinds of filters.
The index takes up around 30 MB of disk space.
While running, Elasticsearch uses around 800 MB of memory. I think this can be optimized, and it might make sense to run Elasticsearch on another server/cluster of servers. Maybe K8s, as Ryan mentioned.
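
The "store everything, index only what we query" idea in the notes above could look roughly like the mapping below. This is a sketch and not the contents of esMapping.json: the field names (`name`, `subject`, `url`, `sections`) are assumptions made for illustration. `index: false` and `enabled: false` are standard Elasticsearch 7 mapping parameters.

```javascript
// Hypothetical mapping for a class document. Field names are assumptions,
// not the real contents of /backend/scrapers/esMapping.json.
const classMapping = {
  mappings: {
    properties: {
      // Searched: analyzed and added to the inverted index.
      name: { type: 'text' },
      subject: { type: 'keyword' },
      // Returned with each hit in _source, but not searchable.
      url: { type: 'keyword', index: false },
      // Kept in _source only; Elasticsearch does not parse or index its contents.
      sections: { type: 'object', enabled: false },
    },
  },
};

// List the fields that can actually be queried under a mapping like this.
function searchableFields(mapping) {
  return Object.entries(mapping.mappings.properties)
    .filter(([, def]) => def.index !== false && def.enabled !== false)
    .map(([field]) => field);
}

console.log(searchableFields(classMapping));
```

The full document still comes back in `_source` for every hit, so a separate hash-to-document map would not be needed for retrieval.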

Need Help

I have done my best to add elasticsearch without getting rid of too much of the current system. However, I think the current way of using the mappings between hashes and documents should not be necessary if the full documents are stored in elasticsearch. I think this means the whole DataLib.js stuff can be superseded by a request to elasticsearch, but I definitely need help figuring out where elasticsearch can replace current functionality.

Docker??

I am also curious what y'all think about dockerizing searchneu? With the inclusion of Elasticsearch it will become a little tougher for people wanting to contribute to set up their dev environment, because they'll have to get Elasticsearch and also set up their server to point to the Elasticsearch URL. Then the prod environment must also point to a different Elasticsearch URL. With Docker, the code could just point to http://elasticsearch:9200 or something, and Docker can define the networking so that it points to either the local dev server or the production ES server.
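
With or without Docker, one way to avoid hard-coding a per-machine URL is to read it from the environment and fall back to the local default. A minimal sketch, assuming an environment variable named `ELASTICSEARCH_URL` (a hypothetical name, not something the project defines):

```javascript
// ELASTICSEARCH_URL is a hypothetical variable name used for illustration.
// The env parameter defaults to process.env so the function is easy to test.
function resolveElasticUrl(env = process.env) {
  return env.ELASTICSEARCH_URL || 'http://localhost:9200';
}

// Local development with nothing configured falls back to localhost.
console.log(resolveElasticUrl({}));

// Inside a Docker network, the service name can serve as the hostname.
console.log(resolveElasticUrl({ ELASTICSEARCH_URL: 'http://elasticsearch:9200' }));
```

The same code then runs unchanged in dev and prod; only the environment differs.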

NEUDitao commented 5 years ago

he used y'all teehee

Looks good though!

ryanhugh commented 5 years ago

Looks awesome!!! Great work! 🥳🎉

Run yarn scrape AND yarn scrape_classes. This is because the class scraper doesn't reindex in DEV mode. I think there is probably a better pipeline for indexing stuff.

I'll pm you about this.

Missing Feature Parity

I have not added support for searching by: description, acronym, primaryRole, or primaryDepartment. I also do not strip middle names from "Jon S. Doe", so if you search "S", Jon will show up. It is trivial to bring this functionality back, I just haven't done it yet.

Sounds good! That is here

Notes On How It Works

If I understand the current code correctly, only searchable fields are stored in the elasticlunr.js index, and the actual documents are stored separately in a map, to be retrieved at query time. With Elasticsearch, I am putting everything into the Elasticsearch index, and telling Elasticsearch not to index (most of) the fields that we aren't querying on.

This is correct - if we move over to Elasticsearch we should be able to eliminate elasticlunr.js and DataLib entirely. The only note is that there are some spots (example) where we are using DataLib for things other than direct lookup. We would have to move those calls over to pulling data from Elasticsearch too.

Check out /backend/scrapers/esMapping.json

With Elasticsearch, it is now fairly trivial to add autocomplete predictions, "did you mean?," and all kinds of filters.

Awesome!

The index takes up around 30 MB of disk space.

While running, Elasticsearch uses around 800 MB of memory. I think this can be optimized, and it might make sense to run Elasticsearch on another server/cluster of servers. Maybe K8s, as Ryan mentioned.

K8s might be super cool, but on Amazon it is actually quite expensive: $100 per month for the root node (at least 1 is required), and each additional node costs the same as an individual EC2 server. Right now we probably have enough credit to cover this, but the current credit runs out in 2020...

Might just be easier to stick with EC2 while we are only supporting NEU and don't have any big scaling plans?

Need Help

I have done my best to add elasticsearch without getting rid of too much of the current system. However, I think the current way of using the mappings between hashes and documents should not be necessary if the full documents are stored in elasticsearch. I think this means the whole DataLib.js stuff can be superseded by a request to elasticsearch, but I definitely need help figuring out where elasticsearch can replace current functionality.

yup. see comment above

Docker??

I am also curious what y'all think about dockerizing searchneu? With the inclusion of Elasticsearch it will become a little tougher for people wanting to contribute to set up their dev environment, because they'll have to get Elasticsearch and also set up their server to point to the Elasticsearch URL. Then the prod environment must also point to a different Elasticsearch URL. With Docker, the code could just point to http://elasticsearch:9200 or something, and Docker can define the networking so that it points to either the local dev server or the production ES server.

Docker doesn’t work on Windows 10 Home edition or Windows 10 Student edition (you need Windows 10 Professional). Docker also doesn’t work inside Windows Subsystem for Linux.

Docker also doesn’t work on ARM based CPUs (like some chromebooks, eg the Asus C302).

Do we want students with these laptops to be able to work on this project? A priority of this project in the past has been that it is super easy to set up - right now all you have to do is install node and yarn and run yarn start - everything else is automatic. What can we do to keep it easy to set up for development? I've got some wild ideas about what we could do here, maybe we could chat sometime.

On the other hand, if you want to use it for the deployment of the data from Travis CI to EC2 that would be totally cool 🚀

dajinchu commented 5 years ago

Okay awesome!

What's the reasoning for running the scraping on Travis?

ryanhugh commented 5 years ago

ahhahahahah I should really document some of this stuff - that's a pretty common question.

The scrapers run on Travis CI because Travis CI provides a nice UI to view and manage the scrapers' logs. It has controls for running the scrapers whenever we want, can clear a log, etc. It also has a feature that can run a job once per day. It's also free!!!!

If we wanted to use an Amazon server, we would have to invest time into figuring out how to view these jobs, manage the logs, start an Amazon server once per day for 30 min, etc. This is all possible, it's just effort. It also costs more money.

ryanhugh commented 5 years ago

Also - is it possible in Elasticsearch to have a different tokenization process for each field? And Elasticsearch supports spellcheck, right? #9 And Elasticsearch will still be fast?

Would love to see how many of these are fixed with elasticsearch too: #21

edward-shen commented 5 years ago

Elasticsearch doesn't support spellcheck per se, but it does support term suggestion, which offers terms based on edit distance. It also supports fuzzy searching.
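
For reference, both features are expressed in the body of a normal search request. A rough sketch of the two bodies, assuming a text field called `name` and a suggestion label `didYouMean` (both placeholders chosen here, not names from the project):

```javascript
// Term suggester: asks Elasticsearch for indexed terms within a small edit
// distance of each word in `text`. Field and label names are placeholders.
function termSuggestBody(text, field = 'name') {
  return {
    suggest: {
      didYouMean: {
        text,
        term: { field },
      },
    },
  };
}

// Fuzzy match: the query itself tolerates typos. 'AUTO' lets Elasticsearch
// pick the allowed edit distance based on the length of each term.
function fuzzyMatchBody(text, field = 'name') {
  return {
    query: {
      match: {
        [field]: { query: text, fuzziness: 'AUTO' },
      },
    },
  };
}

console.log(JSON.stringify(termSuggestBody('algoritms'), null, 2));
console.log(JSON.stringify(fuzzyMatchBody('algoritms'), null, 2));
```

The suggester returns candidate corrections that a UI could show as "did you mean?", while the fuzzy query silently matches near-misses.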

As for performance, I'm pretty sure it can handle the corpus size of SearchNEU easily. I don't know how it sizes up with the current searching method for SearchNEU, but it should be in the same ballpark.

dajinchu commented 5 years ago

On my computer, elasticsearch seems to average 30ms, but I haven't really measured it much. Repeated queries are faster because it caches.

Let me do some measurements...

ryanhugh commented 5 years ago

Yeah, I totally should have gone with elastic search when I was deciding what to go with a while ago lol

And yup, I'm not worried about speed at all, mostly just curious.

We can also get rid of some of the caching in the backend too if we move to elasticsearch

dajinchu commented 5 years ago

Yeah for sure. Elasticsearch lets you specify an index-time tokenizer and query-time tokenizer for each field. I think autocomplete + "did you mean(spellcheck)" + filters would help significantly with search quality, because then the actual search wouldn't have to be fuzzy.
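
In Elasticsearch terms this is the `analyzer` (index time) and `search_analyzer` (query time) mapping parameters on each text field. A sketch of what that could look like for autocomplete, where the field names and the `autocomplete` analyzer name are assumptions made for illustration:

```javascript
// Hypothetical index definition. Field and analyzer names are illustrative.
const indexDefinition = {
  settings: {
    analysis: {
      filter: {
        // Emits prefixes of each token: "algo" -> "a", "al", "alg", "algo".
        autocomplete_filter: { type: 'edge_ngram', min_gram: 1, max_gram: 20 },
      },
      analyzer: {
        autocomplete: {
          type: 'custom',
          tokenizer: 'standard',
          filter: ['lowercase', 'autocomplete_filter'],
        },
      },
    },
  },
  mappings: {
    properties: {
      // Prefixes are indexed, but the user's query is analyzed normally.
      name: { type: 'text', analyzer: 'autocomplete', search_analyzer: 'standard' },
      // A different field can use a completely different analyzer.
      desc: { type: 'text', analyzer: 'english' },
      // Keyword fields are matched exactly and are not analyzed.
      subject: { type: 'keyword' },
    },
  },
};

// Report which analyzers apply to a text field at index and query time.
// When search_analyzer is omitted, the index-time analyzer is reused.
function analyzersFor(definition, field) {
  const def = definition.mappings.properties[field];
  if (!def || def.type !== 'text') return null;
  const indexAnalyzer = def.analyzer || 'standard';
  return { index: indexAnalyzer, search: def.search_analyzer || indexAnalyzer };
}

console.log(analyzersFor(indexDefinition, 'name'));
console.log(analyzersFor(indexDefinition, 'desc'));
```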

dajinchu commented 5 years ago

A shoddy little profiler I wrote shows that when sending requests 1 at a time, each request takes 1ms... when sending 2000 requests at once (not sure if they actually happen simultaneously - not the best with promises) it takes around 30ms. It appears that Elasticsearch on my computer is running out of memory and garbage collecting, which, if you run the profiler repeatedly, causes latency to grow continuously as Elasticsearch starts falling behind. We might need to figure out how to reduce Elasticsearch memory usage or run it on a beefier server. There are also production environment settings that might help.

Edit: Never mind. After installing Elasticsearch on Windows instead of running it through Docker, it's working a lot better. Memory usage fluctuates around 200-400 MB. Sending 2000 requests at once takes 1-5ms each, and this is highly dependent on caching: running the profiler twice in a row results in 0-1ms responses the second time. Also probably worth noting that this measures the time for Elasticsearch to perform the query, ignoring the latency of serializing JSON and sending the HTTP response.
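
A profiler along these lines can be sketched as follows. This is not the profiler described above, just an illustration of the two modes (one request at a time versus all at once). It assumes the timing comes from the `took` field of each search response, which reports the milliseconds Elasticsearch spent executing the query, and it uses a stand-in `fakeSearch` function instead of a real client.

```javascript
// Summarize the `took` values (ms spent inside Elasticsearch) of a batch.
function summarizeTook(responses) {
  const took = responses.map((r) => r.took);
  const total = took.reduce((sum, ms) => sum + ms, 0);
  return {
    count: took.length,
    min: Math.min(...took),
    max: Math.max(...took),
    mean: total / took.length,
  };
}

// One at a time: wait for each response before sending the next query.
async function runSequential(search, queries) {
  const responses = [];
  for (const query of queries) {
    responses.push(await search(query));
  }
  return summarizeTook(responses);
}

// All at once: every request is started before any response is awaited.
async function runConcurrent(search, queries) {
  const responses = await Promise.all(queries.map((query) => search(query)));
  return summarizeTook(responses);
}

// Stand-in for a real client call. Swap in a function that resolves to the
// search response body; the `took` values here are made up for illustration.
const fakeSearch = async (query) => ({ took: query.length, hits: { hits: [] } });

runConcurrent(fakeSearch, ['cs', 'algo', 'fundies']).then((stats) => console.log(stats));
```

With `Promise.all`, all requests are in flight together, so the server-side numbers reflect contention rather than single-query latency.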

ryanhugh commented 5 years ago

Super cool! Awesome - that sounds great. Not surprised that it works better outside of docker than inside

ryanhugh commented 5 years ago

Just to clarify - the following operations are done on the current database (ElasticLunr + DataLib) and will need to be supported by any DB we move to:

- Searching
  - Needs to be fast
  - Support caching
  - Different tokenization on different fields
  - Filtering #1
  - Spellcheck #9
  - Able to manage multiple search indexes
  - Scalable - just in case we want to scale
  - Support some scoring method
  - Be able to merge results from multiple indexes into one result list (eg employees + classes -> one list of results)
  - And all for a reasonable $$ cost.
- Lookup all subjects in a host+termId combination. Eg, lookup all subjects in neu.edu/201910. This operation does not need to be fast and only runs once when the server starts.
- Lookup all classes in a host+termId+subject combination. Eg, lookup all classes in neu.edu/201910/CS. This operation runs if a user types in "CS" and all the classes in that subject come up, in order.
- Get and set individual classes and sections. The updater runs once every 5 minutes, pulls individual classes (~<50), re-scrapes them, and then updates the entries in the DB. This does not need to update the search index, just the DB.
- API: Doesn't need to be complex or fast at all - I think just a few endpoints to "get everything" like we have now is more than enough. link

Random other notes
- Because NEU has all the data we have too, we don't have to make backups of the data or be super careful about our data - if anything goes wrong and we lose the data, we can just re-scrape it and get everything back.
- Also, there is some user data in Firebase. Firebase is working great for user data and I see no reason we have to change that. Costs $0.
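
The lookup operations listed above map fairly directly onto Elasticsearch request bodies. A sketch, assuming each class document has `host`, `termId`, `subject`, and `classId` fields mapped as `keyword` (these names and the ID scheme are assumptions, not the project's actual schema):

```javascript
// All subjects in a host+termId: filter, then bucket by subject.
function subjectsInTerm(host, termId) {
  return {
    size: 0, // no documents needed, only the aggregation buckets
    query: { bool: { filter: [{ term: { host } }, { term: { termId } }] } },
    aggs: { subjects: { terms: { field: 'subject', size: 1000 } } },
  };
}

// All classes in a host+termId+subject, in classId order.
function classesInSubject(host, termId, subject) {
  return {
    query: {
      bool: {
        filter: [{ term: { host } }, { term: { termId } }, { term: { subject } }],
      },
    },
    sort: [{ classId: 'asc' }],
  };
}

// Hypothetical document ID for getting and setting a single class.
// An ID containing slashes must be URL-encoded if placed in a REST path.
function classDocId(host, termId, subject, classId) {
  return [host, termId, subject, classId].join('/');
}

console.log(JSON.stringify(subjectsInTerm('neu.edu', '201910')));
console.log(JSON.stringify(classesInSubject('neu.edu', '201910', 'CS')));
console.log(classDocId('neu.edu', '201910', 'CS', '2500'));
```

A stable document ID lets the updater overwrite a re-scraped class in place instead of searching for it first.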

dajinchu commented 5 years ago

Sweet, thanks for listing this all out. I am thinking the best way to incorporate elasticsearch into the current code base is to just re-implement all of the DataLib functions using elasticsearch queries.

As far as multiple search indexes go, we can also just have 1 index and use filters, which is what is being done now. As far as I can tell, there's no significant performance difference, thanks to the inverted index.
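
As a sketch of the single-index approach: the text query is scored as usual, and the semester is applied as a non-scoring filter. The field names (`name`, `subject`, `termId`) are assumptions made for illustration.

```javascript
// Search one shared index, restricted to a single semester.
// Clauses under `filter` do not affect relevance scores, and Elasticsearch
// can cache them across queries.
function searchInTerm(text, termId) {
  return {
    query: {
      bool: {
        must: { multi_match: { query: text, fields: ['name', 'subject'] } },
        filter: { term: { termId } },
      },
    },
  };
}

console.log(JSON.stringify(searchInTerm('algorithms', '201910'), null, 2));
```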

All in all, I think this is all very doable using Elasticsearch as the single source of truth. However, at some scale it might be more performant to only put the fields that we search over in Elasticsearch, and the rest of the detailed data in a separate database like MongoDB, but I really doubt we have enough data to cause performance issues that would warrant the additional complexity.

ryanhugh commented 5 years ago

Exactly what I'm thinking - a route that is looking solid to me is just put all the data in elasticsearch and only index some fields. Elasticsearch can support many different indexes, so we just make a different one for each semester (same thing we are doing now). Also, I'm going to add more info to that comment, one sec

ryanhugh commented 5 years ago

As far as multiple search indexes go, we can also just have 1 index and use filters, which is what is being done now. As far as I can tell, there's no significant performance difference, thanks to the inverted index.

They might have a performance impact? I PM'ed ya about this but it might be better to keep the model of 1 index per semester rather than merge them all into 1 big one

ryanhugh commented 4 years ago

Just pushed to production!!!!! The backend is now officially Elasticsearch thanks to some great work by DJ!
