mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

Upgrade Solr to 6.x, address OOM problems #24

Closed pypt closed 8 years ago

pypt commented 8 years ago

We're currently running Solr 4.6.0, so before doing any performance improvements we should upgrade to Solr 6 (6.0.1 currently).

It is recommended to upgrade across major Solr versions incrementally, i.e. version 4 -> version 5 -> version 6, not version 4 -> version 6 directly.
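The one-major-version-at-a-time rule can be written down as a quick sanity check over a planned upgrade path. This is just an illustration, not project code; the version numbers are the ones from this thread:

```python
# Sketch: verify that a planned Solr upgrade path never skips a major version.
def major(version):
    """Return the major component of a dotted version string, e.g. '6.0.1' -> 6."""
    return int(version.split(".")[0])

def hops_are_incremental(path):
    """True if every hop in the path moves at most one major version forward."""
    return all(major(b) - major(a) <= 1 for a, b in zip(path, path[1:]))

planned = ["4.6.0", "4.10.4", "5.5.1", "6.0.1"]
print(hops_are_incremental(planned))             # the planned path is fine
print(hops_are_incremental(["4.6.0", "6.0.1"]))  # a direct 4 -> 6 jump is not
```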

Loose checklist:

  1. Install Solr using Gradle (#23) for easier upgrades.
  2. Upgrade 4.6.0 to 4.10.4 in solr_gradle/4.10.4 branch

    Should be pretty safe and easy, basically a version number change in Gradle's config file.

  3. Upgrade 4.10.4 to 5.5.1 in solr_gradle/5.5.1 branch

    Will need more extensive testing and some time with the changelog.

  4. Upgrade 5.5.1 to 6.0.1 in solr_gradle/6.0.1 branch

    Ditto with more extensive testing.

  5. After testing the whole 4.6.0 -> 4.10.4 -> 5.5.1 -> 6.0.1 upgrade path, merge branches one by one to release and deploy to mcquery*.
hroberts commented 8 years ago

I'm up for transitioning to gradle as part of the process. I'd love not to have solr sitting in our repo.

I wonder whether, for any kind of upgrade, we should just install the new version, integrate it with our codebase, and manually move the new config over?

For the data files, we should definitely be planning on just regenerating the index. One complication is that we don't have the room now on our solr disks to generate a new index while the old one is running. So I think we will have to wait for the new disks to be in place before doing the production migration.

-hal


pypt commented 8 years ago

I wonder whether, for any kind of upgrade, we should just install the new version, integrate it with our codebase, and manually move the new config over?

Isn't this what I'm planning to do, in incremental steps?

For the data files, we should definitely be planning on just regenerating the index. One complication is that we don't have the room now on our solr disks to generate a new index while the old one is running. So I think we will have to wait for the new disks to be in place before doing the production migration.

Can we regenerate the index after the migration?

hroberts commented 8 years ago

If you are just going to diff the config files and manually write the new config files from the diffs, I don't see why you would do incremental updates. I was assuming that the incremental updates were for the sake of the data files.

Yes, we can regenerate the index as part of the migration. Even if we already have the csv files dumped (the first half of the import process), it still takes a couple of days to generate the index from the csv files. If we wait until the disks are less than half full, we can generate the new index while the old one is still running and have only a few minutes downtime.
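The "less than half full" rule of thumb above amounts to a simple check: the new index needs roughly as much space as the current one while both exist side by side. A sketch with illustrative numbers, not our real disk sizes:

```python
# Sketch: can a second, same-sized index be built alongside the running one?
def can_rebuild_in_place(disk_total_gb, index_size_gb, safety_margin=0.05):
    """True if a second index of the same size fits with some headroom to spare."""
    free_gb = disk_total_gb - index_size_gb
    return free_gb >= index_size_gb * (1 + safety_margin)

print(can_rebuild_in_place(disk_total_gb=2000, index_size_gb=800))   # fits
print(can_rebuild_in_place(disk_total_gb=2000, index_size_gb=1200))  # does not fit
```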

pypt commented 8 years ago

Now that the Solr code repo cleanup (#23) is done (but not yet deployed), I'm slowly working on the upgrade branches so that I can do the upgrade early on Friday morning too.

So far I've got Solr 5.0.0 to start and run fine. Solr 5.5.2 has some undocumented ZooKeeper changes that I'm still trying to figure out. I hope to get to 6.1.0 before Friday.

As for the configuration changes, I'm merging the newer Solr version's configuration into the older version's config and cherry-picking the changes from the new config, e.g. see 143ae6c.

pypt commented 8 years ago

Well, that was quite a journey, but I managed to implement and test out the 4.6 -> 4.10 -> 5.0 -> 5.5 -> 6.0 -> 6.1 upgrade:

8 nodes on Solr 6.1.0

Solr version support is implemented in their respective branches:

Done:

Configuration changes:

Loose upgrade plan:

Should we do this in steps, or can I just shut down Solr for half a day or so? Upgrading the indexes might be the lengthy part of this plan; I'm not sure I'd manage to do the upgrade in an hour or so.
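One way to budget the lengthy index-upgrade step: Lucene ships an IndexUpgrader tool that rewrites an index into the current version's on-disk format, and since Lucene can only read indexes one major version back, a 4.x index needs one upgrade pass per major hop. A sketch that just prints the invocations; the jar names and index path are hypothetical placeholders, not our actual layout:

```python
# Sketch: build the per-major-version IndexUpgrader invocations.
index_dir = "/var/lib/solr/collection1/data/index"  # hypothetical path
hops = ["5.5.2", "6.1.0"]  # one IndexUpgrader pass per major version

commands = [
    f"java -cp lucene-core-{version}.jar"
    f" org.apache.lucene.index.IndexUpgrader -delete-prior-commits {index_dir}"
    for version in hops
]
for command in commands:
    print(command)
```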

pypt commented 8 years ago

I tried to do the upgrade on live data. Suffice it to say that it didn't work, basically due to OOM errors, which were even more prominent with Solr 6.1.0.

I'm currently uploading a dump from Faith to S3 in order to replicate the issue on EC2 machines.

pypt commented 8 years ago

I've upgraded a test dump to Solr 6.1.0 and created three 600 GB EBS volumes (plus snapshots for backup) for testing our Solr setup on AWS.

I'm about to create three test EC2 instances resembling mcquery* and try to replicate the situations leading to OOM. I'm deciding between m4.10xlarge (40 vCPU, 160 GB RAM, $2.394/hour) and r3.4xlarge (16 vCPU, 122 GB RAM, $1.33/hour).
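For a quick sense of what the two options cost for a three-node test cluster (using the on-demand prices quoted above):

```python
# Quick cost comparison for the two candidate instance types.
def cluster_cost(hourly_rate, nodes=3, hours=24):
    """Dollar cost of running `nodes` instances for `hours` hours."""
    return nodes * hourly_rate * hours

m4_daily = cluster_cost(2.394)  # m4.10xlarge: 40 vCPU, 160 GB RAM
r3_daily = cluster_cost(1.33)   # r3.4xlarge: 16 vCPU, 122 GB RAM
print(f"m4.10xlarge x3: ${m4_daily:.2f}/day, r3.4xlarge x3: ${r3_daily:.2f}/day")
```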

pypt commented 8 years ago

I've set up a test Solr deployment on EC2. The three m4.10xlarge machines cost $2.39/hour each, so they're stopped now; please start them manually if you want to have a look yourself. Your SSH key is set up there too. To access Solr, create an SSH tunnel to mediacloud-solr-6-mcquery2:7981.

Some superficial findings:

pypt commented 8 years ago

Why is it so slow on Solr 6 specifically? I'll try running Solr 6.1.0 with 4.6.0's schema to make sure I didn't mess up the configuration when upgrading.

Some good news: the 4.6.0 configuration seems to work fine on 6.1.0 without any slowdown or OOM problems, so some config property that changed between 4.6.0 and 6.1.0 is probably the culprit.

Working on figuring out which one it could be.

hroberts commented 8 years ago

Here are the fields we're keeping in the index, with comments for each:

pypt commented 8 years ago

I went through the full config again, and apparently I had accidentally set docValues="true" (confirmed in a7dd72c, reverted in 85404e1) while doing the 4.6.0 -> 4.10.4 -> 5.0.0 -> 5.5.2 -> 6.0.0 -> 6.1.0 migration (it was a proposed default setting in one of Solr's example schemas).

I'm not quite sure what Solr was doing with the new argument on all fieldTypes, but the documentation says that "If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues", so it was probably doing that.
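For reference, docValues is toggled per field (or per fieldType) in schema.xml. A hypothetical field definition, not Media Cloud's actual schema:

```xml
<!-- Hypothetical example, not our actual schema.xml. Flipping docValues on a
     field whose data is already indexed requires a full re-index before
     queries against that field behave correctly. -->
<field name="publish_date" type="tdate" indexed="true" stored="true"
       docValues="false" />
```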

So, now Solr 6.1.0 seems to be running just fine with 2 TB of data and barely uses any memory at all. API calls tested with live data:

Here are the fields we're keeping in the index, with comments for each:

Thanks, I have moved your comments into schema.xml. Let's move schema cleanup into a separate task.