mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

Upgrade Solr to 6.x, address OOM problems #24

Closed pypt closed 8 years ago

pypt commented 8 years ago

We're currently running Solr 4.6.0, so before doing any performance improvements we should upgrade to Solr 6 (6.0.1 currently).

It is recommended to upgrade across major Solr versions incrementally, i.e. version 4 -> version 5 -> version 6, not version 4 -> version 6 directly.
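The one-major-version-at-a-time rule can be written down as a quick sanity check over a planned upgrade path. This is just an illustration, not project code; the version numbers are the ones from this thread:

```python
# Sketch: verify that a planned Solr upgrade path never skips a major version.
def major(version):
    """Return the major component of a dotted version string, e.g. '6.0.1' -> 6."""
    return int(version.split(".")[0])

def hops_are_incremental(path):
    """True if every hop in the path moves at most one major version forward."""
    return all(major(b) - major(a) <= 1 for a, b in zip(path, path[1:]))

planned = ["4.6.0", "4.10.4", "5.5.1", "6.0.1"]
print(hops_are_incremental(planned))             # the planned path is fine
print(hops_are_incremental(["4.6.0", "6.0.1"]))  # a direct 4 -> 6 jump is not
```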

Loose checklist:

  1. Install Solr using Gradle (#23) for easier upgrades.
  2. Upgrade 4.6.0 to 4.10.4 in solr_gradle/4.10.4 branch

    Should be pretty safe and easy, basically a version number change in Gradle's config file.

  3. Upgrade 4.10.4 to 5.5.1 in solr_gradle/5.5.1 branch

    Will need more extensive testing and some time with the changelog.

  4. Upgrade 5.5.1 to 6.0.1 in solr_gradle/6.0.1 branch

    Ditto with more extensive testing.

  5. After testing the whole 4.6.0 -> 4.10.4 -> 5.5.1 -> 6.0.1 upgrade path, merge branches one by one to release and deploy to mcquery*.
hroberts commented 8 years ago

I'm up for transitioning to gradle as part of the process. I'd love not to have solr sitting in our repo.

I wonder whether, for any kind of upgrade, we should just install the new version, integrate it with our codebase, and manually move the new config over?

For the data files, we should definitely be planning on just regenerating the index. One complication is that we don't have the room now on our solr disks to generate a new index while the old one is running. So I think we will have to wait for the new disks to be in place before doing the production migration.

-hal


pypt commented 8 years ago

I wonder whether, for any kind of upgrade, we should just install the new version, integrate it with our codebase, and manually move the new config over?

Isn't this what I'm planning to do, in incremental steps?

For the data files, we should definitely be planning on just regenerating the index. One complication is that we don't have the room now on our solr disks to generate a new index while the old one is running. So I think we will have to wait for the new disks to be in place before doing the production migration.

Can we regenerate the index after the migration?

hroberts commented 8 years ago

If you are just going to diff the config files and manually write the new config files from the diffs, I don't see why you would do incremental updates. I was assuming that the incremental updates were for the sake of the data files.

Yes, we can regenerate the index as part of the migration. Even if we already have the csv files dumped (the first half of the import process), it still takes a couple of days to generate the index from the csv files. If we wait until the disks are less than half full, we can generate the new index while the old one is still running and have only a few minutes downtime.
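The "less than half full" rule of thumb above amounts to a simple check: the new index needs roughly as much space as the current one while both exist side by side. A sketch with illustrative numbers, not our real disk sizes:

```python
# Sketch: can a second, same-sized index be built alongside the running one?
def can_rebuild_in_place(disk_total_gb, index_size_gb, safety_margin=0.05):
    """True if a second index of the same size fits with some headroom to spare."""
    free_gb = disk_total_gb - index_size_gb
    return free_gb >= index_size_gb * (1 + safety_margin)

print(can_rebuild_in_place(disk_total_gb=2000, index_size_gb=800))   # fits
print(can_rebuild_in_place(disk_total_gb=2000, index_size_gb=1200))  # does not fit
```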

pypt commented 8 years ago

Now that the Solr code repo cleanup (#23) is done (but not yet deployed), I'm slowly working on the upgrade branches so that I can do the upgrade early on Friday morning too.

So far I've got Solr 5.0.0 to start and run fine. Solr 5.5.2 has some undocumented ZooKeeper changes that I'm still trying to figure out. I hope to get to 6.1.0 before Friday.

As for the configuration changes, I'm merging the newer Solr version's configuration into the older version's config and cherry-picking the changes from the new config, e.g. see 143ae6c.

pypt commented 8 years ago

Well, that was quite a journey, but I managed to implement and test out the 4.6 -> 4.10 -> 5.0 -> 5.5 -> 6.0 -> 6.1 upgrade:

8 nodes on Solr 6.1.0

Solr version support is implemented in their respective branches:

Done:

Configuration changes:

Loose upgrade plan:

Should we do this in steps, or can I just shut down Solr for half a day or so? Upgrading the indexes might be the lengthy part of this plan; I'm not sure I'd manage to do the upgrade in an hour or so.
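One way to budget the lengthy index-upgrade step: Lucene ships an IndexUpgrader tool that rewrites an index into the current version's on-disk format, and since Lucene can only read indexes one major version back, a 4.x index needs one upgrade pass per major hop. A sketch that just prints the invocations; the jar names and index path are hypothetical placeholders, not our actual layout:

```python
# Sketch: build the per-major-version IndexUpgrader invocations.
index_dir = "/var/lib/solr/collection1/data/index"  # hypothetical path
hops = ["5.5.2", "6.1.0"]  # one IndexUpgrader pass per major version

commands = [
    f"java -cp lucene-core-{version}.jar"
    f" org.apache.lucene.index.IndexUpgrader -delete-prior-commits {index_dir}"
    for version in hops
]
for command in commands:
    print(command)
```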

pypt commented 8 years ago

I tried to do the upgrade on live data. Suffice it to say that it didn't work, basically due to OOM errors, which were even more prominent with Solr 6.1.0.

I'm currently uploading a dump from Faith to S3 in order to replicate the issue on EC2 machines.

pypt commented 8 years ago

I've upgraded a test dump to Solr 6.1.0 and created three 600 GB EBS volumes (plus snapshots for backup) for testing our Solr setup on AWS.

I'm about to create three test EC2 instances resembling mcquery* and try to replicate the situations leading to OOM. I'm deciding between m4.10xlarge (40 vCPU, 160 GB RAM, $2.394/hour) and r3.4xlarge (16 vCPU, 122 GB RAM, $1.33/hour).
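For a quick sense of what the two options cost for a three-node test cluster (using the on-demand prices quoted above):

```python
# Quick cost comparison for the two candidate instance types.
def cluster_cost(hourly_rate, nodes=3, hours=24):
    """Dollar cost of running `nodes` instances for `hours` hours."""
    return nodes * hourly_rate * hours

m4_daily = cluster_cost(2.394)  # m4.10xlarge: 40 vCPU, 160 GB RAM
r3_daily = cluster_cost(1.33)   # r3.4xlarge: 16 vCPU, 122 GB RAM
print(f"m4.10xlarge x3: ${m4_daily:.2f}/day, r3.4xlarge x3: ${r3_daily:.2f}/day")
```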

pypt commented 8 years ago

I've set up a test Solr deployment on EC2. The three m4.10xlarge machines cost $2.39/hour each, so they're stopped now; please start them manually if you want to have a look yourself. Your SSH key is set up there too. To access Solr, create an SSH tunnel to mediacloud-solr-6-mcquery2:7981.

Some superficial findings:

pypt commented 8 years ago

Why is it so slow on Solr 6 specifically? I'll try running Solr 6.1.0 with 4.6.0's schema to make sure I didn't mess up the configuration when upgrading.

Some good news: the 4.6.0 configuration seems to work fine on 6.1.0 without any slowdown or OOM problems, so some config property that changed between 4.6.0 and 6.1.0 is probably the culprit.

Working on figuring out which one it could be.

hroberts commented 8 years ago

Here are the fields we're keeping in the index, with comments for each:

pypt commented 8 years ago

I went through the full config again, and apparently I had accidentally set docValues="true" (confirmed in a7dd72c, reverted in 85404e1) while doing the 4.6.0 -> 4.10.4 -> 5.0.0 -> 5.5.2 -> 6.0.0 -> 6.1.0 migration (it was a proposed default setting in one of Solr's example schemas).

I'm not quite sure what Solr was doing with the new argument on all fieldTypes, but the documentation says that "If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues", so it was probably doing that.
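For reference, docValues is toggled per field (or per fieldType) in schema.xml. A hypothetical field definition, not Media Cloud's actual schema:

```xml
<!-- Hypothetical example, not our actual schema.xml. Flipping docValues on a
     field whose data is already indexed requires a full re-index before
     queries against that field behave correctly. -->
<field name="publish_date" type="tdate" indexed="true" stored="true"
       docValues="false" />
```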

So, now Solr 6.1.0 seems to be running just fine with 2 TB of data and barely uses any memory at all. API calls tested with live data:

Here are the fields we're keeping in the index, with comments for each:

Thanks, I have moved your comments into schema.xml. Let's move schema cleanup into a separate task.