Closed: pypt closed this issue 8 years ago.
I'm up for transitioning to Gradle as part of the process. I'd love not to have Solr sitting in our repo.
I wonder whether, for any kind of upgrade, we should just be installing the new version, integrating that with our codebase, and manually moving over the new config?
For the data files, we should definitely be planning on just regenerating the index. One complication is that we don't have the room now on our solr disks to generate a new index while the old one is running. So I think we will have to wait for the new disks to be in place before doing the production migration.
-hal
On 6/14/16 7:50 PM, Linas Valiukas wrote:
We're currently running Solr 4.6.0, so before doing any performance improvements we should upgrade to Solr 6 (6.0.1 currently).
It is recommended to upgrade between major Solr versions by doing the upgrade incrementally, i.e. version 4 -> version 5 -> version 6, not version 4 -> version 6.
Loose checklist:
1. Install Solr using Gradle (#23 <https://github.com/berkmancenter/mediacloud/issues/23>) for easier upgrades.
2. Upgrade 4.6.0 to 4.10.4 in the `solr_gradle/4.10.4` branch. Should be pretty safe and easy, basically a version number change in Gradle's config file.
3. Upgrade 4.10.4 to 5.5.1 in the `solr_gradle/5.5.1` branch. Will need more extensive testing and some time with the changelog.
4. Upgrade 5.5.1 to 6.0.1 in the `solr_gradle/6.0.1` branch. Ditto with more extensive testing.
5. After testing the whole 4.6.0 -> 4.10.4 -> 5.5.1 -> 6.0.1 upgrade path, merge the branches one by one to `release` and deploy to `mcquery*`.
> I wonder whether, for any kind of upgrade, we should just be installing the new version, integrating that with our codebase, and manually moving over the new config?
Isn't this what I'm planning to do, in incremental steps?
> For the data files, we should definitely be planning on just regenerating the index. One complication is that we don't have the room now on our solr disks to generate a new index while the old one is running. So I think we will have to wait for the new disks to be in place before doing the production migration.
Can we regenerate the index after the migration?
If you are just going to diff the config files and manually write the new config files from the diffs, I don't see why you would do incremental updates. I was assuming that the incremental updates were for the sake of the data files.
Yes, we can regenerate the index as part of the migration. Even if we already have the CSV files dumped (the first half of the import process), it still takes a couple of days to generate the index from the CSV files. If we wait until the disks are less than half full, we can generate the new index while the old one is still running and have only a few minutes of downtime.
Now that the Solr code repo cleanup (#23) is done (but not yet deployed), I'm slowly working on the upgrade branches so that I'm able to do the upgrade early on Friday morning too.
So far I've got Solr 5.0.0 to start and run fine. Solr 5.5.2 has some undocumented ZooKeeper changes that I'm trying to figure out. I hope I'll manage to get to 6.1.0 before Friday.
As for the configuration changes, I'm just merging the newer Solr version's configuration into the older version's config and cherry-picking the changes from the new config, e.g. see 143ae6c.
Well, that was quite a journey, but I managed to implement and test the 4.6 -> 4.10 -> 5.0 -> 5.5 -> 6.0 -> 6.1 upgrade.

Solr version support is implemented in the respective branches:

- `solr_install_externally` - Solr 4.6.0 (our current version, installed and set up automatically; see #23)
- `solr_4.10.4` - Solr 4.10.4 (upgrade to the last Solr 4 version)
- `solr_5.0.0` - Solr 5.0.0 (upgrade to the first Solr 5 version)
- `solr_5.5.2` - Solr 5.5.2 (upgrade to the last Solr 5 version)
- `solr_6.0.0` - Solr 6.0.0 (upgrade to the first Solr 6 version)
- `solr_6.1.0` - Solr 6.1.0 (upgrade to the latest Solr)
Done:

Configuration changes:

- Removed `<maxIndexingThreads>` because it's missing in the Solr 5.5 sample configuration file (see the snippet below)
- `bin/solr`
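For context, a minimal sketch of the element that was dropped, as it appears in a Solr 4.x `solrconfig.xml` (the value shown is the 4.x default, not necessarily our production setting):

```xml
<!-- solrconfig.xml (Solr 4.x), illustrative: the element removed during the upgrade -->
<indexConfig>
  <!-- capped the number of concurrent indexing threads in Solr 4.x; -->
  <!-- absent from the Solr 5.5 sample config, so it was dropped here too -->
  <maxIndexingThreads>8</maxIndexingThreads>
</indexConfig>
```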
Loose upgrade plan:

- On `mcquery*`, update `mediawords.yml` to point to `mcquery2:7981` (shard 1) instead of `:8983` (which is now reserved for standalone Solr instances used for development)
- `solr_install_externally`: verify that it (still) works
- `solr_4.10.4`: run shards, verify that it still works
- `solr_5.0.0`: upgrade Lucene indexes, run shards, verify that it still works
- `solr_5.5.2`: run shards, verify that it still works
- `solr_6.0.0`: upgrade Lucene indexes, run shards, verify that it still works
- `solr_6.1.0`: upgrade Lucene indexes (just in case), run shards, verify that it still works

Should we do this in steps, or can I just shut down Solr for half a day or so? Upgrading the indexes might be the lengthy part of this plan; I'm not sure I'd manage to do the upgrade in an hour or so.
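One related detail worth noting (an assumption on my part about the per-branch config, not something verified above): each branch's `solrconfig.xml` carries a `luceneMatchVersion` that should track the Solr version being deployed, roughly like this:

```xml
<!-- solrconfig.xml, illustrative: assumes each upgrade branch bumps this to its own Solr version, -->
<!-- e.g. 4.10.4 -> 5.0.0 -> 5.5.2 -> 6.0.0 -> 6.1.0 as the branches progress -->
<config>
  <luceneMatchVersion>6.1.0</luceneMatchVersion>
  <!-- ... rest of solrconfig.xml unchanged ... -->
</config>
```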
Tried to do the upgrade on live data. Suffice it to say that it didn't work, basically due to OOM errors, which were even more prominent with Solr 6.1.0.
Currently uploading a dump from Faith to S3 in order to replicate the issue on EC2 machines.
Upgraded test dump to Solr 6.1.0, created three 600 GB EBS volumes (+snapshots for backup) for testing our Solr setup on AWS.
About to create three test EC2 instances resembling `mcquery*` and try to replicate situations leading to OOM. Deciding between `m4.10xlarge` (40 vCPU, 160 GB RAM, $2.394/hour) and `r3.4xlarge` (16 vCPU, 122 GB RAM, $1.33/hour).
I've set up a test Solr deployment on EC2. The three `m4.10xlarge` machines cost $2.39/hour each, so they're stopped now; please start them manually if you want to have a look yourself. Your SSH key is set up there too. To access Solr, create an SSH tunnel to `mediacloud-solr-6-mcquery2:7981`.
Some superficial findings:

- No `OutOfMemory` exceptions on the EC2 shards, even though those machines have even less RAM. How come the shards run out of memory on production but not on the EC2 instances? It can't be just load. Did you do any undocumented configuration changes on `mcquery*` by any chance?
- Tried adjusting ZooKeeper's `tickTime` (9105610), but that didn't help. I suspect this is related to the FieldCache being rebuilt.
- A cache (`fieldValueCache`) seems to be getting filled up.
- There are `newSearcher` and `firstSearcher` properties (see the solrconfig.xml sketch below), but I haven't yet looked into them.
- Could we keep just `story_sentences_id`, remove (most of) the fields that could be recovered from the sentence's ID (reading the rest of the data from PostgreSQL), and thus reduce the memory footprint tenfold?

Why is it so slow on Solr 6 specifically? I'll try running Solr 6.1.0 with 4.6.0's schema to make sure I didn't mess up the configuration when upgrading.
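For reference, `newSearcher` and `firstSearcher` are event listeners configured in `solrconfig.xml` that run warming queries when a new searcher is opened or when Solr starts up; a minimal illustrative sketch (the query values are placeholders, not taken from our config):

```xml
<!-- solrconfig.xml, illustrative: warming listeners; the query values below are placeholders -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- runs whenever a new searcher is opened (e.g. after a commit) to pre-warm caches -->
    <lst><str name="q">*:*</str><str name="sort">story_sentences_id asc</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- runs once when Solr starts and opens its first searcher -->
    <lst><str name="q">*:*</str></lst>
  </arr>
</listener>
```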
Some good news: the 4.6.0 configuration seems to work fine on 6.1.0 without any slowdown or OOM problems, so some new config property introduced between 4.6.0 and 6.1.0 is probably the culprit.
Working on figuring out which one it could be.
Here are the fields we're keeping in the index, with comments for each:
I went through the full config again, and apparently I had accidentally set `docValues="true"` (confirmed in a7dd72c, reverted in 85404e1) while doing the 4.6.0 -> 4.10.4 -> 5.0.0 -> 5.5.2 -> 6.0.0 -> 6.1.0 migration (it was a default proposed setting in one of Solr's example schemas).
I'm not quite sure what Solr was doing with the new attribute on all `fieldType`s, but the documentation says that "If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues", so it was probably doing that.
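To illustrate (a generic example, not our exact field types): the attribute in question lives on `fieldType` definitions in `schema.xml`, and enabling it tells Solr to also build column-oriented (DocValues) structures for those types, which only work correctly after a full re-index:

```xml
<!-- schema.xml, illustrative: the kind of definition the setting slipped into (not our exact types) -->
<!-- with docValues="true", sorting/faceting data is stored column-wise on disk for this type -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>

<!-- the revert (85404e1) presumably just drops the attribute again, back to the 4.6.0-style definition -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
```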
So, now Solr 6.1.0 seems to be running just fine with 2 TB of data and pretty much doesn't use any memory at all. API calls tested with live data:

- `sentences/count` (required small changes in eccce95)
- `stories/count`
- `stories/list`
- `wc/list`
- `tags/list`
> Here are the fields we're keeping in the index, with comments for each:

Thanks, I have moved your comments into `schema.xml`. Let's move schema cleanup into a separate task.