karmi / retire

A rich Ruby API and DSL for the Elasticsearch search engine
http://karmi.github.com/retire/
MIT License

Using tire in unit tests and elasticsearch yellow/green statuses #537

Closed vsespb closed 11 years ago

vsespb commented 11 years ago

Do I understand correctly that it's impossible to use Tire in unit tests, and that before each search request we need to wait for cluster yellow/green status?

http://grokbase.com/t/gg/elasticsearch/12ag6njpyz/no-active-shards-in-tests-using-the-java-api

http://www.elasticsearch.org/guide/reference/api/admin-cluster-health.html

https://github.com/elasticsearch/elasticsearch/blob/a0d1b9b695beef84179e5b7391eaae3abffc3702/src/main/java/org/elasticsearch/rest/action/admin/cluster/health/RestClusterHealthAction.java#L56

https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/wPKUJXiG2mw

Is this a known issue?

karmi commented 11 years ago

I don't understand what you mean. In unit tests, you usually mock the Tire integration; in integration tests, you let everything bubble up and down, and calling Index#refresh is enough. There are some pages on the Wiki.
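
For illustration, a minimal sketch of both approaches, assuming a hypothetical Tire-backed Article model and the mocha/shoulda-context stack used elsewhere in this thread:

# Unit test: stub Tire so no HTTP request is made (mocha syntax; Article is hypothetical)
should "search articles" do
  Article.expects(:search).with('test').returns([])
  assert_equal [], Article.search('test')
end

# Integration test: create the record (indexed via Tire callbacks),
# make it searchable with a refresh, then search for real
should "find articles" do
  Article.create! :title => 'Test'
  Article.index.refresh
  assert_equal 1, Article.search('test').size
end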

vsespb commented 11 years ago

I've read the wiki. We are using Index#refresh, but it seems that is not enough.

Our problem is:

We use Tire in integration tests (we don't mock it). For at least the last half year we have had random test failures on different servers, on the CI server and on development machines. On some machines certain tests fail 100% of the time; on others the failures are random. We also see more failures when running tests in parallel with the parallel_test gem (index names are, of course, different for each parallel process).

There are two kinds of failures: a) HTTP 500 with a "no active shards" message; b) wrong items in the search results, for example we expect one item but get two with the same ID (and the same other fields).

It seems that adding sleep(1) after tire.index.refresh helps; at least I have never seen it fail with the sleep in place. The following code (instead of the sleep) also seems to help:

Tire::Configuration.client.get "#{Tire::Configuration.url}/_cluster/health?wait_for_status=yellow&timeout=1s&wait_for_active_shards=1"

I found some evidence that such health requests should indeed be used:

1) https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/wPKUJXiG2mw (the author of Elasticsearch answered in this thread)

2) http://grokbase.com/t/gg/elasticsearch/12ag6njpyz/no-active-shards-in-tests-using-the-java-api

karmi commented 11 years ago

Your problem is then related to running the tests in parallel, possibly even running into issues with multiple elasticsearch nodes forming a cluster in the CI environment?

Normally, a refresh is the only thing you need to do. Are you sure you're running a refresh in your test setups? Do the Tire integration tests run fine for you? Also, it may help to create the index with a single shard in the test environment.

If nothing helps in your case, we may need to add an API for waiting for specific cluster health (color, number of active shards, etc).
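
For example, a test setup could recreate the index explicitly with a single shard and no replicas; a sketch, assuming a hypothetical index named articles_test (Tire::Index#create passes the options hash as the index-creation payload):

Tire.index 'articles_test' do
  delete                                                    # drop any leftover index from a previous run
  create :settings => { :number_of_shards   => 1,           # single shard, no replicas: always yellow/green
                        :number_of_replicas => 0 }
end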

vsespb commented 11 years ago

The problem also happens when running tests in a single process, but rarely. We are using refresh. Failures appear even if refresh is called right before the search.

OK, I'll check the Tire integration tests, and check whether we have a single shard.

karmi commented 11 years ago

First, try running a single test repeatedly and see if you also run into the issue.

If you parallelize tests, you have no isolation between test runs, i.e. the setup from a second test can wipe the Elasticsearch data of the first test. You may mitigate the issue by using different index names per test, for instance when running the tests as part of the integration test suite of a Rails application.
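
One way to get that isolation in a Rails application is to derive the index name from the environment and from the parallel worker number; a sketch, where TEST_ENV_NUMBER is the variable set by the parallel test runner and Article is a hypothetical model:

class Article < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks

  # e.g. "articles_test", "articles_test2", "articles_test3", ...
  index_name "articles_#{Rails.env}#{ENV['TEST_ENV_NUMBER']}"
end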

vsespb commented 11 years ago

Yes, we definitely use different index names for each concurrent process. Well, OK, thanks for the info. I will check everything again. You can probably close the ticket; I will reopen it and come back with minimal proof-of-concept code in case nothing helps.

vsespb commented 11 years ago

Hm, it seems the latest Tire integration tests fail on my box (one of the tests fails randomly):

1)

[REQUEST FAILED] curl -X GET 'http://localhost:9200/active_record_articles/active_record_article/_search?size=10&pretty' -d '{
  "query": {
    "query_string": {
      "query": "[x"
    }
  },
  "size": 10
}'

2)

[REQUEST FAILED] curl -X GET 'http://localhost:9200/index-aliased/_search?pretty' -d '{
  "query": {
    "match_all": {

    }
  }
}'

3)

    ERROR (8:00:30.631) test: Percolator when percolating a document should return an array of matching query names. 
          undefined method `sort' for nil:NilClass
        @ test/integration/percolator_test.rb:66:in `block (3 levels) in <class:PercolatorIntegrationTest>'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'

     FAIL (8:00:35.473) test: Percolator when percolating a document should return an empty array when no query matches. 
          <[]> expected but was
          <nil>.
        @ test/integration/percolator_test.rb:61:in `block (3 levels) in <class:PercolatorIntegrationTest>'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'

4)

This test fails randomly. Sometimes it passes:

     PASS (8:00:26.640) test: Percolator when percolating a document should return an array of matching query names for specific percolated queries. 

and sometimes it fails:

FAIL (10:00:55.617) test: Percolator when percolating a document should return an array of matching query names for specific percolated queries. 
          <["weather"]> expected but was
          <nil>.
        @ test/integration/percolator_test.rb:71:in `block (3 levels) in <class:PercolatorIntegrationTest>'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /home/vse/.rvm/gems/ruby-1.9.3-p0@set1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'

elasticsearch version: "version" : { "number" : "0.19.9", "snapshot_build" : false },

Unit tests seem OK.

vsespb commented 11 years ago

However, no, unit tests have random failures too:

Tire::IndexTest
     PASS (0:00:00.013) test: Index analyze support should properly encode format parameter. 
     PASS (0:00:00.014) test: Index analyze support should properly encode parameters for analyzer. 
     PASS (0:00:00.014) test: Index analyze support should properly encode parameters for field. 
     PASS (0:00:00.014) test: Index analyze support should send text to the Analyze API. 
     PASS (0:00:00.014) test: Index mapping should create index with mapping. 
     PASS (0:00:00.015) test: Index mapping should return the mapping. 
[ERROR] 400 > {"error":"Failed to derive xcontent from org.elasticsearch.common.bytes.BytesArray@0"}, retrying (1)...
[ERROR] 400 > {"error":"Failed to derive xcontent from org.elasticsearch.common.bytes.BytesArray@0"}, retrying (2)...
[ERROR] 400 > {"error":"Failed to derive xcontent from org.elasticsearch.common.bytes.BytesArray@0"}, retrying (3)...
[ERROR] 400 > {"error":"Failed to derive xcontent from org.elasticsearch.common.bytes.BytesArray@0"}, retrying (4)...
[ERROR] 400 > {"error":"Failed to derive xcontent from org.elasticsearch.common.bytes.BytesArray@0"}, retrying (5)...
[ERROR] Too many exceptions occured, giving up. The HTTP response was: 400 > {"error":"Failed to derive xcontent from org.elasticsearch.common.bytes.BytesArray@0"}
400 : {"error":"MapperParsingException[mapping [model_with_incorrect_mapping]]; nested: MapperParsingException[No handler for type [boo] declared on field [title]]; ","status":400}
  * DEFERRED: Finders should raise error when document is not found. 
  * DEFERRED: Persistent model attribute methods should allow to set deeply nested attributes on initialization. 
  * DEFERRED: Persistent model attribute methods should allow to set deeply nested attributes on update. 
  * DEFERRED: Model::Search should not define destroyed? if class already implements it. 
karmi commented 11 years ago

Something must be weird in your environment... When I run:

for i in {1..15}; do ruby -I lib:test test/integration/percolator_test.rb; done

I get no "random errors"...

The [REQUEST FAILED] ... "query": "[x" is a legitimate case -- a test for an incorrect Lucene query.

The same with ModelWithIncorrectMapping.

vsespb commented 11 years ago

I added "sleep(1)" after

      context "when percolating a document" do                                                                                                                                                                                                                                 
        setup do                                                                                                                                                                                                                                                               
          @index.register_percolator_query('alert') { string 'warning' }                                                                                                                                                                                                       
          @index.register_percolator_query('gantz') { string '"y u no match"' }                                                                                                                                                                                                
          @index.register_percolator_query('weather', :tags => ['weather']) { string 'severe' }                                                                                                                                                                                
          Tire.index('_percolator').refresh     
sleep(1)                                                                                                                                                                                                                                                                       
        end 

That fixed my tests!

     PASS (0:00:04.309) test: Percolator when percolating a document should return an array of matching query names for specific percolated queries. 
     PASS (0:00:09.585) test: Percolator when percolating a document should return an array of matching query names. 
     PASS (0:00:14.885) test: Percolator when percolating a document should return an empty array when no query matches

That is exactly the same situation I have with my Rails app (note: on several different servers).

I tried with Ruby 1.9.3-p0 and 1.9.3-p194, and Elasticsearch versions 0.19.3 and 0.19.9 (always using RVM).

Elasticsearch setup: not sure, probably the default configuration.

Maybe you have an idea what this could be related to?

karmi commented 11 years ago

If sleep(1) solves it, either the refresh call is incorrect (done on a different index, etc), or there's something funky with the whole setup. Really, refresh is the only thing you need to do in tests... Again, I've run the percolator integration tests 15 times without a single error.

In your app you could have some concurrency stuff, etc., but the Tire tests should be stable.

I'm afraid that without knowledge of the whole system I can't be of much help. Try building an isolated environment without access to the outside world, e.g. with Vagrant, and running the Tire test suite inside it.

vsespb commented 11 years ago

So, I installed everything into an empty virtual machine: same failures, again fixed by sleep(1).

(Empty Ubuntu 12.04 LTS 64-bit, Desktop edition, in VirtualBox; the host OS is Ubuntu 10.04 64-bit.)

Below is my install script plus the new failure log.

# under root
apt-get install -y build-essential git libyaml-dev  libssl-dev libreadline5-dev zlib1g-dev libcurl4-openssl-dev libsqlite3-dev redis-server

cd /tmp
wget http://ftp.ruby-lang.org/pub/ruby/1.9/ruby-1.9.3-p327.tar.gz && tar xvzf ruby-1.9.3-p327.tar.gz && cd ruby-1.9.3-p327
./configure && make -j 4 && make install
# make test is ok too
gem update --system
gem update
gem install bundle

aptitude install openjdk-6-jre
aptitude install elasticsearch

# under user

cd ~
mkdir tiretest && cd tiretest
git clone https://github.com/karmi/tire.git
# revision e76fcf214f1fed648155593920794a79daa3dce5
cd tire
bundle install
bundle exec rake test:integration
Tire::IndexStoreIntegrationTest
     PASS (1:00:56.344) test: Removing documents from the index should remove document from the index. 
     PASS (1:00:57.390) test: Retrieving documents from the index should retrieve document from the index. 
     FAIL (1:00:58.490) test: Retrieving documents from the index should return nil when retrieving missing document. 
          Expected <Item id: nil, _type: nil, _index: nil, _version: nil> to be nil.
        @ test/integration/index_store_test.rb:105:in `block (2 levels) in <class:IndexStoreIntegrationTest>'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /usr/local/lib/ruby/gems/1.9.1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'

The unit tests run without failures.

Note: I used the default Elasticsearch and Redis configs. The Elasticsearch version is:

{
  "ok" : true,
  "status" : 200,
  "name" : "Beaubier, Jeanne-Marie",
  "version" : {
    "number" : "0.19.12",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search"
}

Java version:

OpenJDK Runtime Environment (IcedTea6 1.11.5) (6b24-1.11.5-0ubuntu1~12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Any ideas where to move next? Please advise.

vsespb commented 11 years ago

I found a similar Elasticsearch issue (though I can't reproduce it):

https://github.com/elasticsearch/elasticsearch/issues/698 ( https://gist.github.com/b5775ddf2160cf9c8dd7 )

(and maybe a few others; not 100% sure they are related)
https://github.com/elasticsearch/elasticsearch/issues/233
https://github.com/elasticsearch/elasticsearch/issues/2316
https://github.com/elasticsearch/elasticsearch/issues/2267
https://github.com/elasticsearch/elasticsearch/issues/699

karmi commented 11 years ago

Any ideas where to move next? Please advise.

How many shards and replicas is the index created with when the tests fail?

(In my case, I have ES set up with number_of_shards: 1 and number_of_replicas: 0 by default -- but that may not be your case.)

vsespb commented 11 years ago

I'm not sure how to get the number of shards for that index. Do you mean I need to add some request to the Tire integration test code that fetches that number?

I have default config:

 grep number_of_shards elasticsearch.yml 
# index.number_of_shards: 5
# index.number_of_shards: 1
# The "number_of_shards" is a one-time setting for an index.

 grep number_of_replicas elasticsearch.yml 
# index.number_of_replicas: 1
# index.number_of_replicas: 0
# The "number_of_replicas" can be increased or decreased anytime,

(i.e. commented out)

I think that means 5 shards and 1 replica:

# Set the number of shards (splits) of an index (5 by default):
# Set the number of replicas (additional copies) of an index (1 by default):

The health request returns:

curl http://localhost:9200/_cluster/health
{"cluster_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":6,"active_shards":6,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":5}
vsespb commented 11 years ago

I added this to the Tire test code:

          Tire.index('_percolator').refresh
          puts `curl http://localhost:9200/_cluster/health`

(instead of sleep)

So I get:

{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":36,"active_shards":36,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":41}
    ERROR (0:00:02.671) test: Percolator when percolating a document should return an array of matching query names. 
          undefined method `sort' for nil:NilClass
        @ test/integration/percolator_test.rb:69:in `block (3 levels) in <class:PercolatorIntegrationTest>'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /usr/local/lib/ruby/gems/1.9.1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'
{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":36,"active_shards":36,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":41}
     FAIL (0:00:01.544) test: Percolator when percolating a document should return an array of matching query names for specific percolated queries. 
          <["weather"]> expected but was
          <nil>.
        @ test/integration/percolator_test.rb:74:in `block (3 levels) in <class:PercolatorIntegrationTest>'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /usr/local/lib/ruby/gems/1.9.1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'

i.e. status is "red"

karmi commented 11 years ago

Yes, by default your index is created with 5 shards and 1 replica. A very good default, but poor for test purposes. Uncomment the settings under the "Note, that for development on a local machine ..." comment in elasticsearch.yml and see if it makes a difference.

karmi commented 11 years ago

{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":36,"active_shards":36,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":41}

Yeah, lots of unassigned shards (replicas), and some shards still initializing...

vsespb commented 11 years ago

Hm, I uncommented

index.number_of_shards: 1
index.number_of_replicas: 0

I stopped and started the service. ps aux shows:

115       3904  3.3  2.2 1948036 90824 ?       Sl   13:09   0:03 /usr/lib/jvm/java-6-openjdk-amd64//bin/java -Xms256m -Xmx1g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.pidfile=/var/run/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch -cp :/usr/share/elasticsearch/lib/elasticsearch-0.19.12.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/* -Des.default.config=/etc/elasticsearch/elasticsearch.yml -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.work=/tmp/elasticsearch -Des.default.path.conf=/etc/elasticsearch org.elasticsearch.bootstrap.ElasticSearch

(i.e. ran with /etc/elasticsearch/elasticsearch.yml)

grep number_of /etc/elasticsearch/elasticsearch.yml
# index.number_of_shards: 5
# index.number_of_replicas: 1
index.number_of_shards: 1
index.number_of_replicas: 0
# The "number_of_shards" is a one-time setting for an index.
# The "number_of_replicas" can be increased or decreased anytime,

However, I still see more than one shard:

curl http://localhost:9200/_cluster/health
{"cluster_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":6,"active_shards":6,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":5}

The first test started working; the other two still failed:

{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":12,"active_shards":12,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":5}
    ERROR (0:00:02.824) test: Percolator when percolating a document should return an array of matching query names. 
          undefined method `sort' for nil:NilClass
        @ test/integration/percolator_test.rb:70:in `block (3 levels) in <class:PercolatorIntegrationTest>'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /usr/local/lib/ruby/gems/1.9.1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'
karmi commented 11 years ago

{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":12,"active_shards":12,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":5}

You still have unassigned and initializing shards -- drop all indices after the restart (curl -X DELETE localhost:9200).

vsespb commented 11 years ago

I dropped the indices and restarted.

$ curl -X DELETE localhost:9200
{"ok":true,"acknowledged":true}

Health ok:

$ curl http://localhost:9200/_cluster/health
{"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":0,"active_shards":0,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0}

but the tests are not OK (and active_shards is still high):

{"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":8,"active_shards":8,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0}
     PASS (0:00:01.829) test: Percolator when percolating a document should return an array of matching query names for specific percolated queries. 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   222  100   222    0     0  16081      0 --:--:-- --:--:-- --:--:-- 27750
{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":7,"active_shards":7,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":0}
    ERROR (0:00:02.904) test: Percolator when percolating a document should return an array of matching query names. 
          undefined method `sort' for nil:NilClass
        @ test/integration/percolator_test.rb:70:in `block (3 levels) in <class:PercolatorIntegrationTest>'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /usr/local/lib/ruby/gems/1.9.1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   222  100   222    0     0  13607      0 --:--:-- --:--:-- --:--:-- 20181
{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":7,"active_shards":7,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":0}
     FAIL (0:00:04.018) test: Percolator when percolating a document should return an empty array when no query matches. 
          <[]> expected but was
          <nil>.
        @ test/integration/percolator_test.rb:65:in `block (3 levels) in <class:PercolatorIntegrationTest>'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /usr/local/lib/ruby/gems/1.9.1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'
karmi commented 11 years ago

Then you have to wait until the cluster is ready in your test setup, as you initially mentioned:

Tire::Configuration.client.get "#{Tire::Configuration.url}/_cluster/health?wait_for_status=yellow"
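
For convenience, that call can be wrapped into a small test helper and invoked after each refresh; a sketch, not part of Tire's public API:

def wait_for_cluster_status(status = 'yellow', timeout = '5s')
  # Blocks until the cluster reaches the requested status, or the timeout expires
  Tire::Configuration.client.get \
    "#{Tire::Configuration.url}/_cluster/health?wait_for_status=#{status}&timeout=#{timeout}"
end

# in a test setup:
#   @index.refresh
#   wait_for_cluster_status
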
vsespb commented 11 years ago

It helps a lot, but after I had already opened this ticket, we found that it does not actually help in 100% of cases. Now I have reproduced it: when I run the test suite I get far fewer errors, but here is one:

{"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":8,"active_shards":8,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0}
     FAIL (0:00:37.403) test: Bulk should extract the routing value from documents. 
          <1> expected but was
          <2>.
        @ test/integration/bulk_test.rb:34:in `block (2 levels) in <class:BulkIntegrationTest>'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `call'
          /usr/local/lib/ruby/gems/1.9.1/gems/shoulda-context-1.0.1/lib/shoulda/context/context.rb:398:in `block in create_test_from_should_hash'
          /usr/local/lib/ruby/gems/1.9.1/gems/mocha-0.10.5/lib/mocha/integration/mini_test/version_230_to_262.rb:28:in `run'

     PASS (0:00:37.983) test: Bulk should store a collection of documents and refresh the index. 
     PASS (0:00:41.498) test: Bulk should timeout when consistency factor is not met. 

(The status is green.) The bulk_test is modified as follows:

      should "extract the routing value from documents" do
        @index.bulk_store [ { id: '1', title: 'A', _routing: 'a'}, { id: '2', title: 'B', _routing: 'b'} ]
        @index.refresh
 puts `curl http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=1s`
karmi commented 11 years ago

should "extract the routing value from documents" do

Runs fine for me... Really don't know what's up with your environment/setup...

vsespb commented 11 years ago

OK, I understand. Several developers have tried to fix this, but with no results. I'll try to investigate more, but I have a few questions:

1) Do you see evidence that something is wrong with my Elasticsearch setup? For example, does the number of shards in the health request not match the config, etc.?

2) Are you a maintainer of Elasticsearch? This might be related to Elasticsearch, not Tire. Should I create a ticket on the Elasticsearch issues page?

3) Could you please give some information about your environment (which Linux distro, whether the timezone is UTC, the full Elasticsearch config used for tests, amd64 vs x86, Java version, SSD vs HDD)?

I now control at least 3 machines where this problem appears (AFAIK there are even more). On two of them I ran the Tire integration tests and they fail. One of them is an empty, newly installed virtual machine. Let me describe what is different and what is common in those machines' setups:

Different:
OS: Ubuntu 12.04, Ubuntu 10.04
Java: Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode), OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
Elasticsearch: 0.19.3, 0.19.9, 0.19.12
Redis: 2.4.14, 2.2.12
Ruby: 1.9.3p0, 1.9.3p194, 1.9.3p327
Linux kernel: 3.2.0-24-generic, 3.2.0-32-generic, 2.6.38-16-generic
CPU: i7-3930K, i7-2600
Memory: 4, 16 or 64 GB
Locale LANG: EN, RU
Timezone: MSK, CET
RVM: both RVM and non-RVM

Common:
Elasticsearch config: default
CPU: Sandy Bridge
Architecture: x64
Redis config: default
Linux distro: Ubuntu
Install method: Debian package
Timezone: not UTC
Disk: HDD, not SSD
Ruby: 1.9.3.x

Actually, the Elasticsearch config is now empty (all lines are comments), except these two:

index.number_of_shards: 1
index.number_of_replicas: 0

I am going to install Elasticsearch without the .deb package; maybe that helps (maybe there are some default values baked into the Debian package).

Also let me describe the nature of failure frequencies:

1) If a test starts failing, it keeps failing in 99% or 100% of cases until we "fix" it.
2) The same test does not necessarily fail on another machine.
3) Tests start failing when we change our app/test source code.
4) Most interesting: to fix a test we change Rails code in the test suite that is not related to the failing part (for example we add more logging code, change code in a different test, move the test to another place in the file, or swap some lines).

vsespb commented 11 years ago

It seems the failure

     FAIL (0:00:03.555) test: Bulk should extract the routing value from documents. 
          <1> expected but was
          <2>.

is not related to the bug I am talking about. I think it's a different bug. It only happens when the config consists of these two lines:

index.number_of_shards: 1
index.number_of_replicas: 0

Steps to reproduce:

1) Empty the config file and put only the two lines above into it.
2) Run curl -X DELETE localhost:9200.
3) Restart Elasticsearch.
4) Run:

for i in {1..15}; do ruby -I lib:test test/integration/bulk_test.rb; done

This is for revision e76fcf214f1fed648155593920794a79daa3dce5 of Tire and Elasticsearch 0.19.12 (tested on two different machines). (Note: sleep(1) and waiting for status do not help here.)

Can you confirm this?

karmi commented 11 years ago

Hi,

I've spent a great deal of time with this yesterday evening. My conclusions:

  1. I've not been able to consistently reproduce the "randomly failing tests" scenario.
  2. I've been able to observe tests failing from time to time when running them sequentially in a large batch. My explanation is that the virtual machine was simply not able to withstand the load.
  3. I've been able to consistently reproduce the failure of the should extract the routing value from documents test. In an Ubuntu 12.04 virtual machine, this test fails all the time.
  4. This test passes every single run on Mac OS X 10.8.2, even when running in large sequential batches.

The single routing test does fail consistently, not the others. Sometimes there are other failures, but the test suite is quite heavy, and a 1 GB VM can only withstand a certain level of load. In my last runs, the only failing test was the routing one.

I have found a bug in the test: the test is run against an index with only one shard, thus the "filtering" based on routing value can't work. Why it works on a Mac, I have no idea.

I'm opening a new issue just for this, so we can move forward.

karmi commented 11 years ago

@vsespb See issue karmi/tire#540

karmi commented 11 years ago

At least we have a repeatable technique to run tests in an isolated environment via Vagrant. If you want to try it out:

Download or clone this gist:

git clone git://gist.github.com/4239703.git elasticsearch_vagrant && cd elasticsearch_vagrant

Run bundle install, etc. Then, configure ES:

echo '
{
  "elasticsearch" : {
    "cluster_name" : "elasticsearch_test_in_vagrant",
    "index_shards" : 1,
    "index_replicas" : 0
  }
}' > node.json

Install Ruby 1.9.3, Tire, etc:

echo '
# Install the Tire Rubygem
apt-get install make git libcurl3 libcurl3-gnutls libcurl4-openssl-dev libsqlite3-dev -y
apt-get install redis-server  -y
apt-get install ruby1.9.3 -y
update-alternatives --set ruby /usr/bin/ruby1.9.1
gem install bundler --no-rdoc --no-ri
test -d "tire" || git clone git://github.com/karmi/tire.git
chown -R vagrant:staff tire
cd tire
bundle install
' > install.sh

Build and provision the machine:

NODE_CONFIG=node.json INSTALL=install.sh time bundle exec vagrant up

Run the Tire tests:

echo '
cd tire

echo "Running Tire tests in '`pwd`'"
echo "Ruby: `ruby -v`"
echo "Java: `java -version 2>&1 | head -1`"

for i in {1..5}
  do
    echo "-------------------------------------------------------------------------------"
    echo "TEST $i"
    echo "-------------------------------------------------------------------------------"
    time bundle exec rake test
  if [ $? == 0 ]; then echo "$i: OK" >> results.txt; else echo "$i: FAIL">> results.txt; fi
done

echo; echo; echo "==============================================================================="
cat results.txt
' > run_tire_tests.sh

cat run_tire_tests.sh | xargs -0 bundle exec vagrant ssh -c
vsespb commented 11 years ago

Thank you for your hard work!

So, my recent testing shows that there are 3 ways to fix the randomly failing tests:

1) Add a wait for green/yellow status call after the index refresh call.

2) Set

index.number_of_shards: 1
index.number_of_replicas: 0

It helps; now that bulk_test is fixed I don't see failing tests anymore (a per-index alternative is sketched below).

3) Optimize machine performance.

All our machines are powerful: Sandy Bridge i7 (4 or 6 cores) with 16 GB or 64 GB RAM, and a 4 GB VM (the host has 16 GB). But moving the Elasticsearch data directory to a RAM drive (/run/shm, etc.) actually fixed the tests on two machines. However, if I decrease the number of CPUs for the VM from 8 to 1, the tests start failing again.

It seems (1) works almost always, but I saw one case where it didn't help. (3) makes things much better, but does not always help; it's very sensitive to overall load. And (2) seems to always work.
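
For reference, option (2) can also be applied per index from Ruby, instead of globally in elasticsearch.yml, using Tire's settings DSL; a sketch, assuming a hypothetical Article model:

class Article < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks

  # Create the index with a single shard and no replicas (useful for tests)
  settings :number_of_shards => 1, :number_of_replicas => 0 do
    mapping do
      indexes :title, :analyzer => 'snowball'
    end
  end
end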

That testing only relates to the Tire test suite; I am not sure about our application. Only waiting for failures during a work week can show whether we fixed it (strangely, new test failures only appear when we change something in the source code; running the same code many times is unlikely to produce new failures).

I already use method (1) for our tests and am going to apply both (2) and (3). I will report here if we see that it helped.

Thanks for the Vagrant manual, I will use it next time.

p.s. "should extract the routing value from documents" test was failing only with config

index.number_of_shards: 1
index.number_of_replicas: 0

It was fine with the default (empty) config. That could explain why it works on your Mac.

karmi commented 11 years ago

It was fine with the default (empty) config. That could explain why it works on your Mac.

Not really :) I have the exact same setting as a default on my machine.

Add a wait for green/yellow status call after the index refresh call.

That should really be more than enough for any test to pass. If the application tests fail, there must be some issues with concurrency, parallelism, threads, etc. I understand a brittle test suite is a really bad thing -- that's why I wanted to get to the bottom of this issue.

vsespb commented 11 years ago

FYI

I had one test which actually generated failures in Elasticsearch (mentioned in ticket #543). I found that this test was outdated and no longer needed, so I removed it.

After that, it seems we don't have problems with Elasticsearch race conditions.

I currently only wait for yellow/green status after each reindex. (I did not change the shards in the config, and did not optimize performance or move Elasticsearch to a RAM drive on the CI server.) I didn't try removing the wait for yellow/green, however.

So I have a feeling that shard failures can cause race conditions in other requests to the same Elasticsearch server.

karmi commented 11 years ago

Yes, there is quite certainly potential for some race conditions / timing issues. Note karmi/tire#541, which was failing for me consistently.

vsespb commented 11 years ago

Yep, I'm watching 541.

vsespb commented 11 years ago

Hello. Thanks for fixing PercolatorIntegrationTest. It's a pity, though, that the other test race conditions were not reproduced or investigated.

incorvia commented 11 years ago

We've also had these same issues. We used sleep, weren't aware of #refresh (we will look for documentation on this), and will try setting the shards and the other suggested options. Thanks for this thread.