gridgain / gridgain-old


GridGain client hangs after a networkTimeout exception. #95

Open andresgomezfrr opened 9 years ago

andresgomezfrr commented 9 years ago

Hi all,

I have found a problem that occurs when GridGain throws a networkTimeout exception. You can reproduce it with the following steps:

  1. Sample client:

I built a sample GridGain client that puts and gets random K/V objects on a grid cache. The example stores an object in the cache and queries it 100 milliseconds later.

The example's source is available on this gist: https://gist.github.com/andresgomez92/f3bf78682acaecc8cde6
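For readers who cannot open the gist, the loop can be sketched roughly as below. This is not the gist's code: a local `ConcurrentHashMap` stands in for the grid cache (an assumption made so the sketch runs without a GridGain node), and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class PutGetLoopSketch {
    /** Stores one random key/value pair, waits delayMillis, reads it back, and returns the value read. */
    static String putThenGet(Map<String, String> cache, long delayMillis) {
        String key = UUID.randomUUID().toString();
        String value = UUID.randomUUID().toString();

        cache.put(key, value);
        System.out.println("PUT  --> KEY: " + key + " VALUE: " + value);

        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and fall through to the read.
            Thread.currentThread().interrupt();
        }

        String read = cache.get(key);
        System.out.println("GET  --> KEY: " + key + " VALUE: " + read);
        return read;
    }

    public static void main(String[] args) {
        // Assumption: a local map replaces the grid cache; the real client talks to a GridGain node.
        Map<String, String> cache = new ConcurrentHashMap<>();
        for (int i = 0; i < 3; i++) {
            putThenGet(cache, 100);
        }
    }
}
```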

While the client is running, you see output like this:

PUT  --> KEY: 484c30c9-5b46-4271-9c71-6c72a8375524 VALUE: cae8750f-fee3-41cf-8467-ac84f5b8d5b7
GET  --> KEY: 484c30c9-5b46-4271-9c71-6c72a8375524 VALUE: cae8750f-fee3-41cf-8467-ac84f5b8d5b7
PUT  --> KEY: 45d78cbd-9d88-4a40-ad91-716eb82759c3 VALUE: 95e59ea9-06a4-439c-a46e-6eca83d58b36
GET  --> KEY: 45d78cbd-9d88-4a40-ad91-716eb82759c3 VALUE: 95e59ea9-06a4-439c-a46e-6eca83d58b36
PUT  --> KEY: 5cc5438b-d93a-4cce-a17d-51449c09fc29 VALUE: 4a85e0e3-2baa-4e6a-bafd-aa7ce33f8b3b
GET  --> KEY: 5cc5438b-d93a-4cce-a17d-51449c09fc29 VALUE: 4a85e0e3-2baa-4e6a-bafd-aa7ce33f8b3b
PUT  --> KEY: f1afe780-39f0-4af9-a146-423a5cd871ca VALUE: a3d9a91e-b88c-40c8-ba49-daf4ca5d16fc
GET  --> KEY: f1afe780-39f0-4af9-a146-423a5cd871ca VALUE: a3d9a91e-b88c-40c8-ba49-daf4ca5d16fc
PUT  --> KEY: 75727e96-34d5-492d-bbe5-ac9d3974014f VALUE: a62d1ff5-0568-4de3-9d77-bd031f9a426b
GET  --> KEY: 75727e96-34d5-492d-bbe5-ac9d3974014f VALUE: a62d1ff5-0568-4de3-9d77-bd031f9a426b
PUT  --> KEY: a2edad1a-c37c-4a1a-9b53-5ec3a16b6685 VALUE: 50775a30-edc8-4c79-82f5-7578c88719ce

  2. Packet loss simulation:

I use an application to simulate packet loss; you can find it here: https://github.com/tylertreat/Comcast

While my client is running, I enable packet loss simulation with this command:

 comcast --device=bond1 --packet-loss=40% 

I know that 40% packet loss is probably high, but that isn't the problem. Once packet loss is enabled, the client gets slower, and after a few minutes it throws this exception:


GET  --> KEY: f2fe30a5-efcf-4247-ae73-defaef89c587 VALUE: 6d1f465a-7e88-4842-a85e-36f1d066ae2e
PUT  --> KEY: 7125fb63-85c0-4ddc-bb0a-0bd6e1b03b5b VALUE: 639ebe67-ac2c-4880-a533-682d7e84066f
GET  --> KEY: 7125fb63-85c0-4ddc-bb0a-0bd6e1b03b5b VALUE: 639ebe67-ac2c-4880-a533-682d7e84066f
PUT  --> KEY: a837db75-b571-44cd-bd10-92061e4ed4e7 VALUE: 0133d8cb-6085-44a5-ba27-4033020c03c6
GET  --> KEY: a837db75-b571-44cd-bd10-92061e4ed4e7 VALUE: 0133d8cb-6085-44a5-ba27-4033020c03c6
Exception in thread "main" class org.gridgain.grid.cache.GridCacheAtomicUpdateTimeoutException: Cache update timeout out (consider increasing networkTimeout configuration property).
For more information see:
    Troubleshooting:      http://bit.ly/GridGain-Troubleshooting
    Documentation Center: http://bit.ly/GridGain-Documentation

    at org.gridgain.grid.kernal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture.checkTimeout(GridNearAtomicUpdateFuture.java:301)
    at org.gridgain.grid.kernal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$19.onTimeout(GridDhtAtomicCache.java:1847)
    at org.gridgain.grid.kernal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:138)
    at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
    at java.lang.Thread.run(Unknown Source)

When this happens, my client (the Java example) hangs. I then disable packet loss with this command:

 comcast --mode stop --device=bond1 

My GridGain node works fine, and I can check my K/V objects using ggvisorcmd.sh. If I stop my node, the client detects the topology change, like this:

PUT  --> KEY: 7125fb63-85c0-4ddc-bb0a-0bd6e1b03b5b VALUE: 639ebe67-ac2c-4880-a533-682d7e84066f
GET  --> KEY: 7125fb63-85c0-4ddc-bb0a-0bd6e1b03b5b VALUE: 639ebe67-ac2c-4880-a533-682d7e84066f
PUT  --> KEY: a837db75-b571-44cd-bd10-92061e4ed4e7 VALUE: 0133d8cb-6085-44a5-ba27-4033020c03c6
GET  --> KEY: a837db75-b571-44cd-bd10-92061e4ed4e7 VALUE: 0133d8cb-6085-44a5-ba27-4033020c03c6
Exception in thread "main" class org.gridgain.grid.cache.GridCacheAtomicUpdateTimeoutException: Cache update timeout out (consider increasing networkTimeout configuration property).
For more information see:
    Troubleshooting:      http://bit.ly/GridGain-Troubleshooting
    Documentation Center: http://bit.ly/GridGain-Documentation

    at org.gridgain.grid.kernal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture.checkTimeout(GridNearAtomicUpdateFuture.java:301)
    at org.gridgain.grid.kernal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$19.onTimeout(GridDhtAtomicCache.java:1847)
    at org.gridgain.grid.kernal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:138)
    at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
    at java.lang.Thread.run(Unknown Source)

[11:57:27] Topology snapshot [ver=77, nodes=1, CPUs=4, heap=3.5GB]
[11:57:48] Topology snapshot [ver=78, nodes=2, CPUs=8, heap=10.0GB]

But my GridGain client can no longer write or query K/V objects; it stays hung.

I think that when GridGain throws org.gridgain.grid.cache.GridCacheAtomicUpdateTimeoutException, the client should return null, as if the key were not found, and continue working normally.
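Until that behavior exists, a client-side approximation might look like the sketch below: each cache call runs on a worker thread, and the caller gets null if the call fails or takes too long. This uses only `java.util.concurrent`; the class name `BoundedCacheCall` and the method `callOrNull` are hypothetical, and the lambdas in `main` stand in for real cache operations. It does not fix whatever leaves the client stuck; it only keeps the calling loop responsive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCacheCall {
    // Daemon threads, so a stuck call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "cache-call");
        t.setDaemon(true);
        return t;
    });

    /** Runs op on a worker thread; returns its result, or null if it throws or exceeds timeoutMillis. */
    static <T> T callOrNull(Callable<T> op, long timeoutMillis) {
        Future<T> future = POOL.submit(op);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker and give up on this call
            return null;
        } catch (ExecutionException e) {
            return null; // the operation itself threw, e.g. a cache timeout exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for cache.get(key): one fast call and one that exceeds its time budget.
        String fast = callOrNull(() -> "value", 1000);
        String slow = callOrNull(() -> { Thread.sleep(5000); return "late"; }, 100);
        System.out.println(fast + " " + slow); // prints "value null"
    }
}
```

One caveat with this design: a null result no longer distinguishes "key not found" from "call timed out", so a real client would probably want to log the timeout separately.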

dsetrakyan commented 9 years ago

Thanks for the detailed instructions. We will try to reproduce and get back to you.

andresgomezfrr commented 9 years ago

Any update?

dsetrakyan commented 9 years ago

Can you try increasing the network timeout, as suggested by the exception? The default is 4000ms, so I would recommend setting it to 10000ms to give it enough time to deal with 40% packet loss.

If that does not help, we will need to take a look at the thread dumps from each node.
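If attaching a tool to the process is inconvenient, a dump can also be taken from inside the JVM. The following is a minimal sketch using the standard `java.lang.management` API; the class name `ThreadDumpSketch` is hypothetical, and the output format is a simplification rather than the exact format a JDK tool produces.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    /** Returns the name, state, and full stack trace of every live thread in this JVM. */
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Request lock details only where the JVM supports them.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Running `jstack <pid>` from the JDK against each node is the more common way to get the same information without changing any code.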

dsetrakyan commented 9 years ago

Also, please make sure that you are running version 6.6.2.