osrf / robocup3ds

Gazebo support for the RoboCup 3D simulation league.
11 stars 2 forks source link

Deadlock in RCPServer::Send() #12

Closed osrf-migration closed 8 years ago

osrf-migration commented 9 years ago

Original report (archived issue) by jasonzliang NA (Bitbucket: jasonzliang).


A deadlock condition sometimes occurs with the lock guard in the Send() method in RCPServer:

#!c++

bool RCPServer::Send(const int _socket, const char *_data, const size_t _len)
{
  if (!this->enabled)
  {
    std::cerr << "RCPServer::Send() error: Service not enabled yet"
              << std::endl;
    return false;
  }

  std::lock_guard<std::mutex> lock(this->mutex);

  bool found = false;
  for (size_t i = 1; i < this->pollSockets.size(); ++i)
  {
    if (this->pollSockets.at(i).fd == _socket)
    {
      found = true;
      break;
    }
  }

  if (!found)
  {
    std::cerr << "Socket not found" << std::endl;
    return false;
  }

  // Send data using the soket.
  auto bytes_written = write(_socket, _data, _len);
  if (bytes_written < 0)
  {
    std::cerr << "ERROR writing to socket" << std::endl;
    return false;
  }

  return true;
}

To reproduce:

  1. Start gazebo with robocup3ds plugin and connect with the magma agent (download here: http://chaosscripting.net/files/magma-protocol-tester.zip )

  2. enter the following strings (clicking send after each one)
    (scene NaoSimsparkBT)
    (init (unum 5)(teamname magmaOffenburg))

  3. Confirm that agent is loaded on field. After this send as many "(syn)" messages until the timer stops advancing.

osrf-migration commented 9 years ago

Original comment by jasonzliang NA (Bitbucket: jasonzliang).


osrf-migration commented 9 years ago

Original comment by jasonzliang NA (Bitbucket: jasonzliang).


osrf-migration commented 9 years ago

Original comment by Stefan Glaser (Bitbucket: umscht).


Hi all, the magma protocol tester has two "send" buttons. The "send init"-button sends the scene-string, the init-effector and a beam-effector at once. The other "send" button sends the text that is contained in the text-field next to it, which is initially "(syn)". So, in order to reproduce the problem I usually do the following:

  1. Start gazebo and magma protocol tester
  2. Hit "connect" in protocol tester
  3. Hit "send init" button (it should send "(scene NAO_Type_Zero) (init (unum 5) (teamname magmaOffenburg)) (beam 1 1 45.0)" as one message)
  4. Confirm that agent is loaded on field and beamed correctly
  5. Hit the "send" button several times (sending a simple "(syn)" message) until the server freezes (usually after the first couple of times the server freezes)

Alternatively you can send different messages, e.g. an action message for one of the neck joints: "(he1 1.0)"

The status bar lists the number of sent and received messages. This helps to verify that the server still sends perception messages.

The "verbose" checkbox enables or disables the output of perception-messages via std-out.

osrf-migration commented 8 years ago

Original comment by Carlos Agüero (Bitbucket: caguero, GitHub: caguero).


Actually, it was not a deadlock. This was the problem:

  1. We receive data from a client and we call DispatchRequestOnClientSocket().
  2. We find the socket that has reported activity and we call Parse().
  3. Parse() was trying to read 4 bytes containing the length of the following s-expression:
#!c++

int endianSize = recv(_socket, this->buffer, 4, 0);

Here's the problem because we don't have any guarantee that we're going to read the entire 4 bytes here. We could read 2 for example. The code assumed that we were reading 4 bytes.

See pull request #42 for a solution.

@umscht , @jasonzliang Could you give it a shot to confirm that the patch fixes the problem? You would have to compile robocup3d from sources.

osrf-migration commented 8 years ago

Original comment by Stefan Glaser (Bitbucket: umscht).


Hi Carlos, this definitely looks like an error source. Especially since we turned off buffering in the TCP/IP stack and send the length information byte by byte.

I tested your fix in pull request #42 and it solves this issue for our agent and the protocol-tester tool I provided. We can connect, initialize, beam and send commands without crashing.

But, I also noticed that when I disconnect my agent, the server still crashes with the following output:

RCPServer::Send() Socket not found.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Should I open another issue for that?

Thank you very much for your effort!

osrf-migration commented 8 years ago

Original comment by Carlos Agüero (Bitbucket: caguero, GitHub: caguero).


That seems a different problem. It would be awesome if you can create a separate issue, thanks!

You can also comment/approve the pull request.

osrf-migration commented 8 years ago

Original comment by Carlos Agüero (Bitbucket: caguero, GitHub: caguero).


Merged in issue_12 (pull request #42)

Fix issue #12

→ \<\<cset 2564cd30a8090452824d77fe412716118c1bf930>>