ros / ros_comm

ROS communications-related packages, including core client libraries (roscpp, rospy, roslisp) and graph introspection tools (rostopic, rosnode, rosservice, rosparam).
http://wiki.ros.org/ros_comm

roscpp: crash in ros::TransportPublisherLink::onRetryTimer() #577

Closed stwirth closed 8 years ago

stwirth commented 9 years ago

We witnessed a crash of diagnostic_aggregator/aggregator_node on one of our robots; this is what I could find:

Reading symbols from /opt/ros/indigo/lib/diagnostic_aggregator/aggregator_node...(no debugging symbols found)...done.
[New LWP 2295]
[New LWP 2289]
[New LWP 6780]
[New LWP 2284]
[New LWP 2288]
[New LWP 2290]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/ros/indigo/lib/diagnostic_aggregator/aggregator_node __name:=diagnostic_ag'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f59196794ef in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
(gdb) bt
#0  0x00007f59196794ef in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
#1  0x00007f591962cb00 in ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call() ()
   from /opt/ros/indigo/lib/libroscpp.so
#2  0x00007f59196554f7 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/indigo/lib/libroscpp.so
#3  0x00007f5919656303 in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/indigo/lib/libroscpp.so
#4  0x00007f59196825c4 in ros::internalCallbackQueueThreadFunc() () from /opt/ros/indigo/lib/libroscpp.so
#5  0x00007f5917bcba4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#6  0x00007f59179aa182 in start_thread (arg=0x7f590bfff700) at pthread_create.c:312
#7  0x00007f59184f947d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Unfortunately this happened only once, on one robot, and we have no way to reproduce it. Has anybody seen something like this, or can anyone give me a hint on how to debug it further?

rethink-imcmahon commented 9 years ago

+1 on this issue. I'm investigating a core with a very similar stack trace. It's incredibly difficult to reproduce, but I've managed to get two instances of the dump now:

Program terminated with signal 11, Segmentation fault.
(gdb) bt full
#0  atomic_conditional_increment (pw=0x3241) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:92
        rv = <optimized out>
#1  add_ref_lock (this=0x3239) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:139
No locals.
#2  shared_count (r=..., this=0x7f37ab7fdb18) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:589
No locals.
#3  boost::shared_ptr<ros::Subscription>::shared_ptr<ros::Subscription> (this=0x7f37ab7fdb10, r=...) at /usr/include/boost/smart_ptr/shared_ptr.hpp:414
No locals.
#4  0x00007f37b900b7b0 in lock (this=0x7f376c0275b8) at /usr/include/boost/smart_ptr/weak_ptr.hpp:178
No locals.
#5  ros::TransportPublisherLink::onRetryTimer (this=0x7f376c0275a0) at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/transport_publisher_link.cpp:204
        parent = {
          px = 0x0, 
          pn = {
            pi_ = 0x3239
          }
        }
        __PRETTY_FUNCTION__ = "void ros::TransportPublisherLink::onRetryTimer(const ros::WallTimerEvent&)"
#6  0x00007f37b8fc1e56 in operator() (a0=..., this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
No locals.
#7  ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call (this=0x7f371c000eb0)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/include/ros/timer_manager.h:184
        tracked = {
          px = 0x0, 
          pn = {
            pi_ = 0x0
          }
        }
        event = {
          last_expected = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 727748844
            }, <No data fields>}, 
          last_real = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 727984536
            }, <No data fields>}, 
          current_expected = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 827748844
            }, <No data fields>}, 
          current_real = {
            <ros::TimeBase<ros::WallTime, ros::WallDuration>> = {
              sec = 1430265753, 
              nsec = 828038627
            }, <No data fields>}, 
          profile = {
            last_duration = {
              <ros::DurationBase<ros::WallDuration>> = {
                sec = 0, 
                nsec = 424
              }, <No data fields>}
          }
        }
#8  0x00007f37b8fea4ef in ros::CallbackQueue::callOneCB (this=this@entry=0x24eb900, tls=tls@entry=0x7f37a40008c0)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/callback_queue.cpp:382
        rw_lock = {
          m = 0x7f371c0014f8, 
          is_locked = true
        }
        last_calling = 18446744073709551615
        result = ros::CallbackInterface::Invalid
        info = {
          callback = {
            px = 0x7f371c000eb0, 
            pn = {
              pi_ = 0x7f371c000910
            }
          }, 
          removal_id = 139877086151072, 
          marked_for_removal = false
        }
        cb = @0x7f37ab7fdd00: {
          px = 0x7f371c000eb0, 
          pn = {
            pi_ = 0x7f371c000910
          }
        }
        id_info = {
          px = 0x7f371c0014f0, 
          pn = {
            pi_ = 0x7f371c000a80
          }
        }
#9  0x00007f37b8fea843 in ros::CallbackQueue::callAvailable (this=0x24eb900, timeout=...)
    at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/callback_queue.cpp:333
        tls = <optimized out>
        called = 0
#10 0x00007f37b90162cc in ros::internalCallbackQueueThreadFunc () at /var/tmp/portage/sci-ros/ros-indigo-roscpp-1.11.8.0-r2/work/ros-indigo-roscpp-1.11.8.0/src/libros/init.cpp:282
        queue = {
          px = 0x24eb900, 
          pn = {
            pi_ = 0x24e7cd0
          }
        }
#11 0x00007f37bb4b594d in boost::(anonymous namespace)::thread_proxy (param=<optimized out>) at libs/thread/src/pthread/thread.cpp:164
        thread_info = {
          px = 0x24eeac0, 
          pn = {
            pi_ = 0x24eb860
          }
        }
#12 0x00007f37b8ce41da in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x00007f37b6e11d7d in clone () from /lib64/libc.so.6
No symbol table info available.

The only useful piece of information I can provide about reproducing the dump is that it happens (sometimes) as I kill a node that publishes / subscribes to the same topics that the crashing node subscribes / publishes to, respectively.

Poking around in frame 8 (it died at line 382 of src/libros/callback_queue.cpp), I see that the callback info in the TLS is not being marked_for_removal, which causes cb->call() to be executed on a subscription that has already been destroyed:

  362       CallbackInfo info = *tls->cb_it;
  ...
  375           if (info.marked_for_removal)
  376           {
  377             tls->cb_it = tls->callbacks.erase(tls->cb_it);
  378           }
  379           else
  380           {
  381             tls->cb_it = tls->callbacks.erase(tls->cb_it);
> 382             result = cb->call();
  383           }

Should this callback have been marked for removal, and if so, how would that happen?

(gdb) print info
$50 = {
  callback = {
    px = 0x7f371c000eb0, 
    pn = {
      pi_ = 0x7f371c000910
    }
  }, 
  removal_id = 139877086151072, 
  marked_for_removal = false
}
rethink-imcmahon commented 9 years ago

@dirk-thomas & @wjwwood, Have you seen any instances of this segfault at OSRF? I now have several instances from various ROS nodes, mostly on disconnect, but the core dumps are frustratingly hard to debug past a certain point (all with traces similar to the one shown above).

I'm trying to distill this crash down into a simple, reproducible example, so any thoughts or tips would be greatly appreciated.

wjwwood commented 9 years ago

@rethink-imcmahon I have not. I'll try to spend some time today or tomorrow looking at it.

rethink-imcmahon commented 9 years ago

Ok, I still don't have a nice reproducible example, but I think I have uncovered the issue. The dumps always occur in this block of offending code: https://github.com/ros/ros_comm/blob/indigo-devel/clients/roscpp/src/libros/callback_queue.cpp#L362-L383

Inside CallbackQueue::callOneCB() there is a very important info variable:

362 | CallbackInfo info = *tls->cb_it;
...
368 |    boost::shared_lock<boost::shared_mutex> rw_lock(id_info->calling_rw_mutex);
...
375 |     if (info.marked_for_removal)
376 |     {
377 |        tls->cb_it = tls->callbacks.erase(tls->cb_it);
378 |     }
379 |     else
380 |     {
381 |        tls->cb_it = tls->callbacks.erase(tls->cb_it);
382 |        result = cb->call();
383 |     }

What seems odd here is that we are not referencing the original *tls->cb_it object, but making a copy of it! To take a reference properly, we need to add an &:

362 | CallbackInfo& info = *tls->cb_it;

Furthermore, if you look at another very similar block of code in the same file, you can see that a reference to a CallbackInfo object is taken properly there: https://github.com/ros/ros_comm/blob/indigo-devel/clients/roscpp/src/libros/callback_queue.cpp#L248-L261

248 |      CallbackInfo& info = *it;
249 |
250 |      if (info.marked_for_removal)
251 |      {
252 |        it = callbacks_.erase(it);
253 |        continue;
254 |      }

Copy vs. reference? It really shouldn't matter, should it? Well, in the lines 362-383 block it does, since a shared_mutex is locked between the copy and the if (info.marked_for_removal) check. If the conditions are just right, by the time the lock is acquired and the flag is checked, the info copy is stale (marked_for_removal still false), even though the original object's marked_for_removal field has since been set to true. That would cause result = cb->call() to execute on an invalid callback and core dump.
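As a standalone illustration of that theory (made-up types, not roscpp code): a snapshot taken before the lock never sees a removal performed by another thread, while a reference does.

// Standalone sketch of the stale-copy theory (hypothetical names, not roscpp code).
#include <cassert>
#include <thread>

struct CallbackInfo
{
  bool marked_for_removal = false;
};

int main()
{
  CallbackInfo original;

  // Analogous to `CallbackInfo info = *tls->cb_it;` -- a snapshot taken before the lock.
  CallbackInfo info_copy = original;
  // Analogous to the proposed `CallbackInfo& info = *tls->cb_it;`.
  CallbackInfo& info_ref = original;

  // While callOneCB() is blocked on calling_rw_mutex, another thread runs removeByID()
  // and flips the flag on the original entry.
  std::thread remover([&original] { original.marked_for_removal = true; });
  remover.join();

  // Once the lock is finally acquired and the flag is checked, the copy still says
  // "not removed", so cb->call() would run on a dead subscription; the reference would not.
  assert(!info_copy.marked_for_removal);
  assert(info_ref.marked_for_removal);
  return 0;
}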

I'm wide open for critiques of this theory & ideas on a reliable method for reproducing this crash...

rethink-imcmahon commented 9 years ago

ros_comm built fine with the changes I proposed. Perhaps I'm just lucky, but I managed to get two core dumps in a row with modified talker / subscriber roscpp_tutorial nodes. Trying to reproduce the issue (and gain more insight into the original *tls->cb_it object), I added a couple of copies to bring them into local scope, then rebuilt. The nodes are now spinning up and down continuously every few seconds, so I'll post here if I get another useful core dump.

dirk-thomas commented 9 years ago

I have created PR #615 to add the missing reference. Even if it is unclear whether it helps solve this problem, not copying the callback info at that point is the "right" thing to do. (Update: as trivial as the patch looks, it doesn't seem to pass all unit tests.)

Can you please share some information about the scenario in which you get the crashes with the roscpp_tutorial talker / listener? Then others might be able to reproduce it the same way.

dirk-thomas commented 9 years ago

I want to summarize my thoughts from reading through the code - maybe it helps others trace the flow faster.

The callbacks in the thread-local storage (TLS) are the ones that have already been popped off the queue, to be called within a specific thread without requiring locks on the queue itself.

There are four lines which use marked_for_removal besides the constructor of the info object:

The second one checks the flag for callbacks in the queue and therefore does so with a locked mutex_.

The third and fourth check the flag for a callback in the TLS and therefore do so while holding a lock on the calling_rw_mutex. The invocations of callOneCB happen from callAvailable, which picks callbacks from the queue and puts them into the thread-local storage (while holding mutex_). The invocation of each callback in the TLS is then done mutex-free (which is the idea of the TLS).

The first one actually marks all pending callbacks in the TLS for removal. The surrounding removeByID method holds no locks while setting these flags. This method is invoked from several places whenever callbacks are being removed / deregistered. This could happen asynchronously from a different thread as well as within another callback.

But since the TLS contains only thread-local callbacks, the block changing the flags (https://github.com/ros/ros_comm/blob/ea9ca68379175d977f8f72639cf9f067953b9e6d/clients/roscpp/src/libros/callback_queue.cpp#L193-L204) can only be reached in the case of an invocation from within another callback. Otherwise the TLS should not contain anything. Therefore I think it is perfectly safe to do this without additional locking.
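To make that flow easier to follow, here is a rough, simplified sketch of the hand-off (made-up types and std:: primitives, not the actual roscpp implementation): callAvailable() moves entries from the shared queue into thread-local storage while holding mutex_, and each entry is then invoked without that lock.

// Rough, simplified sketch of the queue/TLS hand-off (made-up types, not roscpp itself).
#include <functional>
#include <list>
#include <mutex>

struct CallbackInfo
{
  std::function<void()> callback;
  bool marked_for_removal = false;
};

class FakeCallbackQueue
{
public:
  void callAvailable()
  {
    {
      // Only the hand-off from the shared queue into the TLS happens under mutex_.
      std::lock_guard<std::mutex> lock(mutex_);
      tls_callbacks_.splice(tls_callbacks_.end(), callbacks_);
    }
    // The thread-local entries are then invoked without holding mutex_.
    while (!tls_callbacks_.empty())
      callOneCB();
  }

private:
  void callOneCB()
  {
    CallbackInfo& info = tls_callbacks_.front();
    std::function<void()> cb = info.callback;   // keep the callable alive past the erase
    bool removed = info.marked_for_removal;
    tls_callbacks_.pop_front();                 // erase first, as in the real code
    if (!removed)
      cb();                                     // equivalent of `result = cb->call();`
  }

  std::mutex mutex_;                            // protects callbacks_ only
  std::list<CallbackInfo> callbacks_;           // shared queue
  static thread_local std::list<CallbackInfo> tls_callbacks_;  // per-thread working set
};

thread_local std::list<CallbackInfo> FakeCallbackQueue::tls_callbacks_;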

rethink-imcmahon commented 9 years ago

Ok, thank you for the clarifications Dirk. I'm still trying to wrap my head around some of it, but the explanation is certainly helpful. Just brainstorming ways to debug this, I was thinking of adding an artificial wait around the lock to simulate a starved thread. I'll let you know if this line of thought turns up anything.

As for the talker / listener, I'll commit them to my fork and put a link here. They did segfault overnight but in another piece of initialization code. I'll look into that later, but for now this is still my focus.

rethink-imcmahon commented 9 years ago

What are the unit tests checking for exactly?

dirk-thomas commented 9 years ago

Are you referring to the unit tests failing for #615? I don't know what they are checking exactly. I will have to look more closely into them, but it seems they are failing due to the added reference.

rethink-imcmahon commented 9 years ago

@dirk-thomas or @stwirth, do you have any updates on this bug? I'd be curious to know why the #615 unit tests were failing, and how we can fix up this patch.

stwirth commented 9 years ago

@rethink-imcmahon I don't have any updates. The bug never appeared again, so I haven't spent more time on it.

dirk-thomas commented 9 years ago

Also no updates from me. I haven't spent any more time on this since my last comment. I don't expect to have time to look further into it, and have therefore tried to summarize my investigation as well as possible so that others can continue looking into it.

jasonimercer commented 9 years ago

I've observed this crash but I don't have anything meaningful to add:

Core was generated by `/home/administrator/robot_ws/devel/lib/move_base/move_base map:=slam/map cmd_ve'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff4450347a3 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
(gdb) bt
#0  0x00007ff4450347a3 in ros::TransportPublisherLink::onRetryTimer(ros::WallTimerEvent const&) () from /opt/ros/indigo/lib/libroscpp.so
#1  0x00007ff444fe7b30 in ros::TimerManager<ros::WallTime, ros::WallDuration, ros::WallTimerEvent>::TimerQueueCallback::call() () from /opt/ros/indigo/lib/libroscpp.so
#2  0x00007ff445010777 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/indigo/lib/libroscpp.so
#3  0x00007ff445011583 in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/indigo/lib/libroscpp.so
#4  0x00007ff44503d8f4 in ros::internalCallbackQueueThreadFunc() () from /opt/ros/indigo/lib/libroscpp.so
#5  0x00007ff442acfa4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
#6  0x00007ff4428ae182 in start_thread (arg=0x7ff43b37e700) at pthread_create.c:312
#7  0x00007ff44433a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
rethink-imcmahon commented 9 years ago

We have many virtual machines running a wide variety of ROS tasks nightly, so this issue is causing a handful of coredumps every night. In trying to see why the fix I proposed in May didn't work, I added the "reference-instead-of-copy" patch locally. Eventually, I managed to make it core dump:

(gdb) bt full
#0  0x00007f0e3c5afebc in ros::CallbackQueue::callOneCB (this=0x2129460, tls=0x7f0e280008c0)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:382
        rw_lock = {m = 0x7f0ddc000938, is_locked = true}
        last_calling = 18446744073709551615
        result = ros::CallbackInterface::Invalid
        __PRETTY_FUNCTION__ = "ros::CallbackQueue::CallOneResult ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*)"
        info = @0x7f0e28001910: {callback = {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}, 
          removal_id = 139698911842016, marked_for_removal = false}
        cb = @0x7f0e28001910: {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}
        id_info = {px = 0x7f0ddc000930, pn = {pi_ = 0x7f0ddc000a80}}
#1  0x00007f0e3c5af921 in ros::CallbackQueue::callAvailable (this=0x2129460, timeout=...)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:333
        tls = 0x7f0e280008c0
        called = 0
#2  0x00007f0e3c5f5160 in ros::internalCallbackQueueThreadFunc ()
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/init.cpp:282
        queue = {px = 0x2129460, pn = {pi_ = 0x21289c0}}
#3  0x00007f0e3c6006c9 in boost::detail::thread_data<void (*)()>::run (this=0x2129cb0)
    at /usr/include/boost/thread/detail/thread.hpp:117
No locals.
#4  0x00007f0e3ab79a4a in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0
No symbol table info available.
#5  0x00007f0e3a958182 in start_thread (arg=0x7f0e34b86700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7f0e34b86700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139698990769920, -5890305797539832903, 0, 0, 139698990770624, 
                139698990769920, 5790792828027591609, 5790814652555123641}, mask_was_saved = 0}}, priv = {pad = {0x0, 
              0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x00007f0e3b6f347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) frame 0
#0  0x00007f0e3c5afebc in ros::CallbackQueue::callOneCB (this=0x2129460, tls=0x7f0e280008c0)
    at /data/users/imcmahon/roscpp_ws/src/ros-indigo-roscpp-1.11.8/src/libros/callback_queue.cpp:382
382         result = cb->call();
(gdb) print cb
$1 = (ros::CallbackInterfacePtr &) @0x7f0e28001910: {px = 0x7f0ddc0008c0, pn = {pi_ = 0x7f0ddc000910}}

This core dump was quite hard to produce; only after hundreds of subscribers were spun up and SIGINT'd did I manage to get what you see here. So it seems that before, when we were copying CallbackInfo rather than grabbing a reference to it, we were seeing partial execution of the callback (onRetryTimer). Now it appears the underlying result = cb->call() itself is no longer valid. I think this has to do with the pointer to the subscriber being destroyed, causing the dereference to fail. Before this patch, when callback_queue held a copy of CallbackInfo info, we could last a little longer and end up in more dangerous territory by trying to use a destroyed weak_ptr. According to the documentation, weak_ptr operations never throw exceptions, so roscpp must have been doing something wrong to make the weak_ptr blow up down in the assembly code, as we see in the original stack trace. I looked into some plausible causes for this: threads being destroyed while references to their callbacks still exist, or possibly this - the vector of subscriptions being swapped out during connection dropping, leaving a null pointer to the subscription vector in its place: https://github.com/ros/ros_comm/blob/ace5b9c11af766564441bad2cb08f5fca5b73784/clients/roscpp/src/libros/subscription.cpp#L158-L176

Despite the new core dump, I think I still prefer having the reference to CallbackInfo rather than the direct copy, since the copy led us into a weird assembly-level state, while the reference gives us an invalid cb->call() sooner (which can be handled appropriately). I do feel like I am dancing around the underlying issue, so if anyone can give a more satisfying explanation of these errors, I'd be happy to hear it.
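To illustrate the weak_ptr point above: boost::weak_ptr::lock() indeed never throws, and simply returns an empty shared_ptr once the target is gone, as long as the weak_ptr object itself is still alive. A minimal sketch (hypothetical types, not roscpp code):

// Why weak_ptr::lock() itself never throws: locking a valid weak_ptr whose
// target has expired just yields an empty shared_ptr.
#include <boost/shared_ptr.hpp>
#include <boost/weak_ptr.hpp>
#include <cassert>

struct Subscription {};

int main()
{
  boost::weak_ptr<Subscription> parent;
  {
    boost::shared_ptr<Subscription> sub(new Subscription);
    parent = sub;
    assert(parent.lock());   // target alive: lock() promotes to a shared_ptr
  }                          // sub destroyed here
  assert(!parent.lock());    // target gone: lock() returns an empty pointer, no throw
  return 0;
}

A bogus control-block pointer like the pi_ = 0x3239 in the first backtrace would be consistent with the weak_ptr itself (the one the TransportPublisherLink holds to its parent Subscription) living in memory that has already been freed when onRetryTimer fires, rather than with lock() misbehaving on a valid object - but that is only my reading of the trace, not something I can prove yet.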

rethink-imcmahon commented 9 years ago

The head of our QA department sent me some stats on this bug today. We run an extensive suite of tests on a nightly basis with many virtual robots in the cloud. Counting just the virtual robots (not the instances of this bug that we find on our real hardware), he found the following over the past month:

Total count of core dumps since 7/19/15: 56
Total count of test runs: 59
Total count of test suites per run: 40
On average, each suite takes 36 min.

So we saw 56 instances of the roscpp bug across a total of 2360 freshly cleaned virtual robots, each of them running heavily for 36 min.

It seems every ROS subscriber is susceptible to the crash, and nodes that regularly create and destroy subscriptions are the worst offenders. I am still actively hunting this bug down, but I would welcome any help from others in the community. @wjwwood @dirk-thomas @jacquelinekay any thoughts on how to raise awareness of this issue in the community?

wjwwood commented 9 years ago

@rethink-imcmahon I'd like to have a crack at it, but realistically I'm going to be too busy for a few weeks. Maybe you can just send an email to ros-users@ explaining that several people have taken some time to look at the problem but it eludes us. Someone might take it as a challenge and find a problem that we have overlooked.

jacquelinekay commented 9 years ago

@rethink-imcmahon here are a few ideas for creating a test to reliably reproduce this crash (since it sounds like it happens randomly):

The fact that it's difficult to reproduce but happens more frequently when there are many subscribers makes me wonder if it happens when there's a scheduling anomaly. Can you try adding a sleep right before line 382 of callback_queue.cpp?

Or, to confirm or deny your theory about the subscriptions dropping, you could make a test where one thread spins doing some pub/sub work, and another thread causes all subscriptions to drop.

jeffadamsc commented 9 years ago

I made a little progress on this today. @jacquelinekay, thanks for the tip. I added the line "usleep(100);" before line 382 of callback_queue.cpp and that did help to reproduce the bug more often. I also modified a simple test by @rethink-imcmahon as follows and I can reproduce the bug about every minute:


#include "ros/ros.h"
#include "std_msgs/String.h"
#include <sstream>
#include <boost/thread.hpp>

void chatterCallback(const std_msgs::String::ConstPtr& msg)
{
  ROS_INFO("I heard: [%s]", msg->data.c_str());
}

void subscribeDestroy() {
  ros::Time start_time;
  double rollover_time = 5.0; // seconds
  size_t num_subs = 100;
  std_msgs::String msg;
  msg.data = std::string("hello world");
  ros::NodeHandle n;
  while(ros::ok())
  {
    start_time = ros::Time::now();
    std::vector<ros::Subscriber> subs;
    for( size_t i = 0; i < num_subs; i++ )
    {
      subs.push_back(n.subscribe("chatter", 100, chatterCallback));
    }
    usleep(500000);
    // subscribers destructors called here
    std::cout<<"Rolling over! Calling dtors now. "<<"Total time: "<<(ros::Time::now() - start_time).toSec()<<std::endl;
  }
}

int main(int argc, char **argv)
{
  ros::init(argc, argv, "listener");
  boost::thread first (subscribeDestroy); 
  ros::spin();

  return 0;
}
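For reference, the delay injection mentioned above amounts to roughly the following, applied to the callback_queue.cpp block quoted earlier in this thread (a sketch of the local instrumentation, not an exact diff):

// CallbackQueue::callOneCB(), around line 381 of the 1.11.8 callback_queue.cpp
else
{
  tls->cb_it = tls->callbacks.erase(tls->cb_it);
  usleep(100);            // artificially widen the window between the erase and the call
  result = cb->call();    // line 382, where the crashes occur
}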

I've started working on solutions to the problem, but no big revelations yet.

rethink-imcmahon commented 8 years ago

@dirk-thomas I think it's safe to close this issue now that #677 is merged. Thanks!

stwirth commented 8 years ago

Awesome @rethink-imcmahon! Thanks for all the effort!

jasonimercer commented 8 years ago

Is there an ETA on when this fix will transition to the public debs? We're having lots of problems with this on the VMs here at ClearPath.

dirk-thomas commented 8 years ago

The changes in shadow-fixed were announced six days ago (http://lists.ros.org/pipermail/ros-release/2015-October/004708.html) and we are waiting for people to test them thoroughly during the soak period. They will probably be synced to public within a week from now.

jasonimercer commented 8 years ago

We've put the shadow build of ros_comm 1.11.15-0 on our internal apt servers here at ClearPath and are running it without incident.

dirk-thomas commented 8 years ago

Thanks for reporting back.

We will sync it once the shadow-fixed repo has no regressions. Currently 1.11.16 is being built for a different reason.