openzipkin / brave

Java distributed tracing implementation compatible with Zipkin backend services.
Apache License 2.0
2.36k stars 713 forks source link

ArrayIndexOutOfBoundsException from MutableSpan #1384

Closed karkum closed 5 months ago

karkum commented 1 year ago

Describe the Bug

When adding tags to an ongoing span, we occasionally see the following exception:

java.lang.ArrayIndexOutOfBoundsException: Index 4 out of bounds for length 4
  at brave.handler.MutableSpan.tag(MutableSpan.java:883)
  at brave.RealSpan.tag(RealSpan.java:112)
  at brave.SpanCustomizerShield.tag(SpanCustomizerShield.java:35)
  at brave.CurrentSpanCustomizer.tag(CurrentSpanCustomizer.java:41)

I believe this occurs in rare race-conditions in which some tags are removed while others are being added. I will attempt to reproduce this with a test, but I'm not having any luck so far. Wanted to create this issue and see if others are seeing anything similar.

Saravana-PS commented 1 year ago

Hey @karkum , I would like to work on this issue. Can you assign it on me?

SomayajuluSharma commented 6 months ago

Hello @karkum is anyone working in this?

codefromthecrypt commented 6 months ago

I think @reta is looking to remove synchronized for a different concurrent pattern, so whatever is missing here probably a good time to fix it without resurrecting synchronized ;)

codefromthecrypt commented 6 months ago

so currently it should be guarded by this in RealSpan

  @Override public Span tag(String key, String value) {
    synchronized (state) {
      state.tag(key, value);
    }
    return this;
  }

ScopedSpan is unguarded, but that's also documented as not thread safe.

The culprit is Tag.tag which accesses the mutable span without synchronizing it. @reta lemme know if you want me to add synchronized to undo later, or if you feel you're close to removing synchronized. I kindof lean towards doing synchronized and releasing a patch to fix it..

reta commented 6 months ago

@codefromthecrypt just picked https://github.com/openzipkin/brave/issues/1426 up, will address this issue as well, thanks a lot for figuring out the cause (and the fix).

reta commented 6 months ago

I kindof lean towards doing synchronized and releasing a patch to fix it..

I was looking into Tag.tag, particularly this snippet:

   if (span instanceof SpanCustomizer) {
      ((SpanCustomizer) span).tag(key, value);
    } else if (span instanceof MutableSpan) {
      ((MutableSpan) span).tag(key, value);
    }

I think it is rather difficult to fix it it just with synchronized or any kind of locking here: yes, we could synchronize over span instance (I think this is your idea), but the MutableSpan.tag could still be accessed directly and this access will be unguarded. Does it make sense? Thank you.

codefromthecrypt commented 6 months ago

might have been a red herring.. for example, a user cannot access the mutable span until it is inside a span handler, which should already be locked at that point. sorry if I led you down the wrong path.

Basically we don't want to lock fine grained mutablespan as it will explode overhead in ways maybe our benchmarks aren't written to show. If we locked internally in mutablespan, there are existing code which iterate through all things and update many things. So, better to change nothing until sure.

reta commented 6 months ago

Thanks @codefromthecrypt

might have been a red herring.. for example, a user cannot access the mutable span until it is inside a span handler, which should already be locked at that point. sorry if I led you down the wrong path.

Aha I see, will look more closely into it (definitely not suggesting we locked internally in mutablespan at the moment, I agree)