jruby / jruby

JRuby, an implementation of Ruby on the JVM
https://www.jruby.org
Other
3.79k stars 922 forks source link

Sometimes JRuby fails to case on type properly #999

Closed benlangfeld closed 10 years ago

benlangfeld commented 11 years ago

So, this is going to be wooly because I'm struggling to pin down details, but hopefully someone can give some general ideas.

The basic situation I'm talking about is this:

Bar = Class.new
Foo = Class.new Bar

foo = Foo.new

case foo
when Bar
  puts "Was all fine"
else
  raise "Oops, something went wrong"
end

It seems that under certain conditions (heavy load, this is part of an Adhearsion application, so that means a large number of active threads and a lot of CPU usage building XML documents over and over), this will fall through to the else statement on JRuby, but this issue has not been observed on MRI.

Unfortunately, I don't know yet how to reproduce this in a minimal fashion, but I was hoping someone familiar with the implementation of case and inheritance in JRuby could tell me if there were any potential conditions that spring to mind under which this kind of thing might be possible?

benlangfeld commented 11 years ago

Subscribing @lssorrow, @bklang, @vindir and @sfgeorge who have an interest in this.

vindir commented 11 years ago

Adding @Jared-Prime

headius commented 11 years ago

I can think of nothing that would cause this unless the "foo" object was leaking in from another JRuby runtime in the same JVM (Ruby objects are not passable between JRuby instances...or at least a lot of weird things happen).

If there's concurrency involved and either the Bar or Foo classes have not yet completed initialization, I suppose it might be possible...but even that seems somewhat unlikely. Perhaps a bit more likely is if two threads jump in and try to create the same class at the same time and one of them wins. I'll check for that possibility in the codebase.

General recommendation to boot everything before threads start applies here.

For the code in question, I'd recommend putting some logging in the else clause to grab the object's class, that class's object_id, and the object_id of the class you would have expected it to match.

enebo commented 11 years ago

This is what is probably happening:

Foo = Class.new  # Thread-1
f = Foo.new          # Thread-1
Foo = Class.new  # Thread-2
case f                   
when Foo then     # Thread-2 replaced original constant value for Thread-1
  puts "A"
else
  puts "B"
end

One thing which should be in console or your log is a warning like:

-e:1: warning: already initialized constant Foo

To me this seems like the only obvious scenario based on the snippet you posted above. I don't consider this to be a JRuby bug if this is what is happening. If you have multiple threads creating classes to the same constant then there will be weird races like this. Does this make sense?

benlangfeld commented 11 years ago

Interestingly the #object_id of the comparison class (Bar) and the superclass of the object being compared (foo, which is a Foo which inherits from Bar) come out the same when this happens, so I don't think it's threads racing to define the classes. I'm also confident that all relevant classes are loaded before any threads are started. We also don't see any warnings like the one you mentioned.

headius commented 11 years ago

Very confusing issue!

More questions:

We are also willing to look at your app directly, if that can be arranged (we have signed NDAs in the past for such things, or just gentleman's agreement).

headius commented 11 years ago

Ahh...and perhaps the most important question: JRuby versions you've tried? JRuby 1.7.5 is coming out very soon, and perhaps it is fixed there or unbroken in earlier versions.

benlangfeld commented 11 years ago

More questions:

  • Do the classes in question define their own === at some point?

Nope, but they do override #eql? and #==, though from Object docs I wouldn't expect this to have an effect. Also note that this cannot be reproduced at low load, which would indicate that something this fundamental was not a contributing factor.

  • Does JVM version or interp/compiled (-X-C, -X+C) have any effect on it?

We'll play around with that and report back.

  • Have you come any closer to a reproducible case?

Unfortunately not a simple one. We will, however, attempt to trim down our (rather large) application to something more manageable and make that public.

We are also willing to look at your app directly, if that can be arranged (we have signed NDAs in the past for such things, or just gentleman's agreement).

If we cannot reproduce in a more basic Adhearsion application then we'll be in touch offline to discuss this. /cc @sfgeorge.

Ahh...and perhaps the most important question: JRuby versions you've tried? JRuby 1.7.5 is coming out very soon, and perhaps it is fixed there or unbroken in earlier versions.

All of this is on JRuby 1.7.4. We'll attempt a reproduction on master, as well as JRuby 1.7.2 (we can't use 1.7.3 due to a Fiber bug which broke Celluloid) and report back.

vindir commented 11 years ago

Headius, would there perhaps be something more recognizable I could google off of to find some info about using -X+C and X-C in my opts? Are they perhaps the same as setting -Xcomp and -Xint in their behavior?

enebo commented 11 years ago

export JRUBY_OPTS=-X-C will end up making JRuby only interpret the Ruby and not bytecode generate anything. By default, we are in a mixed mode where most stuff starts off interpreted and the more frequently called stuff generates methods using bytecode generation. We usually ask this to eliminate the problem being related to the JIT.

enebo commented 11 years ago

We bisected #1000 down to improper self being swapped out deep in our runtime internals for define_method methods (define method blocks are "special" blocks and we missed something earlier in this development cycle in a new optimization). I am strongly suspecting you are using define_method in your Class.new? Can you try master and see if this fix resolved your problem?

benlangfeld commented 11 years ago

We're not using define_method, no, but we will test with master ASAP.

vindir commented 11 years ago

Ran some tests with the full app that initially triggered this error with -X-C and -X+C, but nothing gleaned from it. We can't run X+C with phoenix because there are pieces that can't compile and X-C is too slow to ever get the kind of load where we'd actually cause the ruby_speech error.

enebo commented 11 years ago

damn. I was really hoping that was it.

benlangfeld commented 11 years ago

@vindir has confirmed that it's possible to reproduce the issue on master, and is proceeding to test 1.7.2. I'll be working shortly on a reproduction that we can make public and hopefully that will be ready today.

headius commented 11 years ago

@vindir: You can also pass -Xjit.threshold=0 to cause methods to JIT the first time they're called (since you can't use -X+C).

I'd love to know what files can't compile, btw, so I can fix it (file a separate issue)

vindir commented 11 years ago

Thanks @headius. I filed issue 1013 with the backtrace of the compile failure.

Ran tests through with the jit threshold set to 0 and saw the same error as we've been seeing: https://gist.github.com/vindir/659333b380c04e2bf7e8

vindir commented 11 years ago

1.7.2 is tested. No change in the error seen.

headius commented 11 years ago

Ok, interesting. That is a small relief since there have been a lot of changes since 1.7.2 and none of them appear to have caused this.

Still willing to connect up to your system or something to get at a reproduction, but having one in hand would be best. I'm not sure how to proceed on this at the moment :-(

sfgeorge commented 11 years ago

@headius Yes I think connecting to our system is the best way to give a reproduction of the issue. I sent you an email to get the ball rolling. Let me know if you didn't receive it. Thanks very much!

headius commented 11 years ago

I don't seem to have the email from you :-( Please resend to headius@headius.com.

sfgeorge commented 11 years ago

@headius Thanks for your help on this. We're willing to do anything we can to support resolution of this issue. I'm not sure if my email got through though. Do you have another email or method of contact that you'd recommend using?

headius commented 11 years ago

I have your email...just been kinda swamped with 1.7.5 release.

We're planning to do 1.7.6 as a quick flip for specific issues, since 1.7.x has been branched into maintenance (we won't be doing big hit rewrites like in 1.7.5). I will respond to your email and be available to investigate this later this week or early next.

Thanks for your patience!

On Tue, Oct 1, 2013 at 5:26 PM, Stephen George notifications@github.comwrote:

@headius https://github.com/headius Thanks for your help on this. We're willing to do anything we can to support resolution of this issue. I'm not sure if my email got through though. Do you have another email or method of contact that you'd recommend using?

— Reply to this email directly or view it on GitHubhttps://github.com/jruby/jruby/issues/999#issuecomment-25497339 .

headius commented 10 years ago

I have all the VPN info I need to log in, so we could look at this any time today or tomorrow. Next week I'm in Paris on vacation, so my availability will be limited.

vindir commented 10 years ago

Hey @headius, thanks for joining us to take a look at the JIT weirdness going on.

We're seeing the issue still after implementing a temporary if/else monkey patch to avoid the case we were hitting in embed. It looks like the failures can still occur using if/else once load gets high enough.

Error gist: https://gist.github.com/vindir/fb9905c54df1396888da

headius commented 10 years ago

Grrr. That throws my whole lead in the toilet then.

If you feel safe about using the eval version that would be the next thing to try. I'm 99% sure it would be safe since it's not evaluating user input...merely evaluating the exact same Ruby code you have in place already.

If the if/else version is failing too there's something else weird going on here, like === not working right.

Another thing to try: wrap the code in a mutex to ensure only one thread is executing it at the same time.

I'm starting to believe the real problem is not in this code but somewhere else.

We can hook up VPN again perhaps.

If it's JIT related you may be able to get it to trigger sooner by passing -Xjit.threshold=<something smaller than 50> to force code to JIT sooner.

headius commented 10 years ago

Another suggestion from @enebo: try changing the ::Element line in that method to be const_get(:Element) instead. We have a theory that it could be the blah::Element lookup failing due to some undiscovered problem in JRuby.

I will explore the :: possibility on my end.

sfgeorge commented 10 years ago

Thanks @headius, we'll give those a try. Curious if the eval option will affect performance, but we can try it anyway - just to give some more info for diagnosis.

Your VPN remains intact, you're always welcome to dig in further if you have free cycles. :+1:

Thanks!

headius commented 10 years ago

Another suggestion: try the if/else version but switch the cls === obj calls to obj.kind_of?(cls). That would eliminate === as the cause if it starts working ok.

Still investigating :: logic but not having much luck finding an issue.

headius commented 10 years ago

Oh, and to confirm...I'm not sure if you have managed to update to 1.7.5 (1.7.6 is even better) but there were a number of ::Constant lookup fixes in 1.7.5 that could be relevant here.

I don't have free cycles at the moment but if you can throw some of these ideas at the problem (on 1.7.5 or 1.7.6) it would help us narrow things down.

vindir commented 10 years ago

Thanks for the possible alternatives.

Going through these ideas and currently trying the const_get(:Element) idea. Updates shortly.

headius commented 10 years ago

The regression spec I added above exercises the ::Element theory. It fails hard and fast on 1.7.4- and works on 1.7.5+. It is a very good candidate for your issue.

vindir commented 10 years ago

We're not currently on 1.7.5, but I'll test it as soon as we can get it on the box.

Looks like on 1.7.4 switching over to using const_get(:Element) did the trick. We're at max load and not seeing any errors popping up from RubySpeech.

Thanks for the idea @headius and @enebo we'll run with this temporarily. Looks like using const_get here might actually be slightly faster as well, but we'll need to profile it outside of the app.

vindir commented 10 years ago

@headius I've run tests against 1.7.5 and 1.7.6 today, but things don't look too hot.

We see a ~35% drop in performance in the overall app moving from 1.7.4 to either of the new releases. No specific errors related to the performance drop to report but I can open a separate ticket for that if you'd like, just not much to put in it.

Not sure if RVM or JRuby itself does the lib validation, but we also saw an error with libffi validation during installation of 1.7.5 and 1.7.6. Opened an issue for that one assuming it may be the root of the performance but installation seems to carry on without it succeeding. Issue #1155 has the details for that.

headius commented 10 years ago

@vindir Ok, interesting. The biggest change in 1.7.5 was the rewrite of large parts of the transcoding subsystem, and I knew it might be a performance hit I intended to fix in a maintenance release.

Do you have any setup for profiling your app?

Is there a lot of encoded text data being passed around (especially IO-related) in your app?

I will start poking at perf improvements for the existing logic today.

headius commented 10 years ago

Oh, and if we have determined that 1.7.5/1.7.6 fix the original case/when problem, we should resolve this and open a new bug for the perf issues.

headius commented 10 years ago

I'm going to optimistically resolve this because I'm pretty sure the colon2 issue was the cause.

Please file a new issue for performance degradation in 1.7.5/1.7.6.

vindir commented 10 years ago

Thanks a million for all of your time spent helping to sort this one out. Genuinely appreciate the assistance.

I think @benlangfeld told me you'd mentioned the transcoding system rewrite as a likely cause. Our app profiling mostly revolves around functional testing with SIPP to give us concurrent and call-per-second performance numbers, so we've got plenty of that history handy. We don't have any standard setup to run through ruby-prof or similar if that's what you mean though.

Could you narrow down the text encoding question some? Probably no more than your typical app. We have to handle some xml and we deal with a ton of string parsing with the data we get from the asterisk AMI/AGI APIs. I'll open up an issue shortly for the performance issues to reiterate the above since this is probably the wrong place to ask.

As far as the case/when/:::/if/etc problem, we didn't see it on 1.7.5 and 1.7.6 so that's awesome, but with the lower performance we only stayed on the new releases long enough to run a few rounds of testing through. We can open another issue if it creeps back up.

headius commented 10 years ago

@vindir Ok, thanks for your patience.

Encoding-wise, you'd see hits any time you're doing String#encode(!), using Encoding::Converter, or reading/writing an IO/File that needs transcoding. It sounds like you may have plenty of this.

I'm not sure if you've ever hooked up a Java profiler, but I'd recommend YourKit as a good (commercial) option and VisualVM as a free option. Since we're profiling in development and not production, you should be able to use Oracle's Java Mission Control as well.

All three options provide sampling profiles at the Java level that may give us some hints as to where perf is going. You can also start up JRuby with --sample (-Xprof passed to underlying JVM) for global sampled profiling that prints out on exit. Sampled profiles aren't super accurate, but they'll find egregious issues.

I will look for your performance issue.

headius commented 10 years ago

@vindir did you ever file an issue for the performance degradation? I thought of another case that could be a more likely suspect: Enumerator#next use may be spinning up a thread for more cases than in 1.7.4.