Closed benlangfeld closed 10 years ago
Subscribing @lssorrow, @bklang, @vindir and @sfgeorge who have an interest in this.
Adding @Jared-Prime
I can think of nothing that would cause this unless the "foo" object was leaking in from another JRuby runtime in the same JVM (Ruby objects are not passable between JRuby instances...or at least a lot of weird things happen).
If there's concurrency involved and either the Bar or Foo classes have not yet completed initialization, I suppose it might be possible...but even that seems somewhat unlikely. Perhaps a bit more likely is if two threads jump in and try to create the same class at the same time and one of them wins. I'll check for that possibility in the codebase.
General recommendation to boot everything before threads start applies here.
For the code in question, I'd recommend putting some logging in the else clause to grab the object's class, that class's object_id, and the object_id of the class you would have expected it to match.
This is what is probably happening:
Foo = Class.new # Thread-1
f = Foo.new # Thread-1
Foo = Class.new # Thread-2
case f
when Foo then # Thread-2 replaced original constant value for Thread-1
puts "A"
else
puts "B"
end
One thing which should be in console or your log is a warning like:
-e:1: warning: already initialized constant Foo
To me this seems like the only obvious scenario based on the snippet you posted above. I don't consider this to be a JRuby bug if this is what is happening. If you have multiple threads creating classes to the same constant then there will be weird races like this. Does this make sense?
Interestingly the #object_id
of the comparison class (Bar) and the superclass of the object being compared (foo, which is a Foo which inherits from Bar) come out the same when this happens, so I don't think it's threads racing to define the classes. I'm also confident that all relevant classes are loaded before any threads are started. We also don't see any warnings like the one you mentioned.
Very confusing issue!
More questions:
We are also willing to look at your app directly, if that can be arranged (we have signed NDAs in the past for such things, or just gentleman's agreement).
Ahh...and perhaps the most important question: JRuby versions you've tried? JRuby 1.7.5 is coming out very soon, and perhaps it is fixed there or unbroken in earlier versions.
More questions:
- Do the classes in question define their own === at some point?
Nope, but they do override #eql?
and #==
, though from Object docs I wouldn't expect this to have an effect. Also note that this cannot be reproduced at low load, which would indicate that something this fundamental was not a contributing factor.
- Does JVM version or interp/compiled (-X-C, -X+C) have any effect on it?
We'll play around with that and report back.
- Have you come any closer to a reproducible case?
Unfortunately not a simple one. We will, however, attempt to trim down our (rather large) application to something more manageable and make that public.
We are also willing to look at your app directly, if that can be arranged (we have signed NDAs in the past for such things, or just gentleman's agreement).
If we cannot reproduce in a more basic Adhearsion application then we'll be in touch offline to discuss this. /cc @sfgeorge.
Ahh...and perhaps the most important question: JRuby versions you've tried? JRuby 1.7.5 is coming out very soon, and perhaps it is fixed there or unbroken in earlier versions.
All of this is on JRuby 1.7.4. We'll attempt a reproduction on master, as well as JRuby 1.7.2 (we can't use 1.7.3 due to a Fiber bug which broke Celluloid) and report back.
Headius, would there perhaps be something more recognizable I could google off of to find some info about using -X+C and X-C in my opts? Are they perhaps the same as setting -Xcomp and -Xint in their behavior?
export JRUBY_OPTS=-X-C will end up making JRuby only interpret the Ruby and not bytecode generate anything. By default, we are in a mixed mode where most stuff starts off interpreted and the more frequently called stuff generates methods using bytecode generation. We usually ask this to eliminate the problem being related to the JIT.
We bisected #1000 down to improper self being swapped out deep in our runtime internals for define_method methods (define method blocks are "special" blocks and we missed something earlier in this development cycle in a new optimization). I am strongly suspecting you are using define_method in your Class.new? Can you try master and see if this fix resolved your problem?
We're not using define_method
, no, but we will test with master ASAP.
Ran some tests with the full app that initially triggered this error with -X-C and -X+C, but nothing gleaned from it. We can't run X+C with phoenix because there are pieces that can't compile and X-C is too slow to ever get the kind of load where we'd actually cause the ruby_speech error.
damn. I was really hoping that was it.
@vindir has confirmed that it's possible to reproduce the issue on master, and is proceeding to test 1.7.2. I'll be working shortly on a reproduction that we can make public and hopefully that will be ready today.
@vindir: You can also pass -Xjit.threshold=0 to cause methods to JIT the first time they're called (since you can't use -X+C).
I'd love to know what files can't compile, btw, so I can fix it (file a separate issue)
Thanks @headius. I filed issue 1013 with the backtrace of the compile failure.
Ran tests through with the jit threshold set to 0 and saw the same error as we've been seeing: https://gist.github.com/vindir/659333b380c04e2bf7e8
1.7.2 is tested. No change in the error seen.
Ok, interesting. That is a small relief since there have been a lot of changes since 1.7.2 and none of them appear to have caused this.
Still willing to connect up to your system or something to get at a reproduction, but having one in hand would be best. I'm not sure how to proceed on this at the moment :-(
@headius Yes I think connecting to our system is the best way to give a reproduction of the issue. I sent you an email to get the ball rolling. Let me know if you didn't receive it. Thanks very much!
I don't seem to have the email from you :-( Please resend to headius@headius.com.
@headius Thanks for your help on this. We're willing to do anything we can to support resolution of this issue. I'm not sure if my email got through though. Do you have another email or method of contact that you'd recommend using?
I have your email...just been kinda swamped with 1.7.5 release.
We're planning to do 1.7.6 as a quick flip for specific issues, since 1.7.x has been branched into maintenance (we won't be doing big hit rewrites like in 1.7.5). I will respond to your email and be available to investigate this later this week or early next.
Thanks for your patience!
On Tue, Oct 1, 2013 at 5:26 PM, Stephen George notifications@github.comwrote:
@headius https://github.com/headius Thanks for your help on this. We're willing to do anything we can to support resolution of this issue. I'm not sure if my email got through though. Do you have another email or method of contact that you'd recommend using?
— Reply to this email directly or view it on GitHubhttps://github.com/jruby/jruby/issues/999#issuecomment-25497339 .
I have all the VPN info I need to log in, so we could look at this any time today or tomorrow. Next week I'm in Paris on vacation, so my availability will be limited.
Hey @headius, thanks for joining us to take a look at the JIT weirdness going on.
We're seeing the issue still after implementing a temporary if/else monkey patch to avoid the case we were hitting in embed. It looks like the failures can still occur using if/else once load gets high enough.
Error gist: https://gist.github.com/vindir/fb9905c54df1396888da
Grrr. That throws my whole lead in the toilet then.
If you feel safe about using the eval version that would be the next thing to try. I'm 99% sure it would be safe since it's not evaluating user input...merely evaluating the exact same Ruby code you have in place already.
If the if/else version is failing too there's something else weird going on here, like === not working right.
Another thing to try: wrap the code in a mutex to ensure only one thread is executing it at the same time.
I'm starting to believe the real problem is not in this code but somewhere else.
We can hook up VPN again perhaps.
If it's JIT related you may be able to get it to trigger sooner by passing -Xjit.threshold=<something smaller than 50> to force code to JIT sooner.
Another suggestion from @enebo: try changing the ::Element line in that method to be const_get(:Element) instead. We have a theory that it could be the blah::Element lookup failing due to some undiscovered problem in JRuby.
I will explore the :: possibility on my end.
Thanks @headius, we'll give those a try. Curious if the eval option will affect performance, but we can try it anyway - just to give some more info for diagnosis.
Your VPN remains intact, you're always welcome to dig in further if you have free cycles. :+1:
Thanks!
Another suggestion: try the if/else version but switch the cls === obj calls to obj.kind_of?(cls). That would eliminate === as the cause if it starts working ok.
Still investigating :: logic but not having much luck finding an issue.
Oh, and to confirm...I'm not sure if you have managed to update to 1.7.5 (1.7.6 is even better) but there were a number of ::Constant lookup fixes in 1.7.5 that could be relevant here.
I don't have free cycles at the moment but if you can throw some of these ideas at the problem (on 1.7.5 or 1.7.6) it would help us narrow things down.
Thanks for the possible alternatives.
Going through these ideas and currently trying the const_get(:Element) idea. Updates shortly.
The regression spec I added above exercises the ::Element theory. It fails hard and fast on 1.7.4- and works on 1.7.5+. It is a very good candidate for your issue.
We're not currently on 1.7.5, but I'll test it as soon as we can get it on the box.
Looks like on 1.7.4 switching over to using const_get(:Element) did the trick. We're at max load and not seeing any errors popping up from RubySpeech.
Thanks for the idea @headius and @enebo we'll run with this temporarily. Looks like using const_get here might actually be slightly faster as well, but we'll need to profile it outside of the app.
@headius I've run tests against 1.7.5 and 1.7.6 today, but things don't look too hot.
We see a ~35% drop in performance in the overall app moving from 1.7.4 to either of the new releases. No specific errors related to the performance drop to report but I can open a separate ticket for that if you'd like, just not much to put in it.
Not sure if RVM or JRuby itself does the lib validation, but we also saw an error with libffi validation during installation of 1.7.5 and 1.7.6. Opened an issue for that one assuming it may be the root of the performance but installation seems to carry on without it succeeding. Issue #1155 has the details for that.
@vindir Ok, interesting. The biggest change in 1.7.5 was the rewrite of large parts of the transcoding subsystem, and I knew it might be a performance hit I intended to fix in a maintenance release.
Do you have any setup for profiling your app?
Is there a lot of encoded text data being passed around (especially IO-related) in your app?
I will start poking at perf improvements for the existing logic today.
Oh, and if we have determined that 1.7.5/1.7.6 fix the original case/when problem, we should resolve this and open a new bug for the perf issues.
I'm going to optimistically resolve this because I'm pretty sure the colon2 issue was the cause.
Please file a new issue for performance degradation in 1.7.5/1.7.6.
Thanks a million for all of your time spent helping to sort this one out. Genuinely appreciate the assistance.
I think @benlangfeld told me you'd mentioned the transcoding system rewrite as a likely cause. Our app profiling mostly revolves around functional testing with SIPP to give us concurrent and call-per-second performance numbers, so we've got plenty of that history handy. We don't have any standard setup to run through ruby-prof or similar if that's what you mean though.
Could you narrow down the text encoding question some? Probably no more than your typical app. We have to handle some xml and we deal with a ton of string parsing with the data we get from the asterisk AMI/AGI APIs. I'll open up an issue shortly for the performance issues to reiterate the above since this is probably the wrong place to ask.
As far as the case/when/:::/if/etc problem, we didn't see it on 1.7.5 and 1.7.6 so that's awesome, but with the lower performance we only stayed on the new releases long enough to run a few rounds of testing through. We can open another issue if it creeps back up.
@vindir Ok, thanks for your patience.
Encoding-wise, you'd see hits any time you're doing String#encode(!), using Encoding::Converter, or reading/writing an IO/File that needs transcoding. It sounds like you may have plenty of this.
I'm not sure if you've ever hooked up a Java profiler, but I'd recommend YourKit as a good (commercial) option and VisualVM as a free option. Since we're profiling in development and not production, you should be able to use Oracle's Java Mission Control as well.
All three options provide sampling profiles at the Java level that may give us some hints as to where perf is going. You can also start up JRuby with --sample (-Xprof passed to underlying JVM) for global sampled profiling that prints out on exit. Sampled profiles aren't super accurate, but they'll find egregious issues.
I will look for your performance issue.
@vindir did you ever file an issue for the performance degradation? I thought of another case that could be a more likely suspect: Enumerator#next use may be spinning up a thread for more cases than in 1.7.4.
So, this is going to be wooly because I'm struggling to pin down details, but hopefully someone can give some general ideas.
The basic situation I'm talking about is this:
It seems that under certain conditions (heavy load, this is part of an Adhearsion application, so that means a large number of active threads and a lot of CPU usage building XML documents over and over), this will fall through to the
else
statement on JRuby, but this issue has not been observed on MRI.Unfortunately, I don't know yet how to reproduce this in a minimal fashion, but I was hoping someone familiar with the implementation of case and inheritance in JRuby could tell me if there were any potential conditions that spring to mind under which this kind of thing might be possible?