scoverage / scalac-scoverage-plugin

Scoverage Scala Code Coverage Core Libs
https://github.com/scoverage
Apache License 2.0
426 stars 126 forks source link

"java.lang.NumberFormatException" Race condition in coverage file #19

Closed RichardBradley closed 10 years ago

RichardBradley commented 10 years ago

I'm seeing intermittent errors like the following when running coverage reports with "scoverage" on Windows:

java.lang.NumberFormatException: For input string: "143911841188188"

I think this is because of a race condition in Invoker.invoked which is causing the "id" and the ";" to be written non-atomically.

This may only be a problem on Windows. Perhaps two adjacent small appends are atomic on POSIX but not on Windows?

I will submit a patch to consolidate the two writes into one. This seems to fix the issue on Windows, and it seems like a good idea on POSIX as well.

RichardBradley commented 10 years ago

Oh, actually that hasn't fixed it. :( That's the problem with intermittent bugs -- you never know for sure when they're fixed. I just got the following error from sbt scoverage:test with this patch applied (I'm reasonably sure this patch is applied):

[error] (scoverage-test:executeTests) java.lang.NumberFormatException: For input string: "14523368368"
...
at scoverage.IOUtils$.invoked(IOUtils.scala:13)     

I take it this doesn't happen on unix?

I'll look into options for atomic file writes on Windows...

sksamuel commented 10 years ago

Might be windows new lines breaking it. Try stripping \r On 11 Mar 2014 09:41, "Richard Bradley" notifications@github.com wrote:

Oh, actually that hasn't fixed it. :( That's the problem with intermittent problems. I just got the following error from sbt scoverage:test with this patch applied (I'm reasonably sure this patch is applied):

error java.lang.NumberFormatException: For input string: "14523368368" ... at scoverage.IOUtils$.invoked(IOUtils.scala:13)

I take it this doesn't happen on unix?

I'll look into options for atomic file writes on Windows...

Reply to this email directly or view it on GitHubhttps://github.com/scoverage/scalac-scoverage-plugin/issues/19#issuecomment-37277670 .

RichardBradley commented 10 years ago

That's not it -- there's no whitespace in the file, just numbers and ";"s

On my machine, sometimes the numbers are interleaved with the ";"s:

Here's an extract:

 1452;14523368368;368;368;368;367;460;367;369;369;369;369;393;369;393;462;393;393;393;1025;1024;495;182;183;190;192;191;195;193;194;1134;1133;1023;1023;1023;1023;581;581;539;538;538;537;538;538;535;537;534;535;534;1023;535;;1023

Note the missing semicolons near the start, and the double semicolon near the end.

sksamuel commented 10 years ago

Are you running tests in parallel ? On 11 Mar 2014 09:58, "Richard Bradley" notifications@github.com wrote:

That's not it -- there's no whitespace in the file, just numbers and ";"s

On my machine, sometimes the numbers are interleaved with the ";"s:

Here's an extract:

1452;14523368368;368;368;368;367;460;367;369;369;369;369;393;369;393;462;393;393;393;1025;1024;495;182;183;190;192;191;195;193;194;1134;1133;1023;1023;1023;1023;581;581;539;538;538;537;538;538;535;537;534;535;534;1023;535;;1023

Note the missing semicolons near the start, and the double semicolon near the end.

Reply to this email directly or view it on GitHubhttps://github.com/scoverage/scalac-scoverage-plugin/issues/19#issuecomment-37279047 .

RichardBradley commented 10 years ago

Yes

sksamuel commented 10 years ago

It wasn't designed to be thread safe. I guess we need to synchronize on something in the writing code. I assume tests run in the same jvm at least.

On 11 March 2014 10:11, Richard Bradley notifications@github.com wrote:

Yes

Reply to this email directly or view it on GitHubhttps://github.com/scoverage/scalac-scoverage-plugin/issues/19#issuecomment-37280088 .

RichardBradley commented 10 years ago

It wasn't designed to be thread safe.

I thought you might have been relying on short file appends being atomic.

I think we should be able to do something like that. I don't see why they are not atomic. Probably the java "FileWriter" isn't using the win32 api in the right way to get atomic appends.

I guess we need to synchronize on something in the writing code.

Yes, locking should also do it. It might kill my test throughput.

I assume tests run in the same jvm at least.

Yes (but if we got atomic file appends working, we wouldn't even need to rely on that).

sksamuel commented 10 years ago

No one on linux has had issues but I don't know anyone who runs tests in parallel. Most people disable it because of other issues they get with other plugins. I had a google and apparently PrintWriter is atomic in the code but its not guaranteed in the docs so might work might not. Do you want to try that and if it works we can commit it.

On 11 March 2014 10:46, Richard Bradley notifications@github.com wrote:

It wasn't designed to be thread safe.

I thought you might have been relying on short file appends being atomic.

I think we should be able to do something like that. I don't see why they are not atomic. Probably the java "FileWriter" isn't using the win32 api in the right way to get atomic appends.

I guess we need to synchronize on something in the writing code.

Yes, locking should also do it. It might kill my test throughput.

I assume tests run in the same jvm at least.

Yes (but if we got atomic file appends working, we wouldn't even need to rely on that).

Reply to this email directly or view it on GitHubhttps://github.com/scoverage/scalac-scoverage-plugin/issues/19#issuecomment-37282716 .

RichardBradley commented 10 years ago

I just got the following error from sbt scoverage:test with this patch applied (I'm reasonably sure this patch is applied):

I was wrong. I changed the version number and did a "sbt publishLocal", but it seems it wasn't enough to get this patch to be used in my other project.

I'm trying various things to get this patch to be picked up so I can test this...

RichardBradley commented 10 years ago

Right -- I've chased down all the different version numbers which need changing to test this patch, and I can confirm that neither the patch proposed here, nor a version using PrintWriter fix this on Windows.

I'm going to do some stress testing of Invoker.invoked in isolation and come up with a proper fix. It may need Java-level locking, if I can't get atomic appends to work on Windows.

RichardBradley commented 10 years ago

I can't see any way around locking at the moment. Despite http://stackoverflow.com/questions/3032482/is-appending-to-a-file-atomic-with-windows-ntfs it seems that write(byte[]) can still be interleaved on Windows:

val writer = new FileOutputStream(path, true)
val bytes = (id.toString + ';').getBytes(Charset.defaultCharset())
writer.write(bytes)
writer.close()

I'll ask on StackOverflow if anyone knows how to achieve atomic file append in Java and submit a pull request with a lock.

RichardBradley commented 10 years ago

We should update the docs to note that using scoverage will introduce a global per-statement lock, so multi-threaded apps should test twice: once with coverage and once without (this is a sensible precaution anyway).

sksamuel commented 10 years ago

Yeah might be better to do deferred writing somehow. A lock per statement might be insane. On 11 Mar 2014 13:34, "Richard Bradley" notifications@github.com wrote:

We should update the docs to note that using scoverage will introduce a global per-statement lock, so multi-threaded apps should test twice: once with coverage and once without (this is a sensible precaution anyway).

Reply to this email directly or view it on GitHubhttps://github.com/scoverage/scalac-scoverage-plugin/issues/19#issuecomment-37295253 .

RichardBradley commented 10 years ago

How about a version which spawns a Future for each write and doesn't wait for the result? I'm not 100% familiar with Scala's behaviour in that case. Would the test runner need to wait for all the Futures to finish before generating the coverage report? How?

This version passes my load test:

def invoked(id: Int, path: String): Unit = {
  future {
    Invoker.synchronized {
      val writer = new FileWriter(path, true)
      writer.write(id.toString + ';')
      writer.close()
    }
  }
}
RichardBradley commented 10 years ago

I suppose the main alternative is a threadsafe queue, with a worker thread pulling "invoke"s off the queue and writing them to disk. That sounds like a fair bit of functionality to write and test (and race conditions to consider etc.).

I had a look at log4j to see how it deals with this sort of issue -- it has a global lock on the logger file appender.

sksamuel commented 10 years ago

The thing with log4j is the number of calls is a lot less. If we lock in the invoker then a statement like if (x) y else z we'll have 3 locks for just that one statement. I wonder if each thread can just write to its own file, and then the aggregator later can read them all in. What do you think.

RichardBradley commented 10 years ago

I wonder if each thread can just write to its own file, and then the aggregator later can read them all in. What do you think.

That sounds pretty sensible. I'll see if I can code it up over the next few days.

Or how about a global concurrent Set[Statement] per JVM and write it out on process exit? Would that be better?

What do other coverage frameworks do?

sksamuel commented 10 years ago

SCCT is the only other scala one, and that's not thread safe either IIRC. Things like jacoco instrument the bytecode by attaching to the JVM.

sksamuel commented 10 years ago

We could do something like val writer = new FileWriter(path + '-' + Thread.currentThread.getId, true) Then have the reader look for all files with the given prefix. That would work. Really though, the append says its atomic, everyone on stack overflow says its atomic, so why the hell is it not atomic :+1:

sksamuel commented 10 years ago

Writing out on exit should work too, although I'd have to think about how it knows when its finished. It won't be enough to use shutdown hooks because you might run the whole thing in a single process and then the report generators will fail because the file won't have been written.

sksamuel commented 10 years ago

This is what I have so far

def invoked(id: Int, path: String) = { val dir = new File(path) dir.mkdirs() val file = new File(path + "/" + Thread.currentThread.getId) val writer = new FileWriter(file, true) writer.append(id.toString + ';') writer.close() }

And the corresponding load

// loads all the invoked statement ids def invoked(dir: File): Seq[Int] = { dir.listFiles.map(file => { val reader = new BufferedReader(new FileReader(file)) val line = reader.readLine() reader.close() line.split(";").filterNot(.isEmpty).map(.toInt) }).flatten.toSeq }

RichardBradley commented 10 years ago

Things like jacoco instrument the bytecode by attaching to the JVM.

... but at some level jacoco must still have to deal with lots of "invoked" reports coming in from multiple threads. Do you know how they deal with that?

I notice that jacoco has better Scala support than it used to (see e.g. this pull request). Have you tried it lately?

Is this article still accurate? Maybe it would be easier to port statement-level coverage into jacoco rather than trying to port all of the features of a mature coverage suite into scoverage?

sksamuel commented 10 years ago

I think since jacoco uses the agent its not writing anything at all, so it won't need locks. It's just a big list of invoked statements. We have to write out because maven won't keep the jvm alive between phases.

Jacoco's support will never be as good as scoverage. The reason is with Scala the mapping from source to bytecode is not bijective, whereas with Java its closer to being bijective. If you do a pattern match in scala, then you might have things of interest for example (did you call the match, did you match case 1, did you match case 2, ....). The bytecode generated might be 30 ops or it might be 1 op. It's hard to come back from the bytecode and map into the source level statements.

That's why I believe its better to do AST manipulation.

RichardBradley commented 10 years ago

Jacoco's support will never be as good as scoverage

Fair enough. I've had a closer look since writing the above, and jacoco isn't as full-featured or as mature as I had first thought.

Maybe it would be easier to port statement-level coverage into jacoco rather than trying to port all of the features of a mature coverage suite into scoverage?

Looks like the latter then ;-) Sorry for the derail.

I think since jacoco uses the agent its not writing anything at all, so it won't need locks. It's just a big list of invoked statements.

I'm still not convinced that this isn't worth looking into -- jacoco's "big list of invoked statements" is the same as scoverages, in many ways. The list must be appended to by multiple threads. The list must be exported somehow when the JVM exits for out-of-process testing, and exported somehow when the tests finish for in-process testing. They must have had all the same problems we are having. We might be able to copy their solutions (both are open source). I'll look into it.

sksamuel commented 10 years ago

I agree that there are probably good solutions in jacoco that we can borrow. On 12 Mar 2014 10:20, "Richard Bradley" notifications@github.com wrote:

Jacoco's support will never be as good as scoverage

Fair enough. I've had a closer look since writing the above, and jacoco isn't as full-featured or as mature as I had first thought.

Maybe it would be easier to port statement-level coverage into jacoco rather than trying to port all of the features of a mature coverage suite into scoverage?

Looks like the latter then ;-) Sorry for the derail.

I think since jacoco uses the agent its not writing anything at all, so it won't need locks. It's just a big list of invoked statements.

I'm still not convinced that this isn't worth looking into -- jacoco's "big list of invoked statements" is the same as scoverages, in many ways. The list must be appended to by multiple threads. The list must be exported somehow when the JVM exits for out-of-process testing, and exported somehow when the tests finish for in-process testing. They must have had all the same problems we are having. We might be able to copy their solutions (both are open source). I'll look into it.

Reply to this email directly or view it on GitHubhttps://github.com/scoverage/scalac-scoverage-plugin/issues/19#issuecomment-37393189 .

RichardBradley commented 10 years ago

I had a look in Jacoco. I’m not familiar with the codebase, but it appears to me that they use a single global JVM lock on each instrumented instruction. There is a class org.jacoco.core.runtime.RuntimeData which allows you to look up an “ExecutionData” by “id”, “name” and “probeCount”, which appears to be the data structure which holds the run/not-run status for a line. This access is synchronized on the RuntimeData object, which is a singleton (it is a singleton as it is a final field of the org.jacoco.agent.rt.internal.Agent class, which is explicitly a singleton). I find this fairly surprising. Won’t this mean that apps instrumented with Jacoco will be very slow and somewhat single-threaded? Maybe I’ll ping their dev team and ask.

Using one file per thread works well for me (my highly concurrent test suite now passes). It also seems likely to be more efficient and more similar to uninstrumented code than a single global lock. I think this solution is much better than a global lock.

I have made a few minor tweaks in this pull request (#25): I added that concurrency test for Invoker/IOUtils, as this is a good record that the file-per-thread solution works, and a good starting point for experimenting on alternatives in case we revisit this in the future; and I clarified the comments on Invoker.

sksamuel commented 10 years ago

I'm very surprised that my humble change is better than what jacoco have done. I think if you ping them they may come up with some scenarios we didn't think of that render my file per thread incorrect. At the very least it will validate what we are doing here.

sksamuel commented 10 years ago

In addition. I guess at the end of the day it doesn't really matter if a test takes 1ms or 10ms. It's only important in production. So I think maybe the lack of efficiency in the jacoco style lock across the entire JVM is not really an issue.

RichardBradley commented 10 years ago

I think it's also quite likely that I have misunderstood the code and that they have a per-class coverage lock, rather than a global coverage lock. I think the "probes" get locked, for writing, which is less than global.

RichardBradley commented 10 years ago

at the end of the day it doesn't really matter if a test takes 1ms or 10ms. It's only important in production

Not entirely -- some of my integration tests failed with coverage turned on because the instrumentation slowed them down so much that they timed out waiting for something to happen and reported failure. Operations which take 100ms un-instrumented now take 2s+ with coverage turned on. I now have to have two sets of timeouts configured: one for normal test runs and one for instrumented test runs.

sksamuel commented 10 years ago

Ok fair enough On 20 Mar 2014 10:03, "Richard Bradley" notifications@github.com wrote:

at the end of the day it doesn't really matter if a test takes 1ms or 10ms. It's only important in production

Not entirely -- some of my integration tests failed with coverage turned on because the instrumentation slowed them down so much that they timed out waiting for something to happen and reported failure. Operations which take 100ms un-instrumented now take 2s+ with coverage turned on. I now have to have two sets of timeouts configured: one for normal test runs and one for instrumented test runs.

Reply to this email directly or view it on GitHubhttps://github.com/scoverage/scalac-scoverage-plugin/issues/19#issuecomment-38150521 .

sksamuel commented 10 years ago

Can you run your tests single threaded with the lock and without the locks. Doesn't matter if the output is garbled, I just want to see the different in timings. Basically I'm curious to know - is it the locking mechanism taking the time or is there actual contention going on.

sksamuel commented 10 years ago

Is 0.98.0 working ok on windows now ?

sksamuel commented 10 years ago

0.98.2 handles all this well.

RichardBradley commented 10 years ago

Thanks, yes, scoverage works fine multi-threaded on Windows now.

There are some performance issues which I have been working on for this thread, but I'll raise those as separate pull requests.

RichardBradley commented 10 years ago

FYI: it looks like Java's file writer implementation doesn't set the native "FILE_APPEND_DATA" or "O_APPEND" flag when appending to files, so it looks like it won't support atomic writes on either platform: http://stackoverflow.com/a/24620026/8261

sksamuel commented 10 years ago

Oh right. Well for our use case it's not important as each thread writes to its own file now anyway.