pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.32k stars 636 forks source link

run id conflicts (so far only repro on linux) #6785

Closed wisechengyi closed 5 years ago

wisechengyi commented 5 years ago

The CI for #6748 inconsistently fails at tests/python/pants_test/backend/jvm/tasks:jar_publish_integration with run id conflicts for buildstats.

./pants test --cache-test-ignore tests/python/pants_test/backend/jvm/tasks:jar_publish_integration -- -vs
...
                     E                      Waiting for background workers to finish.
                     E       00:32:00 00:14   [complete]
                     E                      SUCCESS
                     E   stderr:
                     E       .....timestamp: 2018-11-16T00:31:57.685637
                     E       Signal 15 was raised. Exiting with failure. (backtrace omitted)
                     E       
                     E       
                     E       INFO] killing nailgun server pid=12458
                     E       INFO] killing nailgun server pid=12518
                     E       
                     E       .INFO] killing nailgun server pid=12579
                     E       INFO] killing nailgun server pid=12596
                     E       
                     E       timestamp: 2018-11-16T00:32:00.199617
                     E       Exception caught: (sqlite3.IntegrityError)
                     E         File "/home/root/pants/src/python/pants/bin/pants_loader.py", line 75, in <module>
                     E           main()
                     E         File "/home/root/pants/src/python/pants/bin/pants_loader.py", line 71, in main
                     E           PantsLoader.run()
                     E         File "/home/root/pants/src/python/pants/bin/pants_loader.py", line 67, in run
                     E           cls.load_and_execute(entrypoint)
                     E         File "/home/root/pants/src/python/pants/bin/pants_loader.py", line 60, in load_and_execute
                     E           entrypoint_main()
                     E         File "/home/root/pants/src/python/pants/bin/pants_exe.py", line 39, in main
                     E           PantsRunner(exiter, start_time=start_time).run()
                     E         File "/home/root/pants/src/python/pants/bin/pants_runner.py", line 62, in run
                     E           return runner.run()
                     E         File "/home/root/pants/src/python/pants/bin/local_pants_runner.py", line 159, in run
                     E           self._run()
                     E         File "/home/root/pants/src/python/pants/bin/local_pants_runner.py", line 233, in _run
                     E           run_tracker_result = run_tracker.end()
                     E         File "/home/root/pants/src/python/pants/goal/run_tracker.py", line 457, in end
                     E           self.store_stats()
                     E         File "/home/root/pants/src/python/pants/goal/run_tracker.py", line 400, in store_stats
                     E           StatsDBFactory.global_instance().get_db().insert_stats(stats)
                     E         File "/home/root/pants/src/python/pants/stats/statsdb.py", line 82, in insert_stats
                     E           ri['version'], ri['buildroot'], ri['outcome'], ri['cmd_line']])
                     E  

source: https://api.travis-ci.org/v3/job/455670120/log.txt

Previous conversation:

john.sirois [6 hours ago] The most interesting lines were here:

                     logs/exceptions.15348.log +++ 
                     logs/exceptions.15348.log >>> timestamp: 2018-11-15T22:21:21.202556
                     logs/exceptions.15348.log >>> process title: python /home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py --no-pantsrc --pants-workdir=/home/travis/build/pantsbuild/pants/.pants.d/tmp/tmpeh_txb0n.pants.d --kill-nailguns --print-exception-stacktrace=True publish.jar --local=/tmp/tmp1jyhdxa1 --no-dryrun --force --override=org.pantsbuild.testproject.publish#classifiers_2.11=1.2.3 testprojects/src/scala/org/pantsbuild/testproject/publish/classifiers
                     logs/exceptions.15348.log >>> sys.argv: ['/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py', '--no-pantsrc', '--pants-workdir=/home/travis/build/pantsbuild/pants/.pants.d/tmp/tmpeh_txb0n.pants.d', '--kill-nailguns', '--print-exception-stacktrace=True', 'publish.jar', '--local=/tmp/tmp1jyhdxa1', '--no-dryrun', '--force', '--override=org.pantsbuild.testproject.publish#classifiers_2.11=1.2.3', 'testprojects/src/scala/org/pantsbuild/testproject/publish/classifiers']
                     logs/exceptions.15348.log >>> pid: 15520
                     logs/exceptions.15348.log >>> Signal 15 was raised. Exiting with failure. (backtrace omitted)
                     logs/exceptions.15348.log >>> 
                     logs/exceptions.15348.log >>> timestamp: 2018-11-15T22:21:23.030322
                     logs/exceptions.15348.log >>> process title: python /home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py --no-pantsrc --pants-workdir=/home/travis/build/pantsbuild/pants/.pants.d/tmp/tmpeh_txb0n.pants.d --kill-nailguns --print-exception-stacktrace=True publish.jar --local=/tmp/tmp1jyhdxa1 --no-dryrun --force --override=org.pantsbuild.testproject.publish#classifiers_2.11=1.2.3 testprojects/src/scala/org/pantsbuild/testproject/publish/classifiers
                     logs/exceptions.15348.log >>> sys.argv: ['/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py', '--no-pantsrc', '--pants-workdir=/home/travis/build/pantsbuild/pants/.pants.d/tmp/tmpeh_txb0n.pants.d', '--kill-nailguns', '--print-exception-stacktrace=True', 'publish.jar', '--local=/tmp/tmp1jyhdxa1', '--no-dryrun', '--force', '--override=org.pantsbuild.testproject.publish#classifiers_2.11=1.2.3', 'testprojects/src/scala/org/pantsbuild/testproject/publish/classifiers']
                     logs/exceptions.15348.log >>> pid: 15348
                     logs/exceptions.15348.log >>> Exception caught: (sqlite3.IntegrityError)

Note that there are two pants runs, the have the same tmp --pants-workdir and tmp --local and yet apparently different pids and run times as seen from external timestamping.

john.sirois [6 hours ago] This is interesting because - afaict - neither the test nor the code under test does re-tries, meaning there is no obvious way the same two tmp dirs get re-used. When tmp dirs, uuids and millisecond times all match, it seems pretty clear - maybe - that the process was forked after reporting initialization and so two processes shared the same mem state. Perhaps @ stuhood can comment here - he's the most familiar with the current state of the fork model and fork points.

yic [4 hours ago] Thanks for the pointer there! I will dig more

stuhood [1 hour ago] Danny should be able to help as well... I'm lightly available today due to some family stuff

jsirois commented 5 years ago

Summarizing the high-level idea on the bug: we have two processes with the same mem contents for uuids, tmp dirs and times to the millis - this implies a fork after those mem contents have been set.

wisechengyi commented 5 years ago

One theory is that because #6748 switched multiprocessing pool to thread pool, the issue became more apparent, whereas the forked process just died silently before without affecting the parent process.

wisechengyi commented 5 years ago

Added below to help debug:

diff --git a/src/python/pants/stats/statsdb.py b/src/python/pants/stats/statsdb.py
index e5ec166cc..62908e2da 100644
--- a/src/python/pants/stats/statsdb.py
+++ b/src/python/pants/stats/statsdb.py
@@ -77,6 +77,10 @@ class StatsDB(object):
       with self._cursor() as c:
         ri = stats['run_info']
         try:
+          import os
+          import time
+          with open('/home/workspae/pants_local/abc', 'a') as f:
+            f.write('{} {} {}\n'.format(ri['id'], os.getpid(), time.time()))
           c.execute("""INSERT INTO run_info VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
                     [ri['id'], int(float(ri['timestamp'])), ri['machine'], ri['user'],
                      ri['version'], ri['buildroot'], ri['outcome'], ri['cmd_line']])

Got

pants_run_2018_11_17_00_05_08_488_340963ceac4741c2ad7a8d8527486a53 38285 1542413120.62
> pants_run_2018_11_17_00_05_27_170_229832ad0095448a9d6a8d451921bd02 38743 1542413139.09
> pants_run_2018_11_17_00_05_27_170_229832ad0095448a9d6a8d451921bd02 38543 1542413140.41
pants_run_2018_11_17_00_05_46_661_43718262bcc4481fb1c43181358069c9 38801 1542413162.5
pants_run_2018_11_17_00_06_08_418_634c85b2de534c70a7d9aa65f3803034 39078 1542413184.38
pants_run_2018_11_17_00_06_30_278_aa633c18948d42bb9ca033ce90c89129 39355 1542413203.51
pants_run_2018_11_17_00_06_48_651_e4b041f0b50f4b299eaaa6520f71c781 39632 1542413219.53
pants_run_2018_11_17_00_07_05_71_4f9994b898074650b87bcb3b61e94b93 39892 1542413235.79
pants_run_2018_11_17_00_07_20_996_f4bdc02970c945c680e203f31eea313a 40152 1542413251.89
pants_run_2018_11_17_00_07_37_175_0589be6722d94dadb939dc47e8513d3a 40412 1542413267.38
pants_run_2018_11_17_00_07_52_675_9d6ffd5d631a4c56b947d5fd489e0a04 40659 1542413284.27
pants_run_2018_11_17_00_08_09_730_c8cee87d5611404280f13db900df2905 40919 1542413301.12
pants_run_2018_11_17_00_08_26_270_f580b7705c32417995f9019884b5fb19 41179 1542413317.04
pants_run_2018_11_17_00_08_43_104_f64aa32af3f94b3daf93733235e8ec1c 41439 1542413334.93
pants_run_2018_11_17_00_09_00_617_d24dea438a2a4bd6a3165aee76bca5d5 41697 1542413359.18
pants_run_2018_11_17_00_09_25_954_ead4f5e041624e89962209ad471214ae 42009 1542413382.21
pants_run_2018_11_17_00_04_40_615_8331facfc1064cf0999863b840062a43 38188 1542413382.74

Note the lines with >, the run ids are identical. Since it was printing the current pid right before writing to db, it was quite obvious that the process was forked.

wisechengyi commented 5 years ago

was able to repro the issue locally as below once every few runs. The special flake is [scaladoc] which has funky behavior with multiprocessing pool.

root@46e71dd0c530:/home/workspae/pants_local# rm -rf /tmp/123 .pants.d/build_invalidator/ && ./pants publish.jar --local=/tmp/123 --no-dryrun --no-prompt --force testprojects/src/scala/org/pantsbuild/testproject/publish/classifiers

06:59:54 00:00 [main]
               (To run a reporting server: ./pants server)
06:59:55 00:01   [setup]
06:59:55 00:01     [parse]
               Executing tasks in goals: bootstrap -> imports -> unpack-jars -> deferred-sources -> gen -> jvm-platform-validate -> resolve -> resources -> compile -> doc -> jar -> publish
06:59:56 00:02   [bootstrap]
06:59:56 00:02     [substitute-aliased-targets]
06:59:56 00:02     [jar-dependency-management]
06:59:56 00:02     [bootstrap-jvm-tools]
06:59:56 00:02     [provide-tools-jar]
06:59:56 00:02   [imports]
06:59:56 00:02     [ivy-imports]
06:59:56 00:02   [unpack-jars]
06:59:56 00:02     [unpack-jars]
06:59:56 00:02   [deferred-sources]
06:59:56 00:02     [deferred-sources]
06:59:56 00:02   [gen]
06:59:56 00:02     [antlr-java]
06:59:56 00:02     [antlr-py]
06:59:56 00:02     [jaxb]
06:59:56 00:02     [protoc]
06:59:56 00:02     [ragel]
06:59:56 00:02     [thrift-java]
06:59:56 00:02     [thrift-py]
06:59:56 00:02     [wire]
06:59:56 00:02     [avro-java]
06:59:56 00:02     [go-thrift]
06:59:56 00:02     [go-protobuf]
06:59:56 00:02     [jax-ws]
06:59:56 00:02     [scrooge]
06:59:56 00:02     [thrifty]
06:59:56 00:02   [jvm-platform-validate]
06:59:56 00:02     [jvm-platform-validate]
                   Invalidated 1 target.
06:59:56 00:02   [resolve]
06:59:56 00:02     [ivy]
06:59:56 00:02       [cache].
                   Invalidated 1 target.
06:59:56 00:02       [ivy-resolve]
06:59:56 00:02     [coursier]
06:59:56 00:02     [go]
06:59:56 00:02     [scala-js-compile]
06:59:56 00:02     [scala-js-link]
06:59:57 00:03     [node]
06:59:57 00:03   [resources]
06:59:57 00:03     [prepare]
06:59:57 00:03     [services]
06:59:57 00:03   [compile]
06:59:57 00:03     [node]
06:59:57 00:03     [compile-jvm-prep-command]
06:59:57 00:03       [jvm_prep_command]
06:59:57 00:03     [compile-prep-command]
06:59:57 00:03     [compile]
06:59:57 00:03     [rsc]
06:59:57 00:03     [zinc]
                   Invalidated 1 target.
06:59:57 00:03       [isolation-zinc-pool-bootstrap]
                   [1/1] Compiling 1 zinc source in 1 target (testprojects/src/scala/org/pantsbuild/testproject/publish/classifiers:classifiers).
06:59:57 00:03       [compile]

06:59:57 00:03         [cache].
06:59:59 00:05         [cache].
                     Using cached artifacts for 1 target.
07:00:00 00:06         [cache].
                     Using cached artifacts for 1 target.
07:00:00 00:06         [cache].
07:00:00 00:06         [zinc]
                       [info] Compiling 1 Scala source to /home/workspae/pants_local/.pants.d/compile/zinc/a53b6931478d/testprojects.src.scala.org.pantsbuild.testproject.publish.classifiers.classifiers/current/classes ...
                       [info] Done compiling.
                       [info] Compile success at Nov 17, 2018 7:00:01 AM [0.847s]

07:00:01 00:07     [javac]
07:00:01 00:07     [cpp]
07:00:01 00:07     [errorprone]
07:00:01 00:07     [findbugs]
07:00:01 00:07     [go]
07:00:01 00:07   [doc]
07:00:01 00:07     [javadoc]
07:00:01 00:07     [scaladoc]
                   Invalidated 1 target.model contains 7 documentable templates
timestamp: 2018-11-17T07:00:03.837912
Signal 15 was raised. Exiting with failure. (backtrace omitted)

               Waiting for background workers to finish.
07:00:03 00:09   [complete]
               FAILURE

07:00:04 00:10   [jar]
07:00:04 00:10     [create]
                   Invalidated 1 target....
07:00:04 00:10   [publish]
07:00:04 00:10     [jar]
07:00:04 00:10       [ivy-publish]
               Waiting for background workers to finish.
07:00:04 00:10   [complete]
               SUCCESS
timestamp: 2018-11-17T07:00:04.994876
Exception caught: (sqlite3.IntegrityError)
  File "/home/workspae/pants_local/src/python/pants/bin/pants_loader.py", line 75, in <module>
    main()
  File "/home/workspae/pants_local/src/python/pants/bin/pants_loader.py", line 71, in main
    PantsLoader.run()
  File "/home/workspae/pants_local/src/python/pants/bin/pants_loader.py", line 67, in run
    cls.load_and_execute(entrypoint)
  File "/home/workspae/pants_local/src/python/pants/bin/pants_loader.py", line 60, in load_and_execute
    entrypoint_main()
  File "/home/workspae/pants_local/src/python/pants/bin/pants_exe.py", line 39, in main
    PantsRunner(exiter, start_time=start_time).run()
  File "/home/workspae/pants_local/src/python/pants/bin/pants_runner.py", line 62, in run
    return runner.run()
  File "/home/workspae/pants_local/src/python/pants/bin/local_pants_runner.py", line 159, in run
    self._run()
  File "/home/workspae/pants_local/src/python/pants/bin/local_pants_runner.py", line 233, in _run
    run_tracker_result = run_tracker.end()
  File "/home/workspae/pants_local/src/python/pants/goal/run_tracker.py", line 457, in end
    self.store_stats()
  File "/home/workspae/pants_local/src/python/pants/goal/run_tracker.py", line 400, in store_stats
    StatsDBFactory.global_instance().get_db().insert_stats(stats)
  File "/home/workspae/pants_local/src/python/pants/stats/statsdb.py", line 86, in insert_stats
    ri['version'], ri['buildroot'], ri['outcome'], ri['cmd_line']])

Exception message: UNIQUE constraint failed: run_info.id

Converting to ThreadPool seems to do the trick.

diff --git a/src/python/pants/backend/jvm/tasks/jvmdoc_gen.py b/src/python/pants/backend/jvm/tasks/jvmdoc_gen.py
index 1bb9458..76096b1 100644
--- a/src/python/pants/backend/jvm/tasks/jvmdoc_gen.py
+++ b/src/python/pants/backend/jvm/tasks/jvmdoc_gen.py
@@ -9,6 +9,7 @@ import contextlib
 import multiprocessing
 import os
 import re
+from multiprocessing.pool import ThreadPool

 from pants.backend.jvm.tasks.jvm_task import JvmTask
 from pants.base.exceptions import TaskError
@@ -157,8 +158,10 @@ class JvmdocGen(SkipAndTransitiveOptionsRegistrar, HasSkipAndTransitiveOptionsMi
         jobs[gendir] = (target, command)

     if jobs:
+      # Use ThreadPool as there may be dangling processes that cause identical run id and
+      # then buildstats error downstream. https://github.com/pantsbuild/pants/issues/6785
       with contextlib.closing(
-            multiprocessing.Pool(processes=min(len(jobs), multiprocessing.cpu_count()))) as pool:
+          ThreadPool(processes=min(len(jobs), multiprocessing.cpu_count()))) as pool:
         # map would be a preferable api here but fails after the 1st batch with an internal:
         # ...
         #  File "...src/python/pants/backend/jvm/tasks/jar_create.py", line 170, in javadocjar
wisechengyi commented 5 years ago

Resolved via https://github.com/pantsbuild/pants/commit/c83f6f17866a3d114c174ba282e4dff2fd9996f7#diff-391c8afdd77f7f75e4067b347834d1c2