I'll beef up the verbosity of the error reporting in case of unbound names. Next time you run into the issue we'll hopefully have some more info. Sorry for the hassle.
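For illustration, a minimal sketch of what such enriched reporting could look like; the class and message format below are hypothetical, not Cuneiform's actual code:

```java
// Hypothetical sketch, not Cuneiform's actual code: an unbound-name error
// that also reports which names ARE bound in the enclosing block, so a
// report like the one further down carries more context.
import java.util.Set;
import java.util.TreeSet;

public class UnboundNameException extends RuntimeException {

    public UnboundNameException( String name, Set<String> boundNames ) {
        super( String.format(
            "A name '%s' is not bound in this block. Names bound here: %s",
            name, new TreeSet<>( boundNames ) ) );
    }
}
```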
How do I do that? There is no --verbose option in Cuneiform, is there?
This issue does not appear in the Erlang branch v2.2.0.
OK, the new 2.2.0 release solves this problem. I'll close the issue.
Please note that this has nothing to do with the submitted .cf file (attached just in case). It failed on just one of many similar files, and the file that failed is not the largest one. Still, it seems related to the file size being bigger than some threshold after which things get unstable: when I cut the data of this execution roughly in half and run the same workflow on semi-hourly instead of hourly data accumulation, I have never run into this issue.
```
ubuntu@scheduler2:~/CF/cuneiform$ tail -100f test_2016_02_10_09_18_hourly
INFO    Query 0889ca69-41ba-40a1-bbd4-95a8fd619437 started.
ERROR   Query 0889ca69-41ba-40a1-bbd4-95a8fd619437 failed while executing ticket 4536167165.
        java.lang.RuntimeException: "A name 'out( File )' is not bound in this block."
[trace] java.lang.RuntimeException: A name 'out( File )' is not bound in this block.
	at de.huberlin.wbi.cuneiform.core.invoc.Invocation.getStageOutList(Invocation.java:216)
	at de.huberlin.wbi.cuneiform.htcondorcre.CondorCreActor.processMsg(CondorCreActor.java:216)
	at de.huberlin.wbi.cuneiform.htcondorcre.CondorWatcher.preRec(CondorWatcher.java:61)
	at de.huberlin.wbi.cuneiform.core.actormodel.Actor.run(Actor.java:103)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
^C
[1]+  Exit 255    sudo -u condor cuneiform -c -l /mnt/pdr/.cuneiform -p htcondor continuity_hourly.cf > test_2016_02_10_09_18_hourly
```
I wonder if a bit of retry logic might help in this case, considering its very intermittent nature and the fact that these files take a long time to transfer. Just a wild guess, so please don't take it seriously! :)
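A minimal sketch of what such retry logic could look like, assuming the stage-out check can simply be re-run. The helper below is hypothetical; only `getStageOutList` appears in the trace above, and its real signature is not shown here:

```java
// Wild-guess sketch (not actual Cuneiform code): re-run a flaky action a few
// times with exponential backoff before giving up, on the theory that the
// output file may still be in transit when the stage-out check first runs.
import java.util.concurrent.Callable;

public final class Retry {

    public static <T> T withBackoff( Callable<T> action, int maxAttempts,
                                     long initialDelayMs ) throws Exception {
        long delay = initialDelayMs;
        for( int attempt = 1; ; attempt++ ) {
            try {
                return action.call();
            } catch( RuntimeException e ) {
                if( attempt >= maxAttempts )
                    throw e;               // exhausted all attempts
                Thread.sleep( delay );     // wait before the next attempt
                delay *= 2;                // back off exponentially
            }
        }
    }
}
```

Used around the failing call it might look like `Retry.withBackoff( () -> invocation.getStageOutList(), 5, 2000 )` (again, hypothetical).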
Here is how the contents of ticket 4536167165 look (no errors there; the submitted workflow is attached as continuity_hourly.cf.txt): the output has already been transferred from the worker, but no randomized-prefix soft link has been created yet.
```
ubuntu@scheduler2:/mnt/pdr/.cuneiform/4536167165$ ls -lt
total 2318528
-rw-r--r-- 1 condor condor       1134 Feb 10 15:59 1455117604091cjlog.txt
-rw-r--r-- 1 condor condor      78448 Feb 10 15:59 condor_stdout.txt
-rw-r--r-- 1 condor condor 2374070272 Feb 10 15:59 2016-02-09:09.csv.gz
-rw-r--r-- 1 condor condor         54 Feb 10 15:20 stdout.txt
-rwxrwxrwx 1 condor condor       2278 Feb 10 15:20 cfscript
-rwxr-x--- 1 condor condor        381 Feb 10 15:20 cfsubmitfile
-rw-r--r-- 1 condor condor          0 Feb 10 15:20 condor_stderr.txt
-rw-r--r-- 1 condor condor       1199 Feb 10 15:20 report.txt
-rw-r--r-- 1 condor condor          0 Feb 10 15:20 stderr.txt
```
There are a bunch of files that were successfully transferred from the workers to the scheduler already, and they look like this, which to me indicates that the problem is intermittent:

```
ubuntu@scheduler2:/mnt/pdr/.cuneiform$ ls -lt 4536167189
total 7586236
-rw-r--r-- 1 condor condor       1134 Feb 10 16:20 1455117604120cjlog.txt
-rw-r--r-- 1 condor condor        451 Feb 10 16:20 report.txt
-rw-r--r-- 1 condor condor 3884096949 Feb 10 16:20 4536167189_1_2016-02-09:12.csv.gz
-rw-r--r-- 1 condor condor 3884096949 Feb 10 16:18 2016-02-09:12.csv.gz
-rw-r--r-- 1 condor condor      86479 Feb 10 16:17 condor_stdout.txt
-rw-r--r-- 1 condor condor         54 Feb 10 15:20 stdout.txt
-rwxr-x--- 1 condor condor        381 Feb 10 15:20 cfsubmitfile
-rw-r--r-- 1 condor condor          0 Feb 10 15:20 condor_stderr.txt
-rw-r--r-- 1 condor condor          0 Feb 10 15:20 stderr.txt
-rwxrwxrwx 1 condor condor       2278 Feb 10 15:20 cfscript
ubuntu@scheduler2:/mnt/pdr/.cuneiform$ locate 2016-02-09:13
ubuntu@scheduler2:/mnt/pdr/.cuneiform$ find . -name 2016-02-09:13.csv.gz
```
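For context on the `4536167189_1_2016-02-09:12.csv.gz` entry above: a minimal sketch of the prefix-link step as I read it from the listings. The paths and naming pattern are taken from the listings, but the code itself is hypothetical, not Cuneiform's implementation:

```java
// Hypothetical sketch of the "randomized-prefix soft link" step visible in
// the healthy ticket above: the staged-out file 2016-02-09:12.csv.gz gets a
// second directory entry prefixed with the ticket id and a channel index.
// (The identical sizes in the listing could also indicate a hard link or a
// copy rather than a symlink.)
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PrefixLink {

    public static void main( String[] args ) throws IOException {
        long ticketId = 4536167189L;   // ticket directory under /mnt/pdr/.cuneiform
        int channel = 1;               // output channel index (assumed)
        Path ticketDir = Paths.get( "/mnt/pdr/.cuneiform", String.valueOf( ticketId ) );
        Path original = ticketDir.resolve( "2016-02-09:12.csv.gz" );
        Path link = ticketDir.resolve(
            ticketId + "_" + channel + "_" + original.getFileName() );

        // In the failing ticket 4536167165 this second entry never appeared.
        if( !Files.exists( link ) )
            Files.createSymbolicLink( link, original );
    }
}
```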