netj / 3x

3X — a Workbench for eXecutable eXploratory eXperiments
http://netj.github.io/3x/

Aborted run #14

Open mhortis opened 10 years ago

mhortis commented 10 years ago

Hello,

I've been running an experiment that takes approx. 4 hours to execute and produces approx. 90000 output lines. When the experiment is finished, it is shown as Aborted and not as Complete. Looking in the run overview, I can see that 5 of the experiment outputs produce the error "cat: write error: Broken pipe". I wonder how I can overcome this error and if it is caused by 3x or by my OS (Ubuntu Server 12.04).

netj commented 10 years ago

Hi, I'm sorry to hear that the long running runs were not marked complete as expected.

I'm trying to figure out at which stage of 3X's execution this could've happened. It would help a lot if you could attach a copy of a run overview page of your aborted runs here, removing most of that 90k lines and any other text you may not want to share. Also, could you attach the log file .3x/gui/log.runs under your repository, which contains useful info for debugging?

mhortis commented 10 years ago

1) I am running on a local machine.
2) I can see the output in stdout, both in the GUI and in the files under the workdir. The run looks finished, as all the expected output is displayed.
3) In the GUI, the errors are not displayed in the general stderr section but in the stderr section under each of the outputs that "failed" (even though those outputs are displayed in the output and stdout sections).

In the log.runs log file I can see the following lines repeatedly (assuming one repeat per run):

Error: near line 2: no such column: run/queue/main local[0]: Aborted Execution

I'm attaching two screenshots of the overview page, showing only the output section and the stderr under some of the outputs.

[Screenshots: 2014-01-30 at 10:08:50 pm; 2014-01-30 at 10:09:07 pm]

Thanks for your help!

netj commented 10 years ago

After trying to reproduce the bug and inspecting 3X's code quite a bit, I suspect the runs were aborted after 3X failed to store the output values in its index.

Here is a list of things you can try to prevent future runs from getting stuck in the ABORTED state:

  1. Rebuild the index by running the following command in your repository:

    3x index rebuild

    After this, you will probably see the outputs of the ABORTED runs in the Results tab of the GUI.

    To mark all the ABORTED runs as DONE, run the following command:

    3x hack queue list-only serial state#=ABORTED | xargs 3x hack queue mark-as DONE

    Alternatively, you can selectively mark the runs as DONE by passing their "serial" numbers as additional command-line arguments to:

    3x hack queue mark-as DONE 
  2. If your output value can contain whitespace characters, the v0.9 release might not work correctly. I recommend using the LATEST version instead (a version that contains aa35a38).
  3. The "no such column" error was probably caused by not running 3x define sync after adding new input or output parameters. If you add parameters with 3x define yourself, please make sure you also run 3x define sync before starting any run. I'll fix 3X to handle this automatically soon.
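
The steps above could be collected into one small script, roughly as sketched below. It uses only the 3x commands quoted in this comment. The function name recover_aborted_runs and the DRY_RUN variable are hypothetical helpers of this sketch, not part of 3X, and the order of the steps simply follows the list above. By default it only prints the commands, so you can review them before running anything in your repository.

```shell
#!/bin/sh
# Sketch only: sequences the 3x commands quoted in this thread.
# DRY_RUN and recover_aborted_runs are hypothetical names, not 3X features.
DRY_RUN=${DRY_RUN:-1}

recover_aborted_runs() {
    if [ "$DRY_RUN" = 1 ]; then
        # Dry run: show the commands without executing them.
        echo "3x index rebuild"
        echo "3x hack queue list-only serial state#=ABORTED | xargs 3x hack queue mark-as DONE"
        echo "3x define sync"
        return 0
    fi
    # Rebuild the index so outputs of the ABORTED runs become visible again.
    3x index rebuild || return 1
    # Mark every ABORTED run as DONE.
    3x hack queue list-only serial state#=ABORTED |
        xargs 3x hack queue mark-as DONE || return 1
    # Sync input/output definitions before starting any new run.
    3x define sync
}

recover_aborted_runs
```

Run it from the root of your repository. With DRY_RUN=0 set in the environment it executes the commands instead of printing them and stops at the first step that fails.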