Closed verbit closed 8 years ago
Were you able to verify that the worker logs were non-empty after Peel exited?
Yes.
I was also able to verify (for HDFS2 and my dstat extension) that data was written to the system's log file during a run but was not copied into the results log of that run.
Can you check whether the NFS is exported with sync or async in your case?
It is async.
And the dstat Peel system is running with a lifespan wider than Run, correct?
No, it is running with the RUN lifespan.
My bad, even with the Run lifespan the system will be closed after the experiment finishes and the logs are copied.
OK, your suggested solution sounds reasonable; I will take a look at it in the next few weeks.
As a short-term workaround, I suggest adding a timestamp in the Experiment Runner just before the logs are collected.
I think the problem is not related to NFS.
In the run lifecycle, the systems are first set up (setUp), which already starts dstat before the experiment is executed (execute). LogCollection.beforeRun is called in execute (it sets the offset for the lines copied), and therefore all output written before that call is omitted.
This was not a problem before, as we were only interested in the experiment output.
A simple solution would be to override beforeRun in Dstat so that it first calls LogCollection's beforeRun and then starts dstat. Or would that cause any problems?
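To illustrate the ordering issue described above, here is a minimal Python sketch of line-offset-based log collection. It is a toy model and not Peel's actual code: the class `OffsetLogCollector`, the helper `simulate`, and the file name `dstat.csv` are all hypothetical. It only shows that whatever is written before the offset is recorded (for example a header emitted during setUp) is missing from the collected slice, and that recording the offset first keeps it.

```python
import os
import tempfile


class OffsetLogCollector:
    """Toy model of line-offset log collection (hypothetical, not Peel's code)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.offset = 0

    def before_run(self):
        # Remember how many lines the log holds right now; everything
        # written earlier is treated as not belonging to this run.
        with open(self.log_path, encoding="utf-8") as f:
            self.offset = sum(1 for _ in f)

    def after_run(self):
        # Return only the lines appended since before_run().
        with open(self.log_path, encoding="utf-8") as f:
            return "".join(f.readlines()[self.offset:])


def append(path, text):
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)


def simulate(start_monitor_before_offset):
    """Return what a run would collect for the two possible orderings."""
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "dstat.csv")
        append(log, "old,data\n")  # left over from an earlier run
        collector = OffsetLogCollector(log)
        if start_monitor_before_offset:
            append(log, "time,cpu\n")  # header written during setUp
            collector.before_run()
        else:
            collector.before_run()
            append(log, "time,cpu\n")  # header written after the offset
        append(log, "1,42\n")  # sample written while the experiment runs
        return collector.after_run()
```

With `simulate(True)` the header line is lost and only the sample is collected; with `simulate(False)`, which mirrors the proposed override, both the header and the sample end up in the run's slice.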
This would not explain why 0 bytes are copied from the system's logs into the run logs.
At least the logs generated between beforeRun and afterRun should be available in the run logs.
The fact that beforeRun starts after setUp is indeed another problem, as it might prevent the dstat headers from being copied over to the run logs. The current idea for fixing this is to use column numbers for value extraction.
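As a rough sketch of what extraction by column number could look like, the Python snippet below reads CSV rows by index and skips any line that does not parse as numbers, so it works whether or not the header made it into the run log. The column layout (`TIME_COL`, `CPU_USR_COL`) and the function `extract` are assumptions for illustration only; the real dstat output layout depends on the options it was started with.

```python
import csv
import io

# Hypothetical layout: column 0 = timestamp, column 1 = CPU user percentage.
TIME_COL = 0
CPU_USR_COL = 1


def extract(text, columns):
    """Return numeric tuples for the given column indices, one per data row."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        try:
            rows.append(tuple(float(record[i]) for i in columns))
        except (ValueError, IndexError):
            # Header, blank, or truncated line: skip it.
            continue
    return rows
```

For example, `extract(log_text, (TIME_COL, CPU_USR_COL))` yields the same rows for a log slice with a header line and for one without it. The trade-off is that the column indices must be kept in sync with the dstat invocation by hand.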
Yes, you're right. The other problem remains.
To remove the file header problem, I would still suggest overriding beforeRun (similar to Spark).
Peel's LogCollection is responsible for copying the part of a system's log that corresponds to a specific run into the results folder. However, when a run is short enough (in my case ~30s), it happens that only 0 bytes are copied for some workers. I could verify this for a custom Peel extension as well as for the HDFS2 system running on a cluster of 11 nodes (wally006-017).
I assume the problem is that the underlying NFS does not keep up with synchronizing the logs, so that the master node sees a stale state of a log file (maybe client write caching?).
I know that the problem is not Peel-specific. However, since Peel's LogCollection assumes a setup with some kind of file synchronization between nodes, this behaviour should (at least) be considered.
Possible fix: create a similar LogCollection (maybe alongside the current one) that copies the log files from the workers to the master (via scp, for example). Such a LogCollection would also be useful for systems with a lot of log files (since they may otherwise put too much pressure on the NFS).
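A minimal sketch of the scp-based idea, written in Python for illustration only: the functions `scp_command` and `collect_logs`, the remote log path, and the results directory are hypothetical and not part of Peel. The sketch assumes passwordless SSH from the master to every worker and copies the whole log file rather than just the slice belonging to one run.

```python
import os
import subprocess


def scp_command(host, remote_log, local_dir):
    """Build the scp argument list that pulls one log file from a worker."""
    # "scp -q host:path target/" copies the remote file into a local directory.
    return ["scp", "-q", f"{host}:{remote_log}", f"{local_dir}/{host}/"]


def collect_logs(hosts, remote_log, local_dir, dry_run=True):
    """Copy remote_log from every host into local_dir/<host>/.

    With dry_run=True (the default) nothing is executed; the commands
    are only returned so they can be inspected.
    """
    commands = [scp_command(h, remote_log, local_dir) for h in hosts]
    if not dry_run:
        for host, cmd in zip(hosts, commands):
            os.makedirs(os.path.join(local_dir, host), exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands
```

Calling `collect_logs(["wally006", "wally007"], "/var/log/dstat.csv", "/tmp/results")` returns the two scp commands without running them. Because each worker's file is read directly from that worker, the result does not depend on how quickly the NFS propagates writes to the master.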