peelframework / peel

Peel is a framework that helps you to define, execute, analyze, and share experiments for distributed systems and algorithms.
http://peel-framework.org
Apache License 2.0

Peel's LogCollection messed up by NFS? #67

Closed: verbit closed this issue 8 years ago

verbit commented 8 years ago

Peel's LogCollection is responsible for copying the part of a system's log that corresponds to a specific run into the results folder. However, when a run is short enough (~30 s in my case), only 0 bytes are copied for some workers. I could verify this for a custom Peel extension as well as for the HDFS2 system running on a cluster of 11 nodes (wally006-017).

I assume the problem is that the underlying NFS does not keep up with synchronizing the logs, so the master node sees a stale state of a log file (maybe due to client write caching?).

I know that the problem is not Peel-specific. However, since Peel's LogCollection assumes a setup with some kind of file synchronization between nodes, this behaviour should (at least) be considered.

Possible fix: Create a similar LogCollection (maybe alongside the current one) that copies the log files from the workers to the master (e.g. via scp). Such a LogCollection would also be useful for systems with a lot of log files, since they may otherwise put too much pressure on the NFS.
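As a rough illustration of the proposed fix, the following sketch builds one scp command per worker instead of reading the logs through the shared file system. It is not Peel code: the function names, the remote log path, and the results directory are hypothetical, and only the host names are taken from the cluster mentioned above.

```python
import shlex
from pathlib import PurePosixPath


def build_scp_command(host, remote_log, local_dir):
    """Return the argv list for copying one remote log file into local_dir."""
    # Prefix the local copy with the host name so that logs from
    # different workers do not overwrite each other.
    target = PurePosixPath(local_dir) / f"{host}-{PurePosixPath(remote_log).name}"
    return ["scp", "-q", f"{host}:{remote_log}", str(target)]


def build_collection_plan(hosts, remote_log, local_dir):
    """Build one scp command per worker; the caller decides how to run them."""
    return [build_scp_command(host, remote_log, local_dir) for host in hosts]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    plan = build_collection_plan(
        ["wally006", "wally007"], "/tmp/dstat.log", "results/run01/logs"
    )
    for cmd in plan:
        print(shlex.join(cmd))
```

The sketch copies whole files; restricting the copy to the part of the log that belongs to a run would still need an offset recorded on the worker or applied after the transfer.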

aalexandrov commented 8 years ago

> However, when a run is short enough (in my case ~30s) it happens that only 0 bytes are copied for some workers.

Were you able to verify that the worker logs were non-empty after Peel exited?

verbit commented 8 years ago

Yes. I was also able to verify (for HDFS2 and my dstat extension) that data written into the system's log file during a run was not copied into the results log of that run.

aalexandrov commented 8 years ago

Can you check whether the NFS is exported with sync or async in your case?

verbit commented 8 years ago

It is async.

aalexandrov commented 8 years ago

And the dstat Peel system is running with a lifespan wider than Run, correct?

verbit commented 8 years ago

No, it is running with the RUN lifespan.

aalexandrov commented 8 years ago

My bad, even with the Run lifespan the system will be closed after the experiment finishes and the logs are copied.

OK, your suggested solution sounds reasonable. I will take a look at it in the coming weeks.

As a short-term workaround, I suggest adding a timestamp in the experiment runner just before the logs are collected.

akunft commented 8 years ago

I think the problem is not related to NFS. In the run lifecycle, the systems are first set up (setUp), which already starts dstat before the experiment is executed (execute). LogCollection.beforeRun is called in execute and sets the offset for the lines copied, so all output written before that call is omitted.

This was not a problem before, as we were only interested in the experiment output.

A simple solution would be to override beforeRun in Dstat so that it first calls LogCollection's beforeRun and then starts Dstat. Or would that cause any problems?
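The ordering issue described in this comment can be sketched with a small stand-in for an offset-based collector. This is not Peel's LogCollection: the class, the helper functions, and the line-based offset are assumptions that follow the wording of the comment. It only shows that output written before the offset is recorded is dropped, and that recording the offset first keeps it.

```python
import tempfile
from pathlib import Path


class LineOffsetCollector:
    """Hypothetical stand-in for an offset-based log collector."""

    def __init__(self, log_file):
        self.log_file = Path(log_file)
        self.offset = 0

    def before_run(self):
        # Remember how many lines exist so far; they are skipped later.
        if self.log_file.exists():
            self.offset = len(self.log_file.read_text().splitlines())
        else:
            self.offset = 0

    def after_run(self):
        # Return only the lines appended since before_run was called.
        return self.log_file.read_text().splitlines()[self.offset:]


def append(log_file, line):
    """Append one line to the log, as a monitored system would."""
    with open(log_file, "a") as f:
        f.write(line + "\n")


def run(start_monitor_before_offset):
    """Simulate one run and return the lines the collector would copy."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "dstat.log"
        collector = LineOffsetCollector(log)
        if start_monitor_before_offset:
            append(log, "header")   # monitor starts during setUp
            collector.before_run()  # offset recorded after the header
        else:
            collector.before_run()  # offset recorded first
            append(log, "header")   # monitor starts afterwards
        append(log, "1,2,3")        # data written during the run
        return collector.after_run()


if __name__ == "__main__":
    print(run(start_monitor_before_offset=True))
    print(run(start_monitor_before_offset=False))
```

In the first case the header line is lost; in the second, which corresponds to the proposed override, it is kept.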

verbit commented 8 years ago

This would not explain why 0 bytes are copied from the system's logs into the run logs. At least the logs generated between beforeRun and afterRun should be available in the run logs.

The fact that beforeRun runs after setUp is indeed another problem, as it might prevent the dstat headers from being copied over to the run logs. The current idea for fixing this is to use column numbers for value extraction.
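A minimal sketch of extraction by column number, assuming the log is comma-separated: values are read by position, so the result does not depend on whether the header row made it into the run log. The function name and the sample column layout are hypothetical.

```python
import csv
import io


def extract_column(log_text, column_index):
    """Collect numeric values from one CSV column, skipping non-data rows."""
    values = []
    for row in csv.reader(io.StringIO(log_text)):
        if column_index >= len(row):
            continue  # empty or too-short row
        try:
            values.append(float(row[column_index]))
        except ValueError:
            continue  # header or other non-numeric row
    return values


if __name__ == "__main__":
    # Hypothetical sample data; the column layout is an assumption.
    with_header = "usr,sys,idl\n1.0,2.0,97.0\n3.0,1.0,96.0\n"
    without_header = "1.0,2.0,97.0\n3.0,1.0,96.0\n"
    print(extract_column(with_header, 2))
    print(extract_column(without_header, 2))
```

The trade-off is that the column positions must then be kept in sync with the options dstat is started with.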

akunft commented 8 years ago

Yes, you're right. The other problem remains. To fix the file header problem, I would still suggest overriding beforeRun (similar to Spark).