kbaseattic / assembly

An extensible framework for genome assembly.
MIT License

capture basic performance data from jobs #283

Open levinas opened 9 years ago

levinas commented 9 years ago

Minimally a four-tuple for each assembly job:

  1. Input data size
  2. Assembler/recipe
  3. Peak memory usage
  4. Execution time

This data will be used to prepare for regular worker nodes devoted to small jobs.
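For illustration, the record could be as simple as the sketch below; the field names are made up, not from the codebase:

```python
from dataclasses import dataclass

@dataclass
class JobStats:
    """One performance record per assembly job (hypothetical shape)."""
    input_size: int       # number of bases, or raw file size as a proxy
    recipe: str           # e.g. "-a velvet" or "-r smart"
    peak_memory_kb: int   # peak resident set size over the whole job
    wall_time_s: float    # elapsed time from job start to job end
```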

sebhtml commented 9 years ago

@levinas For the input data size, is this measured in bytes or in number of reads?

For the time, I suppose this can be captured by the Python code, using the elapsed time between the moment that the job starts and the moment that the job ends.

For assembly/recipe, this is obviously already available in the Python code. Is this a string?

For memory usage: GNU time and tstime can report peak memory usage and other related metrics, but I don't know whether they capture memory used by children of the main process.
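For reference, one way to scrape GNU time's report from Python; a sketch only, and whether the "Maximum resident set size" figure covers all descendants is exactly the open question here (run_with_peak_rss is a made-up name):

```python
import re
import subprocess

def run_with_peak_rss(cmd):
    """Run cmd under GNU time -v and scrape peak RSS from its report.

    Assumes /usr/bin/time is GNU time; its -v report goes to stderr,
    mixed with the command's own stderr output.
    """
    result = subprocess.run(["/usr/bin/time", "-v"] + cmd,
                            stderr=subprocess.PIPE, text=True)
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)",
                      result.stderr)
    peak_kb = int(match.group(1)) if match else None
    return result.returncode, peak_kb

# Example: run_with_peak_rss(["velveth", "out", "31", "-fastq", "reads.fq"])
```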

levinas commented 9 years ago

> For the input data size, is this measured in bytes or in number of reads?

Ideally, this should be measured in the number of bases. We have talked about running FastQC on all assembly input; maybe we should just extract this number from there. Otherwise, just the file type and the raw file size could be a good proxy, e.g. (fasta, 1G) or (fastq.bz2, 300M). (A small base-counting sketch follows at the end of this comment.)

> For the time, I suppose this can be captured by the Python code, using the elapsed time between the moment that the job starts and the moment that the job ends.

> For assembly/recipe, this is obviously already available in the Python code. Is this a string?

Yes. We could probably just capture the method string, including the "assembler/recipe/pipeline/wasp" prefix, so something like "-a velvet" or "-r smart". We could postprocess/cluster these strings later.

> For memory usage: GNU time and tstime can report peak memory usage and other related metrics, but I don't know whether they capture memory used by children of the main process.

I don't know how to do that either.
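Counting bases directly is cheap to prototype; a minimal sketch, assuming well-formed FASTA/FASTQ input, plain or gzip/bzip2-compressed (the function name is hypothetical):

```python
import bz2
import gzip
import os

def count_bases(path):
    """Rough base count for a FASTA/FASTQ file (hypothetical helper).

    Assumes well-formed input: FASTA headers start with '>', and FASTQ
    records are exactly four lines with the sequence on the second line.
    """
    if path.endswith(".gz"):
        opener = gzip.open
    elif path.endswith(".bz2"):
        opener = bz2.open
    else:
        opener = open
    bases = 0
    with opener(path, "rt") as fh:
        first = fh.read(1)
        fh.seek(0)
        if first == ">":                      # FASTA
            for line in fh:
                if not line.startswith(">"):
                    bases += len(line.strip())
        else:                                 # assume FASTQ
            for i, line in enumerate(fh):
                if i % 4 == 1:                # sequence line of each record
                    bases += len(line.strip())
    return bases

# Fallback proxy if counting is too slow: (extension, os.path.getsize(path))
```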

cbun commented 9 years ago

Can we grab the PID from the subprocesses and poll memory usage? Not sure if this is the best way.
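That polling approach could look roughly like this with the third-party psutil library (a sketch; the interval and function name are assumptions, and sampling can miss short spikes between polls):

```python
import time
import psutil  # third-party: pip install psutil

def poll_peak_rss(pid, interval=1.0):
    """Poll a process tree's total RSS until the root process exits.

    Returns a lower bound on peak memory: spikes shorter than the
    polling interval can be missed entirely.
    """
    peak = 0
    try:
        root = psutil.Process(pid)
        while root.is_running():
            rss = root.memory_info().rss
            for child in root.children(recursive=True):
                try:
                    rss += child.memory_info().rss
                except psutil.NoSuchProcess:
                    pass  # child exited between listing and sampling
            peak = max(peak, rss)
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # root process already gone
    return peak
```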

levinas commented 9 years ago

Can we implement something like a conditional pull for the compute nodes? If the data set is small, for example, the control node can tag it "small", and it could be consumed by a regular VM with 24GB memory. This is what Chris envisioned in the original architectural diagram.

cbun commented 9 years ago

Yes, I'll have to double check, but the idea is that nodes can subscribe to multiple queues, and the control server would route to the correct ones.
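On the control-server side, that routing could be as simple as publishing to a per-size queue; a sketch with pika, where the queue names and threshold are invented:

```python
import json
import pika

SMALL_JOB_BASES = 100_000_000  # hypothetical cutoff for the "small" tag

def route_job(channel, job):
    """Publish a job to a size-specific queue (illustrative only)."""
    queue = "jobs.small" if job["input_bases"] < SMALL_JOB_BASES else "jobs.large"
    channel.queue_declare(queue=queue, durable=True)
    # The default exchange routes by queue name, so no extra tag is needed.
    channel.basic_publish(exchange="", routing_key=queue, body=json.dumps(job))
```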


sebhtml commented 9 years ago

In the callback method in consume.py, the JSON payload is received. Does the tag need to be specified in channel.basic_consume?
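If the per-size queues sketched above are used, the tag would not appear in channel.basic_consume at all: a worker just subscribes to whichever queues fit its resources. A sketch using the pika 1.x signature (queue name invented):

```python
import json
import pika

def callback(channel, method, properties, body):
    job = json.loads(body)  # the existing JSON payload
    # ... run the assembly job ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
# A 24 GB VM would subscribe only to the small-job queue:
channel.basic_consume(queue="jobs.small", on_message_callback=callback)
channel.start_consuming()
```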