levinas opened this issue 9 years ago
@levinas For the input data size, is this measured in bytes or in number of reads?
For the time, I suppose this can be captured by the Python code, using the elapsed time between the moment the job starts and the moment it ends.
For the assembly/recipe, this is obviously already available in the Python code. Is this a string?
For memory usage: GNU time and tstime can report peak memory usage and other related metrics, but I don't know whether they capture information for the children of the main process.
> For the input data size, is this measured in bytes or in number of reads?

Ideally, this should be measured in the number of bases. We have talked about running FastQC on all assembly input; maybe we should just extract this number from there. Otherwise, the file type and the raw file size could be a good proxy (e.g., `(fasta, 1G)` or `(fastq.bz2, 300M)`).

> For the time, I suppose this can be captured by the Python code, using the elapsed time between the moment the job starts and the moment it ends.

> For the assembly/recipe, this is obviously already available in the Python code. Is this a string?

Yes. We could probably just capture the method string including the "assembler/recipe/pipeline/wasp" prefix, so something like `-a velvet` or `-r smart`. We could postprocess/cluster these strings later.

> For memory usage: GNU time and tstime can report peak memory usage and other related metrics, but I don't know whether they capture information for the children of the main process.
I don’t know how to do that either.
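For what it's worth, on Linux the standard-library `resource` module can report the peak RSS of terminated children, which may cover the common case where the Python code waits for the assembler to finish. A minimal sketch (the command here is just a stand-in, not the project's actual invocation):

```python
import resource
import subprocess
import time

def run_and_measure(cmd):
    """Run a command, returning (elapsed_seconds, peak_child_rss_kb).

    RUSAGE_CHILDREN aggregates resources of *terminated, waited-for*
    children, so ru_maxrss here is the peak RSS over the finished
    subprocess tree. Caveat: on Linux ru_maxrss is in kilobytes, on
    macOS it is in bytes, and grandchildren count only if their parent
    waited for them.
    """
    start = time.monotonic()
    subprocess.check_call(cmd)
    elapsed = time.monotonic() - start
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_kb
```

This captures both the elapsed-time and peak-memory numbers in one place, but only after the child exits; it cannot observe memory while the job is still running.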
Can we grab the PID from the subprocesses and poll memory usage? Not sure if this is the best way.
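A polling version could be built on the third-party `psutil` package (an assumption; it is not currently a project dependency), which can walk the child-process tree of a PID. A rough sketch:

```python
import subprocess

import psutil  # third-party; assumed available for this sketch

def poll_peak_rss(cmd, interval=0.1):
    """Launch cmd, poll RSS of the process and all its descendants,
    and return the peak combined RSS in bytes.

    Polling can miss short spikes between samples, so this is a
    lower bound on the true peak.
    """
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        try:
            tree = [ps] + ps.children(recursive=True)
            rss = sum(p.memory_info().rss for p in tree)
            peak = max(peak, rss)
        except psutil.NoSuchProcess:
            pass  # a process exited between enumeration and sampling
        try:
            proc.wait(timeout=interval)
        except subprocess.TimeoutExpired:
            pass  # still running; sample again
    return peak
```

The trade-off versus the GNU-time/getrusage approach: polling sees live memory (including grandchildren that are still running) but samples, so brief peaks can slip between polls.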
Can we implement something like a conditional pull for the compute nodes? If the data set is small, for example, the control node can tag it "small", and it could be consumed by a regular VM with 24GB memory. This is what Chris envisioned in the original architectural diagram.
Yes, I'll have to double check, but the idea is that nodes can subscribe to multiple queues, and the control server would route to the correct ones.
In the callback method in consume.py, the JSON payload is received as the message body. Does the tag need to be specified in channel.basic_consume?
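If routing is done by queue name (as described above), then no tag should be needed in `basic_consume` itself: the worker simply subscribes to the queue(s) matching its capacity, and the callback only has to decode the body. A sketch of what the callback shape might look like (the field names and queue name are assumptions, not the actual consume.py code):

```python
import json

def callback(ch, method, properties, body):
    """Receive one job message: the JSON payload is the message body.

    No tag is inspected here; the "small"/"large" distinction was
    already made by the queue the message was published to.
    """
    job = json.loads(body)
    # ... dispatch the assembly job here ...
    return job

# The worker would register the callback per queue, e.g. with pika:
#   channel.basic_consume(queue="jobs.small", on_message_callback=callback)
```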
Minimally a four-tuple for each assembly job: (input data size, assembly recipe, elapsed time, peak memory usage).
This data will be used to prepare for regular worker nodes devoted to small jobs.
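A minimal record for that four-tuple might look like the following; the field names and example values are placeholders, not an agreed schema:

```python
from collections import namedtuple

# Hypothetical per-job stats record covering the four metrics
# discussed in this thread.
AssemblyStats = namedtuple(
    "AssemblyStats",
    ["input_bases", "recipe", "elapsed_seconds", "peak_rss_kb"],
)

job = AssemblyStats(
    input_bases=3200000000,   # e.g. from FastQC output
    recipe="-a velvet",       # the raw method string
    elapsed_seconds=1842.5,   # wall-clock time around the job
    peak_rss_kb=18000000,     # peak memory of the job's process tree
)
```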