I'm just outputting simple human-readable sentences. In hindsight, I should have output some sort of csv/tsv format that can be copy/pasted right into a table....
I think we are waiting for the offset on sorting, and thus the Edgeset -> Edge transition, to do the full benchmarks. So, should we come up with a format that we can both add to our output?
Sure, I'm happy with something relatively simple: either a plain .csv.gz that we can append to later, or pandas.DataFrames.
Some table format like: N theta rho total_runtime [other columns for time spent in other stuff...]
Looks good. Let's go with csv / csv.gz for the on-disk output
Any thought on column names? Mine look like this:
Time spent in C++ simulation was 19467.071261857698 seconds (26.9505083028232% of total).
Time spent related to msprime functionality:
Prepping: 251.4973750114441 seconds (0.3481767751405126%).
Appending: 2471.059741258621 seconds (3.4209725324243494%).
Sorting: 41988.04032611847 seconds (58.1288789775738%).
Simplifying: 8054.999574661255 seconds (11.151463412038149%).
There are 118 segregating mutations. There are 10 fixations.
The breakdown within msprime is quite useful for diagnostics, but we may not be able to duplicate it exactly in ftprime. I think we should still output it, but maybe not save it to the csv?
So for the results csv we would have:
N theta rho total_runtime fwd_sim_runtime msprime_runtime
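For concreteness, a minimal sketch of appending one such row with Python's csv module (the filename and the values here are made up, not from the actual scripts):

```python
# Minimal sketch: append one benchmark row to a shared results csv.
# Column names follow the proposal above; values and filename are made up.
import csv
import os

FIELDS = ["N", "theta", "rho", "total_runtime",
          "fwd_sim_runtime", "msprime_runtime"]

row = {"N": 1000, "theta": 100.0, "rho": 100.0,
       "total_runtime": 123.4, "fwd_sim_runtime": 80.2,
       "msprime_runtime": 43.2}

new_file = not os.path.exists("results.csv")
with open("results.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    if new_file:
        writer.writeheader()
    writer.writerow(row)
```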
Maybe we should also add the gc interval, with column header simplify_interval?
Sure, not a bad idea. I've done that separately; 1,000 is the fastest for me, so that's what I'm using in the benchmarks varying theta/rho.
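That kind of sweep could look like the sketch below (run_simulation is a hypothetical stand-in, not the real fwdpp + msprime driver):

```python
# Hypothetical sweep over simplification (gc) intervals; run_simulation
# is a placeholder for whatever actually drives the forward simulation.
import time

def run_simulation(simplify_interval):
    """Stand-in for the real forward simulation run."""
    time.sleep(0.01)

for interval in (100, 500, 1000, 5000, 10000):
    start = time.perf_counter()
    run_simulation(simplify_interval=interval)
    elapsed = time.perf_counter() - start
    print(f"simplify_interval={interval}: {elapsed:.2f} s")
```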
I just realized something. I'm getting total runtime externally, via /usr/bin/time. The reason is that a job can be "done", but still take a lot of time to quit due to having to deallocate memory. It can actually be quite significant!
Thoughts?
/usr/bin/time is very handy for getting the total memory usage of a process, so I tend to use it for benchmarking purposes as well. I'd be surprised if the process teardown time was significant on the timescales that we're working on though.
Out of interest, are we tracking memory usage here?
Yeah--I've had it be a few minutes, though!
I have been, via commands like /usr/bin/time -f "%e %M" -o time.out.
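Reading that output back in is trivial; a sketch, assuming GNU time's format codes (%e is wall-clock seconds, %M is max RSS in kilobytes):

```python
# Sketch: parse the "/usr/bin/time -f '%e %M' -o time.out" output.
with open("time.out") as f:
    elapsed_s, max_rss_kb = f.read().split()

print(f"wall clock: {float(elapsed_s):.1f} s, "
      f"peak memory: {int(max_rss_kb) / 1024:.1f} MB")
```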
Wow, that is surprising! What was happening here? I thought that any unfreed memory from malloc was just handed straight back to the OS with basically zero overhead.
When C++ objects go out of scope, their destructors are called; that is how the language helps you guarantee leak-free code. When a forward sim is "big", there are lots of gametes, mutations, and diploids, each with their own destructor. The gametes themselves are containers whose memory must be freed (deleted, rather, as C++ allocates with new and frees with delete). That can add up to tens of millions of calls in that case.
In a command-line program, you can manually call std::exit instead of returning from main, and all of that is skipped. Obviously, that's not OK in general, e.g., when running inside the Python interpreter or alongside a separate C++ thread.
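The same effect is easy to see from Python, for what it's worth; a toy sketch (the 10-million-object count is arbitrary, and needs a couple of GB of RAM):

```python
# Toy demonstration of teardown cost: freeing millions of small
# containers one by one takes measurable time, analogous to the
# per-object delete calls in a big C++ forward sim.
import time

data = [[i] for i in range(10_000_000)]  # ~10M small lists

start = time.perf_counter()
del data  # CPython frees each inner list individually here
print(f"deallocation took {time.perf_counter() - start:.2f} s")

# os._exit(0) would terminate immediately, skipping all cleanup:
# the Python analogue of calling std::exit from a C++ main().
```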
Ah, that makes sense. Thanks for explaining.
I'd used /usr/bin/time in initial investigations too, but it sounds like it should stay.
This looks good. Is there an easy way to combine this output with the parameters and more detailed timing that we're writing from the script itself?
I think I'm going to write the internal timing details to a csv file, save the output of /usr/bin/time separately, and then have a script put it all together so the tidyverse can save us.
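Since pandas came up earlier, that glue script could be as simple as the following sketch (the file names and the shared run_id column are assumptions about how the pieces get labeled):

```python
# Sketch of the combining step: join the internal timing csv with
# the externally collected /usr/bin/time numbers on a shared run id.
import pandas as pd

internal = pd.read_csv("results.csv")         # N, theta, rho, timings, run_id
external = pd.read_csv("external_times.csv")  # run_id, wall_s, max_rss_kb

combined = internal.merge(external, on="run_id", how="left")
combined.to_csv("combined_results.csv", index=False)
```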
Need a format for total timing.