microbiome / benchmarking

Benchmarking miaverse performance
Artistic License 2.0

Antagomir melt #3

Closed antagomir closed 3 years ago

antagomir commented 3 years ago

I tested the melt benchmarks and modified them a bit. This should be fine to merge and explore further.

More importantly, I have the following suggestions to proceed.

1) Let us first debug things quickly by limiting ourselves to data sets with smaller sample sizes, and only add larger data at the end when everything is clear. Also, let us focus on a single test first to make sure we know how to do this. Then it will be more straightforward to do the other tests. We can start with the melt example.

2) Note the changes that I made to the data chunk in melt_benchmark -> move these changes to data.R and apply them everywhere. We must have unique data set names as one field in the data frame, not non-unique data set names as rownames.

3) Question: why do some entries in df$Samples have the value NA? Ideally, such cases would already be cleared up when collecting the execution time data.

4) Can we next include the results from multiple sample sizes in df? One column would indicate which sample size was used. You can of course first generate the data.frames for each fixed N, then merge the rows to get one big data.frame that includes experiments from all the different sample sizes -> This will also allow us to visualize the effects of varying sample sizes.

5) There is now a lot of variation and no systematic trend; this may depend on data specifics. Could we get a similar table of running times within each data set? For instance, by running melt for different taxonomic levels? Each taxonomic level has a different number of features; the splitByRanks function gives abundance tables for all ranks -> We can do this first for a single data set to keep things simple, and then add more data sets once everything is set up successfully.

6) Include only cases where AssayValues=="counts"; anything else complicates things too much and adds no real value.

7) It can help to design the running time tests as functions (data set/s as input, running time/s as output); this can make the code more readable and easier to manage.
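
To make suggestions 2, 4, 6 and 7 concrete, here is a rough base R sketch of how the timing code could be organized. It is only an illustration under stated assumptions: `melt_counts()` is a hypothetical stand-in for the real melt call in melt_benchmark, `make_counts()` simulates data instead of loading the actual data sets, and the column names `Dataset`, `Features` and `Time` are made up here (only `Samples` and `AssayValues` appear in the discussion above).

```r
# Hypothetical stand-in for the real melt step: converts a
# features x samples count matrix into a long-format data.frame.
melt_counts <- function(counts) {
  data.frame(
    feature = rep(rownames(counts), times = ncol(counts)),
    sample  = rep(colnames(counts), each = nrow(counts)),
    value   = as.vector(counts)
  )
}

# Hypothetical helper: simulated counts in place of the real data sets.
make_counts <- function(n_features, n_samples) {
  m <- matrix(rpois(n_features * n_samples, lambda = 5), nrow = n_features)
  dimnames(m) <- list(paste0("f", seq_len(n_features)),
                      paste0("s", seq_len(n_samples)))
  m
}

# Suggestion 7: data set in, one row of timing results out.
# Suggestion 2: the data set name is a regular column, not a rowname.
time_melt <- function(counts, dataset, assay_values = "counts") {
  elapsed <- system.time(melt_counts(counts))[["elapsed"]]
  data.frame(
    Dataset     = dataset,
    AssayValues = assay_values,
    Features    = nrow(counts),
    Samples     = ncol(counts),
    Time        = elapsed
  )
}

# Suggestion 4: one data.frame per fixed N, then bind the rows together.
set.seed(1)
sample_sizes <- c(10, 50, 100)
runs <- lapply(sample_sizes, function(n) {
  time_melt(make_counts(200, n), dataset = paste0("simulated_N", n))
})
df <- do.call(rbind, runs)

# Suggestion 6 (and the NA question in 3): keep only count assays
# with a known sample size.
df <- df[df$AssayValues == "counts" & !is.na(df$Samples), ]
print(df)
```

Because `time_melt()` always returns one row with the same columns, running it over taxonomic ranks (suggestion 5) would presumably only need a different list of input tables, for example the per-rank abundance tables from splitByRanks, with the rest of the code unchanged.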