ktakagaki / hayabaya

Hierarchical tree structure for folders saving results #22

Open ghost opened 8 years ago

ghost commented 8 years ago

We discussed reordering the structure of the results folder when saving the results from Hayabaya. The intention was to enable running the experiment multiple times without risking any files being overwritten.

What organization of the tree hierarchy do you prefer?

Suggestions: ( "/" denotes a new level of folders, "<", ">" used for meta-naming)

/ /
ktakagaki commented 8 years ago

That looks fine, except perhaps the last folder... I personally think

results/Intel-i7/05-15-14-34-22/Integer_Boxed_MULTIPLY_04.csv

might be better, because I can't think of a situation where you wouldn't load all of the types at once (i.e. results/Intel-i7/05-15-14-34-22/*.csv). But that's a matter of taste, it's up to you since you will do most of the graphing.
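For illustration, a path of that shape could be assembled roughly as follows. This is only a sketch: the object name `ResultPaths`, the parameter names, the reading of the trailing `04` as a zero-padded index, and the `MM-dd-HH-mm-ss` timestamp pattern are all assumptions inferred from the example path, not something taken from the Hayabaya code.

```scala
import java.nio.file.{Path, Paths}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object ResultPaths {
  // Assumed pattern (month-day-hour-minute-second), inferred from "05-15-14-34-22".
  private val stampFormat = DateTimeFormatter.ofPattern("MM-dd-HH-mm-ss")

  // Builds results/<machine>/<timestamp>/<type>_<boxing>_<operation>_<index>.csv.
  // All parameter names are hypothetical; the index is zero-padded to two digits.
  def resultFile(machine: String, startedAt: LocalDateTime, dataType: String,
                 boxing: String, operation: String, index: Int): Path = {
    val fileName = f"${dataType}_${boxing}_${operation}_$index%02d.csv"
    Paths.get("results", machine, startedAt.format(stampFormat), fileName)
  }
}
```

Because every file of one run lands in the same timestamp folder, a single glob over that folder picks up all types at once.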

ghost commented 8 years ago

So a double redundancy, with the entire folder hierarchy "encoding" all of the information, PLUS the filename encoding all of the information?

Sure I can do it like that.

The only thing I consider truly important is that all of the information is encoded within the csv file itself. This is exactly why I have some additional columns that might appear slightly redundant. They make sure the tables conform to the rules of relational algebra: each row carries a composite primary key that is unique among the potentially millions of rows that will be generated.

It also makes processing and "grouping" the data easier in R. With these additional columns it's possible to do in 1-2 lines what would take God knows how many lines, perhaps 100 lines in Python or Matlab.
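To make the composite key idea concrete, here is a minimal sketch of a uniqueness check. The `Row` layout and its column names are hypothetical, chosen only for illustration; the actual Hayabaya columns may differ.

```scala
object CompositeKeyCheck {
  // Hypothetical row layout; the real Hayabaya columns may differ.
  final case class Row(dataType: String, boxing: String, operation: String,
                       repetition: Int, nanos: Long)

  // The composite key is every identifying column, i.e. everything except the measurement.
  def key(r: Row): (String, String, String, Int) =
    (r.dataType, r.boxing, r.operation, r.repetition)

  // Returns the keys that occur more than once; an empty set means the key is unique.
  def duplicateKeys(rows: Seq[Row]): Set[(String, String, String, Int)] =
    rows.groupBy(key).collect { case (k, rs) if rs.size > 1 => k }.toSet
}
```

If `duplicateKeys` comes back empty, the identifying columns together behave like a primary key, which is what makes the grouping on the R side safe.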

Please see tutanota e-mail regarding Friday.

I'm reading the Scalastyle repository source code as it's relevant to the CLI parsing and it's a well written project. https://github.com/scalastyle/scalastyle

Do you know of any other small Scala projects that I can read as a beginner?

I have said it before, but to make sure it's explicit: I know most of the "components" of Scala, but when it comes to putting it all together I'm really having a hard time. This is only transient and will disappear as I continue writing more code, just like when I started with R and later Java. But in the meantime, having good repositories to read can be a TRULY invaluable blessing. So if you know of any reasonably sized projects besides Scalastyle that I can check out, please let me know!

@ktakagaki

ktakagaki commented 8 years ago

Haven't done much reading myself, but these might be to-the-point and not too daunting: https://github.com/garyKeorkunian/squants https://github.com/non/spire

ghost commented 8 years ago

@ktakagaki I recently learned how to "easily" call Git from inside Scala using the ProcessBuilder support in scala.sys.process.

Here is a small example that prints the SHA-1 hash of the current Git HEAD when run (assuming Git is installed on the host and is on the system's PATH, as it always should be):

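package de.lin.hayabaya.playground

import sys.process._

object SystemCall {

  // Runs `ls -al` and returns its standard output as one string.
  def printLines(): String = "ls -al".!!

  // Returns the SHA-1 hash of the current Git HEAD.
  // `!!` already returns a String and throws if the command exits with a non-zero code;
  // trim removes the trailing newline that git prints.
  def printHash(): String = "git rev-parse HEAD".!!.trim

  def main(args: Array[String]): Unit = {

    println("Hello")

    val hash = SystemCall.printHash()

    println("The git hash is: " + hash)
  }

}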

The resulting output is simply

Hello
The git hash is: 7984a34f61eedfbc6bc0c70e7346fb0de72ed053

New suggested approach using git

Therefore, based on this approach, I suggest we simplify and flatten the previously drafted hierarchy for file output by exploiting Git SHA-1 hashes.

So we go back to just outputting files into a /results folder. Each time Hayabaya is run, it proceeds roughly like this:

  1. Run all of the profiling tests, operations, etc.
  2. Store the results
  3. Get the first 5 characters of the current SHA-1 hash for HEAD
    • Add a column to the csv file with this 5-character string
  4. Output the csv file into /results, e.g. /results/b0de7-results.txt
  5. Do a git commit with a default constructed message so that the current Git hash changes and files are never overwritten on the next run

So all of the different operations, data types, etc. are stored in one csv file, plus a new column containing the 5-character string, which serves as a unique group/run ID later on when loading multiple csv files.
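Steps 3 to 5 could look roughly like the sketch below. It is not the Hayabaya implementation: the object name `HashTaggedResults`, the column name `run_id`, and the commit message are made up for illustration, and it assumes the results folder is tracked by Git (not ignored) and that the program runs inside the repository's working tree.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}
import scala.sys.process._

object HashTaggedResults {
  // First five characters of a full SHA-1 hash, as proposed in step 3.
  def shortHash(fullHash: String): String = fullHash.trim.take(5)

  // Appends the run ID as an extra column: the header gets the column name,
  // every data row gets the value. Assumes a plain CSV with one header line.
  def withRunId(csvLines: Seq[String], runId: String): Seq[String] =
    csvLines match {
      case header +: rows => (header + ",run_id") +: rows.map(_ + "," + runId)
      case _              => csvLines
    }

  // Steps 3 to 5: read HEAD, write the tagged file, then commit so that HEAD changes.
  def store(csvLines: Seq[String], resultsDir: Path = Paths.get("results")): Path = {
    val runId = shortHash("git rev-parse HEAD".!!)
    Files.createDirectories(resultsDir)
    val out = resultsDir.resolve(s"$runId-results.txt")
    val content = withRunId(csvLines, runId).mkString("\n")
    Files.write(out, content.getBytes(StandardCharsets.UTF_8))
    // The Seq form passes each argument as-is, avoiding quoting issues in the message.
    Seq("git", "add", out.toString).!!
    Seq("git", "commit", "-m", s"Add results for run $runId").!!
    out
  }
}
```

Keeping `shortHash` and `withRunId` free of side effects means they can be checked without touching a repository; only `store` actually shells out to Git.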

Advantages to this approach

  1. Files will never be accidentally overwritten
  2. We can identify at what point of the hayabaya development a result was created
  3. A flat and simple output file hierarchy
  4. (my favorite) makes it much easier to analyze the results by groups, runs, repetitions etc. in R as the sha1 hash column in the csv file creates a unique group/run ID (think SQL tables here) aka a primary key
  5. Will implicitly protect against several pitfalls/bugs in R when analyzing the results
  6. Reduced size of the Hayabaya.jar file, with no JGit dependency