sosy-lab / benchexec

BenchExec: A Framework for Reliable Benchmarking and Resource Measurement
Apache License 2.0

Improved CSV export - feedback welcome #900

Open PhilippWendler opened 1 year ago

PhilippWendler commented 1 year ago

The CSV tables exported by table-generator have a layout that is inspired by the HTML tables, but this sometimes makes them hard to use programmatically in other tools. We should improve this.

Open points:

In general, there is a trade-off between tables that always have exactly the same format (all task-id columns, header content with full information), even where this is redundant or not applicable, and tables that are tailored to the specific use case (keeping column names short and easy to handle when they are unique anyway, hiding the expected verdict if it is empty, etc.). The latter can be much more convenient in many use cases, but is more difficult to use when data from lots of different scenarios are combined.

Maybe we also need to add some options to the table definitions to make it possible for users to choose among them (e.g., which columns should be shown for the task id).

Any feedback, whether about the general goal or about concrete ideas, is highly welcome! @s-winter ping

Po-Chun-Chien commented 1 year ago

I would vote for using tab as the separator, but changing the file extension to .tsv. I was confused the first time I tried to parse the file.

PhilippWendler commented 1 year ago

1. Header lines: It's a good idea to reduce the header to one line to make it more compact. Regarding the content of header cells, you could use concatenation of run set name and column name to make it unique, and keep the timestamp optional. For example, "runSetName_status" or "runSetName_columnName".

Yes, this is the idea mentioned in the original post, but it has the disadvantage that it would make the column names really long and complex to use. For example, they would need to include a timestamp, and thus, after importing into some third-party software, you would have to use these long and unique column names instead of, for example, just cputime. So we are looking for arguments for and against each of these possible choices.
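To make the naming trade-off concrete, here is a minimal Python sketch of one possible compromise: keep the short column name when it is already unique, and fall back to the "runSetName_columnName" form only when two run sets share a column name. The function name `unique_column_names` and the example run-set names are hypothetical and not part of table-generator.

```python
from collections import Counter


def unique_column_names(columns):
    """Return one header name per (run_set_name, column_name) pair.

    Hypothetical helper: a column keeps its short name if no other run set
    uses the same name; otherwise it is prefixed with its run-set name.
    """
    counts = Counter(name for _, name in columns)
    return [
        name if counts[name] == 1 else f"{run_set}_{name}"
        for run_set, name in columns
    ]


# Made-up run sets "toolA" and "toolB" for illustration only.
headers = unique_column_names(
    [
        ("toolA", "status"),
        ("toolA", "cputime"),
        ("toolB", "status"),
        ("toolB", "walltime"),
    ]
)
```

Note that this sketch still produces duplicates if two run sets have the same name, which is exactly the case where something like a timestamp would be needed. It also means the header of a given column depends on which other columns are present, which is the downside of tailored tables when combining data from many scenarios.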

2. Separator: Since comma appears regularly in some columns, it might be better to switch to tab-separated values (TSV) instead of comma-separated values (CSV). However, it's important to inform users about this change and explain the TSV abbreviation.

Note that we already use tabs in our current "CSV" format. Notifying users is not difficult: we can likely add the new format as an option alongside the existing format and then remove the previous format in a new major version.
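For readers who run into the same confusion when parsing, a tab-separated file can be read with Python's standard `csv` module by setting the delimiter explicitly. The following is only a sketch: the sample header and values are made up for illustration and do not reflect the actual layout that table-generator produces.

```python
import csv
import io

# Hypothetical excerpt of a tab-separated export. Column names and values are
# invented for this example; note the comma inside the second status value.
SAMPLE = (
    "task\tstatus\tcputime (s)\n"
    "a.c\ttrue\t1.5\n"
    "b.c\tfalse(unreach-call), partial\t12.25\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by header name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tsv(SAMPLE)
```

Because the delimiter is a tab, the comma inside a cell needs no quoting and stays part of the value.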

3. Task-ID columns: Instead of showing only those columns where not all values are equal, it might be better to show all columns for which data exists. This way, users can easily identify the relevant columns.

Hm, I am not sure I follow this argument. How would it make it easy to identify the relevant columns, if all columns are shown?
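The two selection rules discussed in point 3 can be compared side by side with a small Python sketch. The function names and the task-id column names ("name", "property", "expected") are assumptions chosen for illustration, not the actual table-generator implementation.

```python
def varying_columns(rows, id_columns):
    """Keep only task-id columns whose values differ between rows."""
    return [c for c in id_columns if len({row.get(c) for row in rows}) > 1]


def columns_with_data(rows, id_columns):
    """Keep every task-id column that has a non-empty value in some row."""
    return [c for c in id_columns if any(row.get(c) for row in rows)]


# Hypothetical task ids: same property everywhere, no expected verdict.
tasks = [
    {"name": "a.c", "property": "unreach-call", "expected": ""},
    {"name": "b.c", "property": "unreach-call", "expected": ""},
]
id_columns = ["name", "property", "expected"]

only_varying = varying_columns(tasks, id_columns)
with_data = columns_with_data(tasks, id_columns)
```

With this input, the first rule keeps only the column that distinguishes the tasks, while the second also keeps the constant "property" column. In both cases the set of columns depends on the data, so neither rule yields a fixed layout across tables.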

4. Table options: It would be helpful to add some options to the table definitions so that users can customize the table layout based on their specific use case. For example, users could choose which columns should be shown for the task ID.

Note that we do have this feature already (cf. documentation).

DrMichaelPetter commented 7 months ago

How about keeping a raw/master TSV around and some postprocessing scripts based on CLI tools like sed/head/tail/grep and csvkit?

PhilippWendler commented 7 months ago

The raw/master files that BenchExec uses are the result XML files. We cannot use CSV/TSV for this, because these files contain important meta information about the whole benchmark run, which we need to keep together with the measurement data. This is important for example for creating the HTML tables (which contain both) and also makes archiving results easier.
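As a rough illustration of why a flat table cannot replace the XML files, the sketch below splits a result document into run-level metadata and per-run measurement rows. The XML here is a simplified, hypothetical stand-in: the element and attribute names are assumptions for this example and should not be read as the actual BenchExec result schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical result document (NOT the real BenchExec schema).
SAMPLE_XML = """\
<result tool="ExampleTool" version="1.0" timelimit="900s">
  <run name="a.c">
    <column title="status" value="true"/>
    <column title="cputime" value="1.5s"/>
  </run>
  <run name="b.c">
    <column title="status" value="false"/>
    <column title="cputime" value="12.25s"/>
  </run>
</result>
"""


def split_result(xml_text):
    """Return (metadata, rows): root attributes and one dict per run."""
    root = ET.fromstring(xml_text)
    metadata = dict(root.attrib)  # information about the whole benchmark run
    rows = []
    for run in root.iter("run"):
        row = {"name": run.get("name")}
        for column in run.iter("column"):
            row[column.get("title")] = column.get("value")
        rows.append(row)
    return metadata, rows


metadata, rows = split_result(SAMPLE_XML)
```

Only `rows` maps naturally onto a CSV/TSV table; `metadata` has no obvious place in such a file, which matches the reason given above for keeping XML as the master format.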