sosy-lab / benchexec

BenchExec: A Framework for Reliable Benchmarking and Resource Measurement
Apache License 2.0

Discussion: Qualitative properties and extended task definitions #994

Closed incaseoftrouble closed 6 months ago

incaseoftrouble commented 6 months ago

EDIT: I changed the issue's title since in the discussion it became apparent that some of the raised points below are not valid or moved to a separate issue.


This might be wishing for (too) much, but I think BenchExec would profit quite a lot from being less entangled with the specifics of SV-COMP.

For example, result.py makes strong assumptions about the nature of reference results and output constants: The result categories are effectively hard-coded and expected results assume that there is a (small) finite list of possible results. This is fine for qualitative problems (yes/no questions), but makes it hard to use for quantitative tools, especially with approximation. Concretely, the "correct" result might be a float between 0 and 1 and any result within 10^-6 of the exact value is considered correct.

An abstract idea would be to provide, just as with tool-info, a "result-handler" definition which handles the "interpretation" of results. These could then be referenced by name in the benchmark definitions. So something like <handler><name>svcomp</name></handler> or <handler><name>approximate</name><precision>1e-6</precision></handler> (all non-name fields would be passed to the __init__ as kwargs to the handler). Same as with tool info, users should then be able to register their own handlers and a set of standard handlers could be provided. For backwards compatibility, if that field is not specified, the standard svcomp is used. (I think it is fine to require that a correct / incorrect / unknown verdict can be given in the end.)
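
To make this a bit more concrete, here is a minimal sketch of what such a registry and an "approximate" handler could look like (purely hypothetical, nothing of this exists in benchexec; all names are placeholders):

class ResultHandler:
    """Interprets a tool's raw result against an expected/reference result."""

    def interpret(self, raw_result, expected):
        raise NotImplementedError


class ApproximateHandler(ResultHandler):
    """Accepts any numeric result within the given precision of the expected value."""

    def __init__(self, precision="1e-6"):
        # all non-name fields from the <handler> element arrive as string kwargs
        self.precision = float(precision)

    def interpret(self, raw_result, expected):
        try:
            value = float(raw_result)
        except (TypeError, ValueError):
            return "error"
        return "correct" if abs(value - float(expected)) <= self.precision else "incorrect"


# handlers would be looked up by the name given in the benchmark definition, e.g.
# <handler><name>approximate</name><precision>1e-6</precision></handler>
HANDLERS = {"approximate": ApproximateHandler}


def create_handler(name, **kwargs):
    return HANDLERS[name](**kwargs)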

On a related note, I think that the result handler should also specify what the result looks like. It could be a single value, but also a dictionary of multiple values + internal statistics. I believe that in general determine_result should not return a single string but a ToolResult object, which by default contains the status string, but potentially some further information (e.g. detailed reason for an error or statistics). (Note: Backwards compatibility is easy to ensure here). This would connect quite nicely to table generation: Result handlers could also provide additional columns for the result table. As a concrete example, I am dealing with finite state models and a useful number is how many states have been constructed and / or how many iterations have been applied etc. It feels likely that it should even be possible to automatically let these custom columns flow into the plots, depending on their type. (Here, it might be good to allow for non-numeric columns, e.g. it might be nice to see any warnings that a tool produced in the table.)
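
As an illustration of the ToolResult idea (again hypothetical, not current benchexec API), such an object could be as small as:

from dataclasses import dataclass, field


@dataclass
class ToolResult:
    """Hypothetical richer return value for determine_result (names are placeholders)."""

    status: str                                   # the usual status string
    details: str = ""                             # e.g. a detailed reason for an error
    values: dict = field(default_factory=dict)    # extra columns, e.g. {"states": 3478}

    def __str__(self):
        # existing code that expects a plain status string keeps working
        return self.status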

I find benchexec a very very useful project, however using it for my cases felt like a hack because of these underlying assumptions. I want to push benchexec as the default for upcoming iterations of QComp (part of TOOLympics :-) ), and this is a major point (together with #993, but that one is less important)

I realize that this is a lot of (vague) ideas packed into one issue. However, I think these points are somewhat interlinked, so it makes sense to first view them as one block. I think it should be possible to break this up into several smaller pieces once the picture becomes clearer. I am very happy to discuss details and/or contribute code once we settle on something!

EDIT: One more thing (which is mostly independent, but I think it belongs to this general discussion): The task-definition files also introduce some assumptions about the structure of benchmarks. For example, PRISM input definitions are (i) a model file, (ii) a list of string key-value pairs as "model constants", (iii) a property file, and (iv) a property reference. I think it should be reasonably easy to also register a "run builder" or similar, i.e. a way to construct the list of runs from a given file.

Concretely, before these lines:

https://github.com/sosy-lab/benchexec/blob/a0718af3e4494ed0c86aa47a1b85deaf2f444646/benchexec/model.py#L619-L640

it should be possible to add a type tag to the definition (e.g. sourcefilesTag.get("type")), look in the registry whether a handler of this type is known, and then call create_run_from_definition(...). TaskDefinitionHandler would then be one of the instances of such a handler (backwards compatibility can again be ensured).
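
A rough sketch of the dispatch I have in mind (hypothetical, names invented; the current behavior would remain the fallback):

# Hypothetical dispatch on a "type" attribute of the <tasks> tag (not existing benchexec code).
RUN_BUILDERS = {}  # e.g. {"prism": PrismRunBuilder}, filled by registration


def register_run_builder(type_name, builder_cls):
    RUN_BUILDERS[type_name] = builder_cls


def create_runs(sourcefiles_tag, default_builder, **context):
    """Pick a run builder via the tag's "type" attribute, falling back to the current behavior."""
    builder_cls = RUN_BUILDERS.get(sourcefiles_tag.get("type"), default_builder)
    return builder_cls().create_run_from_definition(sourcefiles_tag, **context)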

PhilippWendler commented 6 months ago

Thank you for contacting us! I am glad to hear that you like BenchExec, and we are always happy when competitions adopt BenchExec and try to help.

I realize that this is a lot of (vague) ideas packed into one issue. However, I think these points are somewhat interlinked, so it makes sense to first view them as one block. I think it should be possible to break this up into several smaller pieces once the picture becomes clearer. I am very happy to discuss details and/or contribute code once we settle on something!

Yes, from BenchExec's point of view these are mostly independent, and actually much is already possible. So I will respond below and try to clarify the overview and show what is already possible, such that you can then see where further discussion is necessary. Then separate issues for the side topics would be a good plan.

What you interpret as a dependency on SV-COMP should be only a visibility issue. BenchExec definitely shouldn't have a dependency on SV-COMP, as we want it to be a general benchmarking utility, and I believe this is already the case. It is possible to run BenchExec without any connection to SV-COMP or a similar style of tasks (for example, property files, scoring and result classification inside BenchExec are all optional). It is just that features like score computations happen to be implemented only for the use case of SV-COMP so far. But we have for example tool-info modules for SMT solvers even here in the repository.

You can in principle even let BenchExec do the benchmarking and handle scoring etc. outside of it (Test-Comp does this to a large degree, for example), but it would be a welcome addition to have these features inside BenchExec also for other contexts like QComp, and I would be glad to see this happen.

Another general comment: You are suggesting the possibility to add more user-defined handlers in several places. I am a little bit hesitant about going that far, because defining a stable and useful API for such things is difficult. So my generally preferred strategy would be to instead add the requested support for e.g. QComp directly to BenchExec, like we have it for SV-COMP, and not as a replaceable module. (If in the future more and more of these use cases appear and flexibility becomes more important, we can still reconsider this and then define an API based on the more extensive experience and knowledge we will have at that point.)

For distinguishing between usage in the different contexts, I suggest we continue to use the property files, then users don't need to manually specify this. This is already how BenchExec treats SV-COMP: If you provide an SV-COMP property file, then BenchExec will apply scoring as in SV-COMP, otherwise it doesn't compute scores. As you seem to also have property files in QComp, this seems like a good fit for QComp as well.


For example, result.py makes strong assumptions about the nature of reference results and output constants: The result categories are effectively hard-coded and expected results assume that there is a (small) finite list of possible results. This is fine for qualitative problems (yes/no questions), but makes it hard to use for quantitative tools, especially with approximation. Concretely, the "correct" result might be a float between 0 and 1 and any result within 10^-6 of the exact value is considered correct.

Yes, result classification and scoring for quantitative results are currently not implemented. But I don't see any blockers for adding what you describe here. At least as long as the result classification still results in something that can be mapped to correct/wrong/unknown/error plus an optional numeric score, changes should be relatively few and local. But we also already have a precedent for an extension of this result classification in SV-COMP.

The expected result could be represented together with the precision in the task-definition file, either as something like expected_verdict: 0.42 ±1e-6 or with an additional key.
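
Either representation should be easy to read; a small sketch of the parsing, assuming the expected value and an absolute tolerance are what we need in the end (the precision key name is made up):

import re


def parse_expected(task_def):
    """Return (expected value, absolute tolerance) from a task-definition dict."""
    verdict = task_def["expected_verdict"]
    if isinstance(verdict, str) and "±" in verdict:
        value, tolerance = (float(part) for part in re.split(r"\s*±\s*", verdict))
    else:
        value = float(verdict)
        tolerance = float(task_def.get("precision", 0.0))  # "precision" is an invented key name
    return value, tolerance


value, tolerance = parse_expected({"expected_verdict": "0.42 ±1e-6"})
assert tolerance == 1e-6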


On a related note, I think that the result handler should also specify what the result looks like. It could be a single value, but also a dictionary of multiple values + internal statistics. I believe that in general determine_result should not return a single string but a ToolResult object, which by default contains the status string, but potentially some further information (e.g. detailed reason for an error or statistics). (Note: Backwards compatibility is easy to ensure here). This would connect quite nicely to table generation: Result handlers could also provide additional columns for the result table. As a concrete example, I am dealing with finite state models and a useful number is how many states have been constructed and / or how many iterations have been applied etc.

I believe everything that you would like to do here is already possible. For a detailed error reason, the typical convention is to append it (in parentheses) to the result in determine_result (e.g., return ERROR (parsing failed) as the result). BenchExec treats this (like any result string that it does not know) as one indicating an error result and you can nicely filter for it in the tables.

Additional columns, e.g., with statistics, are also possible, as long as they can be extracted from the tool's output. Their handling is just not mixed with the result handling in determine_result but done separately in get_value_from_output.
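
For illustration, a stripped-down tool-info module in the style of the current BaseTool2 interface that does both; the tool name, the Result:/statistics lines, and the column identifier are of course made up:

import benchexec.result as result
import benchexec.tools.template


class Tool(benchexec.tools.template.BaseTool2):
    """Minimal tool-info sketch; tool name and output patterns are placeholders."""

    def executable(self, tool_locator):
        return tool_locator.find_executable("mytool")

    def name(self):
        return "MyTool"

    def determine_result(self, run):
        for line in run.output:
            if line.startswith("Result:"):
                return line.split(":", 1)[1].strip()
        # unknown strings count as errors; appending a reason in parentheses is the convention
        return result.RESULT_ERROR + " (no result line)"

    def get_value_from_output(self, output, identifier):
        # called for each <column> requested in the benchmark definition
        for line in output:
            if line.startswith(identifier + ":"):
                return line.split(":", 1)[1].strip()
        return None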

It feels likely that it should even be possible to automatically let these custom columns flow into the plots, depending on their type.

Yes, BenchExec already treats all numeric columns in the same way, no matter whether they contain data created by BenchExec or not, so you immediately get statistics and plots.

(Here, it might be good to allow for non-numeric columns, e.g. it might be nice to see any warnings that a tool produced in the table.)

Arbitrary non-numeric columns are possible; BenchExec will simply treat any column that does not only contain numbers as textual and skip it in plots and statistics.

An example for all this is this table from CPAchecker's regression tests. It has arbitrary error results and further textual and numeric columns, the latter even with and without units.


The task-definition files also introduce some assumptions about the structure of benchmarks. For example, PRISM input definitions are (i) a model file, (ii) a list of string key-value pairs as "model constants", (iii) a property file, and (iv) a property reference.

These seem to match relatively well with what the current format supports:


PS: If you happen to visit ETAPS this year, we should definitely meet there.

incaseoftrouble commented 6 months ago

and actually much is already possible.

I was also thinking this, but I also think right now a few things require expert knowledge / modifying the sources, which I would want to avoid. (See the point on handlers.)

for example, property files, scoring and result classification inside BenchExec are all optional

Indeed! But having the benchmarking tool also directly produce plots (without needing to implement parsing on one's own) is suuuuper nice for non-expert users. So having the possibility to easily specify scoring etc. in custom code would significantly increase user-friendliness (I think).

You are suggesting the possibility to add more user-defined handlers in several places. I am a little bit hesitant about going that far. (...)

I agree with the sentiment; I wouldn't want to go so far in one step! What I wanted to propose is that one can add several handlers only internally (i.e. with a static mapping etc.). So, if someone wants to add their own, they would modify the source and not be guaranteed to have forward compatibility. But there would be a concrete region to point them to (as in "if you want to have your own, provide a subclass of this class and put it into this lookup dictionary"). By internally disentangling things, one would get already a better view of what is required.

On a related note: I definitely agree with the visibility issue; when I wanted to use it, I was a bit deterred by this apparent alignment with SV-COMP and felt like I had to hack / mis-use things for my application. Shifting the perspective from "... and this is used for SV-COMP" to "... this is modularized and we provide the module used for SV-COMP by default" much more suggests "I can use this for anything" to a standard user. Especially if the names are generic, in the sense that there wouldn't be an "svcomp" and a "qcomp" handler, but an "equality" and an "approximate" handler or something like that (which could then still be subclassed to meet specific demands).

Ideally, it should eventually be possible for users to say "install standard benchexec and run it with benchexec --include <handlers definition> <rest of the command>". (But that is only relevant if benchexec is already used in numerous contexts, and something for the future, if at all.)

As you seem to also have property files in QComp, this seems like a good fit for QComp as well.

But they can have different formats (there is no fixed file-ending) etc., so it is difficult to auto-detect them. Also, some things are specified not through property files, but directly via command line, and how to do that differs from tool to tool. It's very messy and non-standardized :-) (which we try to tackle at a Dagstuhl meeting soon, but the mess will remain for quite some time).

I would be in favor of providing a way to clearly specify it. There can be a best-effort guessing mechanism in case users do not specify it, but if I want to avoid "magic" I would like to have the option to specify what happens.

The expected result could be represented together with the precision in the task-definition file, either as something like expected_verdict: 0.42 ±1e-6 or with an additional key.

Definitely in favor of additional keys (in general, I am against magic inside strings :) ). For example, we also would want to specify relative or absolute error, or might consider statistical methods (which take two parameters). Just seems more extensible this way :)

BenchExec treats this (like any result string that it does not know) as one indicating an error result and you can nicely filter for it in the tables.

I've seen this, but it seems a bit "magic". Migrating this to objects would feel cleaner and easier to manage/maintain I think (this is one of the cases where I would be happy to provide a draft to discuss)

Additional columns, e.g., with statistics, are also possible, as long as they can be extracted from the tool's output. Their handling is just not mixed with the result handling in determine_result but done separately in get_value_from_output.

Ah! I see. But this would mean parsing the tool's output n times, right? (Not that it usually matters)

Arbitrary non-numeric columns are possible; BenchExec will simply treat any column that does not only contain numbers as textual and skip it in plots and statistics.

Aha, the Column class has a number-of-digits field, which led me to believe it always expects a number.

Additional key-value pairs can be put into options. BenchExec ignores these but the benchmarked tool can get them as inputs. + Property reference is something I do not know, maybe it can be put into options or the format could be extended with an additional key.

It is do-able, but feels "hackish". Again, my proposal would be: Modularize the "file-to-run"-parsing for now (so we can easily add new specifications internally) without presenting this as stable API. Then, it already is much easier for power-users to modify the sources and we can gather some experience on this.

(On a related note: I dislike XML and prefer YML / JSON for such specifications so I'd also be willing (low priority) to write a separate file->tools+runs thing, but that could happen outside of the repo)

PS: If you happen to visit ETAPS this year, we should definitely meet there.

Unlikely, unfortunately (I didn't manage for TACAS in time), but I'm in Munich quite often :) So maybe sometime during my lecture break? (basically last week of March + first three weeks of April)

PhilippWendler commented 6 months ago

for example, property files, scoring and result classification inside BenchExec are all optional

Indeed! But having the benchmarking tool also directly produce plots (without needing to implement parsing on one's own) is suuuuper nice for non-expert users.

Sure! That's why we have it for SV-COMP.

You are suggesting the possibility to add more user-defined handlers in several places. I am a little bit hesitant about going that far. (...)

I agree with the sentiment; I wouldn't want to go so far in one step! What I wanted to propose is that one can add several handlers only internally (i.e. with a static mapping etc.). So, if someone wants to add their own, they would modify the source and not be guaranteed to have forward compatibility. But there would be a concrete region to point them to (as in "if you want to have your own, provide a subclass of this class and put it into this lookup dictionary"). By internally disentangling things, one would get already a better view of what is required.

Yes, this would come more or less automatically once we have two different result handlers. Likely we just create subclasses of Property.

On a related note: I definitely agree with the visibility issue; when I wanted to use it, I was a bit deterred by this apparent alignment with SV-COMP and felt like I had to hack / mis-use things for my application. Shifting the perspective from "... and this is used for SV-COMP" to "... this is modularized and we provide the module used for SV-COMP by default" much more suggests "I can use this for anything" to a standard user. Especially if the names are generic, in the sense that there wouldn't be an "svcomp" and a "qcomp" handler, but an "equality" and an "approximate" handler or something like that (which could then still be subclassed to meet specific demands).

If you have any concrete suggestions in this regard, I would like to tune the documentation and description of BenchExec. But currently SV-COMP is hardly mentioned at all in the BenchExec docs, so I don't know where I could improve that.

Also, some things are specified not through property files, but directly via command line, and how to do that differs from tool to tool.

This should be no problem. You need a separate benchmark definition for each tool anyway, and you can specify arbitrary command-line options in there for the tool. Only the things that are to be interpreted inside BenchExec, e.g., because they define the expected verdict or so, need to be in a fixed format.

I would be in favor of providing a way to clearly specify it. There can be a best-effort guessing mechanism in case users do not specify it, but if I want to avoid "magic" I would like to have the option to specify what happens.

If there is no way to auto-detect it, the property kind (or however we name it) can of course be added to the task-definition format.

BenchExec treats this (like any result string that it does not know) as one indicating an error result and you can nicely filter for it in the tables.

I've seen this, but it seems a bit "magic". Migrating this to objects would feel cleaner and easier to manage/maintain I think (this is one of the cases where I would be happy to provide a draft to discuss)

I am not sure what we would gain here, but feel free to make a concrete suggestion. In the end, the main use case (after classifying it into correct/wrong/etc.) of the result string is that it is displayed in the HTML tables, so a human-readable (and concise) representation in a string needs to remain. And the easier it is to write a tool-info module, the better it is. For example, I notice that many developers only do the bare minimum inside determine_result, so BenchExec improves the result of this in several cases (like replacing ERROR with a more informative string if we have more information).

Additional columns, e.g., with statistics, are also possible, as long as they can be extracted from the tool's output. Their handling is just not mixed with the result handling in determine_result but done separately in get_value_from_output.

Ah! I see. But this would mean parsing the tool's output n times, right? (Not that it usually matters)

Yes. In many cases, the tool outputs many values and only a few of them are to be put into the table, so I guess searching a few times for those few values is even faster than parsing everything properly and then selecting a few. But if this turns out to be a bottleneck, it would be easy to add a method that can return all values at once.

Additional key-value pairs can be put into options. BenchExec ignores these but the benchmarked tool can get them as inputs. + Property reference is something I do not know, maybe it can be put into options or the format could be extended with an additional key.

It is do-able, but feels "hackish". Again, my proposal would be: Modularize the "file-to-run"-parsing for now (so we can easily add new specifications internally) without presenting this as stable API. Then, it already is much easier for power-users to modify the sources and we can gather some experience on this.

We want to keep the existing task-definition format, just potentially enhance it to support more use cases. So there is just one format to parse, and an internal interface for which we have only a single implementation seems to just overcomplicate things.

(On a related note: I dislike XML and prefer YML / JSON for such specifications so I'd also be willing (low priority) to write a separate file->tools+runs thing, but that could happen outside of the repo)

I guess you mean generating the benchmark definition from a more high-level description? For something like this, please have a look at https://gitlab.com/sosy-lab/benchmarking/fm-tools, which aims at collecting information about tools and competition participation. SV-COMP's benchmark definitions will at one point also be generated from the information in this repo, but the project is not SV-COMP specific.

PS: If you happen to visit ETAPS this year, we should definitely meet there.

Unlikely, unfortunately (I didn't manage for TACAS in time), but I'm in Munich quite often :) So maybe sometime during my lecture break? (basically last week of March + first three weeks of April)

Yes, April outside of ETAPS time would be fine and it would be great if we can make that happen.

ga67vib commented 6 months ago

I just want to support Tobi's core request, i.e. make benchexec more easily usable in the probabilistic verification context. During the Dagstuhl meeting at the end of March (where both of us are attending), we will present this idea and thereby hopefully convince most of the tool developers in our field as well as the organizer of QComp to use it; further, this way we get additional input on what they consider useful/necessary in order to adapt benchexec.

PhilippWendler commented 6 months ago

Sure, this is great! Thanks for advertising BenchExec! I am eager to hear what is suggested there.

If any of the people there have more questions and would like to meet me in person, you can offer this for ETAPS. I will be there for the whole time, including the RRRR workshop.

ga67vib commented 6 months ago

Arnd Hartmanns (Main Organizer of QComp) is co-organizing RRRR and he will be at our Dagstuhl meeting.

incaseoftrouble commented 6 months ago

Ok, let me try to summarize what I think is the situation. Then we can derive actionable items.

Point 1: Scoring / results.py. We agree that this is tied to SV-COMP and could be generalized by providing a more flexible way of interpreting results. The logic would be that we have a "result interpreter" that asks the tool "please tell me what your result is" and then compares that to a provided reference result (if any). A result could be any object (a string, int, float, or something complex). The interpreter is expected to output a verdict (Correct, Incorrect, Unknown, Error, ...) + score + potentially additional information (that only the interpreter can know, not the tool, e.g. for approximation results how close the result actually is, which might be different from the score)


Point 2: Documentation. I didn't mean that SV-COMP is explicitly mentioned, but the workflow and design of specification files etc. has a certain "bias". This is something that will arise naturally. If I use benchexec and know that there is interest for it, I can organize my findings and eventually propose changes / extensions to the doc :)


Point 3: Generalizing verdicts from string to objects. I just think this is flexibility for free. Right now, ERROR(reason) is free text and consistency is only ensured by having one result interpretation. If I have a Result(object), ExecutionFailed(Result), and MemoryExceeded(ExecutionFailed) (names subject to change), it is clear what happened, I can attach memory exceeded-specific information to that error (bad example, since here it makes little sense), and it is clear what a second result interpreter should return. For a better example, when a tool fails because of an internal exception, I find it useful to categorize the error. For example, I might have a case where a tool fails because of an assertion error, and then it is useful to have an AssertionError(...) error type but still provide further details. But I think the main point here is to keep consistency between different result handlers.
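
To illustrate the kind of hierarchy I mean (entirely hypothetical, names subject to change):

class Result:
    """Hypothetical base class for interpreted results (names subject to change)."""

    def __init__(self, message=""):
        self.message = message


class ExecutionFailed(Result):
    pass


class MemoryExceeded(ExecutionFailed):
    def __init__(self, message="", peak_memory=None):
        super().__init__(message)
        self.peak_memory = peak_memory  # error-specific extra information


class ToolAssertionError(ExecutionFailed):
    """Tool crashed on an internal assertion; message carries the details.
    (Called AssertionError in the text above; renamed here to avoid shadowing the builtin.)
    """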


Point 4: Task-definition Format. This is, for example, one of the things that from the outside really feels very tailored towards SV-COMP / a particular style of parametrizing executions; especially its description: it talks about systems and specifications, the expected verdict is boolean, a subproperty is something involving a conjunction of properties, ..., and most of this doesn't make sense for a lot of tools that just want to execute something. And yes, you can press other applications into this format, but as an outsider it doesn't feel nice.

A concrete example: In QComp, having multiple input_files or multiple properties does not make so much sense. We rather have (model file, model options, property file) + list of property keys (something like a sub-property I guess?) together with expected value. For example, this is an excerpt from my QComp helper script definitions:

- model:
    class: prism
    const:
      MAX_COUNT: 2
      T: 12
    path: models/qvbs/ctmc/embedded/embedded.prism
    properties: models/qvbs/ctmc/embedded/embedded.props
    states: 3478
    type: ctmc
  queries:
    - spec:
        expected: 0.08767819037331588
        property: actuators
        type: reach

(This of course is also very tailored towards my case, and it would be cleaner to even have one property key only, i.e. flatten queries)

Some changes I would find useful (I am not saying that these should be done, this is just my personal preference for my personal use case):

I think the important thing to discuss here is: Are these task definitions supposed to be written by hand or generated by tools?


Inspired by this, one suggestion: Would it be possible to add one more task definition format, which is very "dumb": It effectively is a JSON that simply contains a list of all invocations that should be run, together with a) the tool's name, b) the invocation's key, and c) all relevant files that are needed inside the container (alternatively, if paths need to be re-mapped, it could contain a "templated" invocation where remapped paths are then inserted; I can elaborate on this if needed). Then, if whatever format is present is not suitable, one can simply generate such a file and use this as a base, bypassing most of the execution generation. (After all, this is the core task of benchexec: take a set of executions and run them, and this would be, I think, super nice. But I dare say that I am speaking rather as a power user here.) I do realize that this might make things tricky, because for extracting results from output a tool might need a certain amount of context, which I cannot judge right now. But I think that if that JSON file simply contains whatever the RunSet etc. provide, it should be fine?

Thinking about this, I guess I would like this a lot :)
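
A sketch of how generating such a file could look (the JSON structure is invented here, purely to illustrate the idea):

import json

# invented structure, purely to illustrate the proposed "list of invocations" file
invocations = [
    {
        "tool": "mychecker",                               # placeholder tool name
        "key": "embedded-actuators",                       # unique key to attach results to
        "files": ["models/embedded.prism", "models/embedded.props"],  # needed in the container
        "command": [
            "mychecker", "models/embedded.prism",
            "--properties", "models/embedded.props",
            "--property", "actuators",
        ],
    },
]

with open("invocations.json", "w") as f:
    json.dump(invocations, f, indent=2)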


Did I forget anything?

PhilippWendler commented 6 months ago

Point 1: Scoring / results.py. We agree that this is tied to SV-COMP and could be generalized by providing a more flexible way of interpreting results. The logic would be that we have a "result interpreter" that asks the tool "please tell me what your result is" and then compares that to a provided reference result (if any). A result could be any object (a string, int, float, or something complex). The interpreter is expected to output a verdict (Correct, Incorrect, Unknown, Error, ...) + score + potentially additional information (that only the interpreter can know, not the tool, e.g. for approximation results how close the result actually is, which might be different from the score)

Ah, you want this component to be able to add other columns to the table. That can be useful, of course. And it should be easy to add. We should just do this separately and step by step: First make the result classification depend on the Property instance (#996), then add support for QComp properties with quantitative results and the respective scoring, then allow Property to produce arbitrary values in addition to result category and score (#997).


Point 2: Documentation. I didn't mean that SV-COMP is explicitly mentioned, but the workflow and design of specification files etc. has a certain "bias".

The specification (property) files are defined by SV-COMP, not BenchExec, and thus they are SV-COMP specific. But BenchExec only looks at them to decide whether it should compute SV-COMP scores, you can use property files with arbitrary content with BenchExec.

This is something that will arise naturally. If I use benchexec and know that there is interest for it, I can organize my findings and eventually propose changes / extensions to the doc :)

Sure, this is always welcome.


Point 3: Generalizing verdicts from string to objects. I just think this is flexibility for free. Right now, ERROR(reason) is free text and consistency is only ensured by having one result interpretation. If I have a Result(object), ExecutionFailed(Result), and MemoryExceeded(ExecutionFailed) (names subject to change), it is clear what happened, I can attach memory exceeded-specific information to that error (bad example, since here it makes little sense), and it is clear what a second result interpreter should return.

I do not yet see a convincing benefit that would result from this change. In any case, it seems pretty orthogonal to the goal of supporting the QComp use case for BenchExec, so let's please have this discussion separately.

If it turns out that actually the string representation is problematic during implementation, then we can easily think about how to solve it. But right now this is hypothetical.

For a better example, when a tool fails because of an internal exception, I find it useful to categorize the error. For example, I might have a case where a tool fails because of an assertion error, and then it is useful to have an AssertionError(...) error type but still provide further details.

Absolutely, please do this. There is no need to change anything in BenchExec for this, because this is exactly what we already use for existing tools. Just have a look at this example.


Point 4: Task-definition Format. This is, for example, one of the things that from the outside really feels very tailored towards SV-COMP / a particular style of parametrizing executions; especially its description: it talks about systems and specifications, the expected verdict is boolean, a subproperty is something involving a conjunction of properties, ..., and most of this doesn't make sense for a lot of tools that just want to execute something. And yes, you can press other applications into this format, but as an outsider it doesn't feel nice.

If you "just want to execute something", don't use task-definition files at all, they don't provide value for that, and BenchExec doesn't require them for that. You need task-definition files if you want to tell BenchExec "here is a task that consists of some information that the tool should get as input and some information on how to interpret whether the result is correct".

And of course expected verdict is currently boolean, but I have suggested already to just extend this.

A concrete example: In QComp, having multiple input_files or multiple properties does not make so much sense.

Just always use one input file and property. The format doesn't force you to have more than one. Most of SV-COMP also uses only one input file, for example.

We rather have (model file, model options, property file) + list of property keys (something like a sub-property I guess?) together with expected value. For example, this is an excerpt from my QComp helper script definitions:

- model:
    class: prism
    const:
      MAX_COUNT: 2
      T: 12
    path: models/qvbs/ctmc/embedded/embedded.prism
    properties: models/qvbs/ctmc/embedded/embedded.props
    states: 3478
    type: ctmc
  queries:
    - spec:
        expected: 0.08767819037331588
        property: actuators
        type: reach

I don't understand everything in this example, but to me this looks like it would be well covered by the existing format plus an extension for quantitative expected results?

Some changes I would find useful (I am not saying that these should be done, this is just my personal preference for my personal use case):

  • Possibility of multiple definitions in one file

Not having this is actually on purpose, because it significantly simplifies handling of the tasks. We discussed this several years ago and had a proposal for how it could look, but the current structure was preferred in the end. For example, in the result tables you need a clear identifier for every executed run. You also would need a way to tell BenchExec which of those multiple definitions in each file it should execute, because you don't want to always execute all of them. By having only a single definition in each file, all these kinds of problems and questions are avoided.

  • Maybe a field tags and/or information

Isn't this just the same as the existing options?

  • Instead of hard-coding "there are properties with property files etc." have input_files: be a dictionary that then is given to the tool as such. For example input_files: { "model": "model.prism", "properties": "properties.prop" }. Then, properties: is a list containing entries with name: ..., expected_verdict: <any json value> and optional values that again the tool has to understand (name is just the "display name" and doesn't need to be an actual "key" in the properties file.)

The property file is something that BenchExec needs to know about, so it would be treated specially anyway. I can see that what you suggest would also be possible, but currently we have the existing format and so far I haven't seen anything that wouldn't fit into the current format plus the discussed extension, so why completely redesign the format?

Honestly, this part of the discussion doesn't feel 100% productive right now. I would prefer to talk about format changes with concrete use cases and examples that do not fit into the existing format.

I think the important thing to discuss here is: Are these task definitions supposed to be written by hand or generated by tools?

We don't prescribe either. You can write them by hand or generate them, as you prefer.

Inspired by this, one suggestion: Would it be possible to add one more task definition format, which is very "dumb": It effectively is a JSON that simply contains a list of all invocations that should be run, together with a) the tool's name, b) the invocation's key, and c) all relevant files that are needed inside the container (alternatively, if paths need to be re-mapped, it could contain a "templated" invocation where remapped paths are then inserted; I can elaborate on this if needed). Then, if whatever format is present is not suitable, one can simply generate such a file and use this as a base, bypassing most of the execution generation. (After all, this is the core task of benchexec: take a set of executions and run them, and this would be, I think, super nice. But I dare say that I am speaking rather as a power user here.) I do realize that this might make things tricky, because for extracting results from output a tool might need a certain amount of context, which I cannot judge right now. But I think that if that JSON file simply contains whatever the RunSet etc. provide, it should be fine?

BenchExec always needs tool-specific knowledge, e.g., for constructing the command line, and we cover this with the tool-info module. If you are suggesting something that avoids tool-info modules (I am not sure if I interpret this correctly), I don't see how this would make sense in practice. (And it would stray pretty far from the use case of BenchExec in QComp, because there you obviously need parsing of the tool output, so more tool-specific code.)

If you are not suggesting to avoid tool-info modules, then we have already what you seem to want. You only need to write a benchmark definition and define the runs (input files + command-line arguments) there, no need at all for property files and task-definition files. Of course BenchExec then also cannot classify the result into correct/wrong, but everything else works. So there is no need for another format to cover the same use case.

incaseoftrouble commented 6 months ago

The specification (property) files are defined by SV-COMP, not BenchExec, and thus they are SV-COMP specific. But BenchExec only looks at them to decide whether it should compute SV-COMP scores, you can use property files with arbitrary content with BenchExec.

That's not what I meant. This was again hinting at the visibility issue. I know I can use them for something else. But at first glance this is not clear / was not clear to me. But let's skip this; I think this would be resolved by the other points.

I do not yet see a convincing benefit that would result from this change. In any case, it seems pretty orthogonal to the goal of supporting the QComp use case for BenchExec, so let's please have this discussion separately.

Agree with the latter :)

I don't understand everything in this example, but to me this looks like it would be well covered by the existing format plus an extension for quantitative expected results?

It is coverable, but quite cumbersome / it does not feel correct to do it (again, the visibility point). (More below)

The property file is something that BenchExec needs to know about

Isn't this a severe constraint? Why is a property file relevant? What makes it different from the system file? E.g. QComp will likely migrate to JANI, which has system and property description in one file.

Honestly, this part of the discussion doesn't feel 100% productive right now. I would prefer to talk about format changes with concrete use cases and examples that do not fit into the existing format.

Sorry if this is so undirected! I think this is partly a rubber-ducking exercise; I am only now beginning to exactly understand my issue.

Also: These suggestions are motivated by my personal experience but I am trying to anticipate what requirements others might have.

BenchExec always needs tool-specific knowledge, e.g., for constructing the command line, and we cover this with the tool-info module. If you are suggesting something that avoids tool-info modules (I am not sure if I interpret this correctly), I don't see how this would make sense in practice.

I want to be able to tell benchexec "I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result". No further information should be required.


The overall point I want to make: I do not "understand" the point of, for example, categorically differentiating between property files and input files. This seems specific to a particular use case. I get where this comes from, but it seems unnecessarily complex for the main purpose (run stuff in a controlled environment) and not canonical for most applications where I would use it. (Quite often, tools might not even speak about properties at all but still produce a result.) Yes, probably all use cases can somehow be squeezed into the current format, but it doesn't feel right to me. Maybe that's just me. But, to me, that is the reason why I got the feeling that benchexec isn't meant to be used this way.

I would expect that a standard task definition contains (i) a tool name (ii) a set of (named) files (iii) an optional reference result (+ how to handle it) (ideally this could be anything, i.e. also the path to a file, or a complex result like a witness strategy / certificate -- it's the job of the tool / result handler to interpret the result) (iv) arbitrary further keys that are passed to the tool info, which is then responsible to assemble the command line (for what it's worth, point (ii) could actually be dropped and the tool info can be responsible for extracting the set of paths). (The tool info could also be required to define a unique key)

For example:

tool: mychecker
limits:
  cpu: 1 
  memory:
    amount: 1
    unit: GiB
  # etc., these are fixed 
instance: # all keys below are "free form" and passed to the tool-info as is. instance is not a good name here
  system: foo.file
  properties: prop.file
  sub_property: unreach-call
expected: false # optional, any format, to be understood by tool info

tool info would then report "these two files are needed"; benchexec gives the absolute paths where these files will be put; tool info assembles the execution, e.g. mychecker /path/in/container/foo.file --properties /path/in/container/prop.file --property unreach-call, and derives a unique key for this execution; the tool is executed and results are gathered using the given key; tool info validates the result (no need for result handlers!) and computes a score; tool info extracts columns from the output and the table is assembled. (Yes, this puts a lot of responsibilities on the tool info, but it should be reasonably simple to provide sensible base implementations that even provide backwards compatibility; I am happy to sketch how I would imagine that.) Note: I would guess that the above file should probably rather be in JSON and usually be machine-generated from a more concise, domain-specific format.
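
A sketch of the tool-info responsibilities described above, under this assumed (non-existing) interface; all method names are placeholders:

class MyCheckerInfo:
    """Hypothetical tool info carrying all the responsibilities listed above."""

    def required_files(self, instance):
        # "these two files are needed"
        return [instance["system"], instance["properties"]]

    def cmdline(self, executable, instance, mapped_paths):
        # mapped_paths: absolute paths inside the container, provided by benchexec
        return [
            executable,
            mapped_paths[instance["system"]],
            "--properties", mapped_paths[instance["properties"]],
            "--property", instance["sub_property"],
        ]

    def unique_key(self, instance):
        return "{}-{}".format(instance["system"], instance["sub_property"])

    def validate_result(self, output, expected):
        # placeholder equality check; returns (verdict, score)
        verdict = "correct" if expected is not None and str(expected) in output else "unknown"
        return verdict, (1 if verdict == "correct" else 0)

    def extract_columns(self, output):
        return {}  # e.g. {"states": ..., "iterations": ...}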

Thinking about it, I guess all my points boil down to this. Again, I do see that this functionality is already possible with the current input format once you know how to do it. I am, however, suggesting that such a format makes benchexec easier to use. Maybe I am also just not understanding the generic <benchmark> definition precisely. But, to me, it "wants" a structure like one tool -> few configurations / options / rundefinition -> many inputs, i.e. have different "configurations" of the tool which are then generically applied to all inputs. This is not how most of my benchmarks look like, for example I would definitely have a unique set of command line options for every instance. In other words, I require one rundefinition for every single input, and this rundefinition cannot be used for any other input.

PhilippWendler commented 6 months ago

The property file is something that BenchExec needs to know about

Isn't this a severe constraint? Why is a property file relevant? What makes it different from the system file?

Note that BenchExec only needs to know that the property file is the property file. It does not need to know or interpret its content. The property file can even be empty. And if the benchmarked tool doesn't accept a property file as input, BenchExec can use the property file without passing it on to the tool. So everything is possible, it is not a severe constraint.

The reason for BenchExec knowing about the existence of property files is the following:

  1. It is common for formal-method tools to get as input some model and some instruction what to do with the model (which we call the property): check whether a certain part of the model is reachable, check whether the model has infinite paths, check a program for memory safety, create test cases covering all branches, create test cases covering all statements, check for satisfiability, etc.
  2. It is common to have one model and sometimes wanting to perform one of the above and at other times something else. Both Test-Comp and SV-COMP have this use case, for example, and SV-Benchmarks, the largest collection of software-verification benchmarks, even has one model that is used in 8(!) such different cases.
  3. While we decided in the past that it simplifies handling of tasks significantly if we do not support more than one (variant of) a model in a single task-definition file, it would be quite redundant to have each instance of the tuple (model, property, expected verdict) in a separate file. In particular, it would duplicate the definition of the model and introduce the risk of inconsistencies there.

So in order to meet these requirements, we decided that task-definition files should support definitions of (model, {property -> expected verdict}). That creates the need to tell BenchExec "use this task definition and pick property X from it", so BenchExec needs a way to understand what property X is. But of course we do not want to restrict BenchExec to only certain hard-coded and known properties. This is solved by the current system: The property is defined in a file, the task definition refers to that file, and so does the benchmark definition (which defines which tasks are to be benchmarked). BenchExec then just needs to match the references to the property file in benchmark definition and task definition in order to select tasks. For that it needs to know what the property file is, but nothing more. You can think of BenchExec using the property file only as an opaque key.

Furthermore, not all tools actually support these property files, even for cases in SV-COMP where they exist and are defined by the rules. Keeping the property file separate from the rest of the input allows the tool to be benchmarked nevertheless. This is also important, for example, for taking the benchmark set from a competition with a property file and running a (potentially old) tool on it that has never heard of this property-file format. So keeping the property file separate from the model file provides more flexibility and makes more use cases possible!

Of course this use case was motivated by SV-COMP and Test-Comp with their respective sets of properties, but that doesn't mean that it is specific to it. I know some communities do not have different properties, they might use only one property all the time. But these are not excluded, worst case is that they need to have a dummy property (a single file, can be empty) that is their "definition" of a property for the sake of BenchExec, and no other tool needs to know about it.

BenchExec always needs tool-specific knowledge, e.g., for constructing the command line, and we cover this with the tool-info module. If you are suggesting something that avoids tool-info modules (I am not sure if I interpret this correctly), I don't see how this would make sense in practice.

I want to be able to tell benchexec "I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result". No further information should be required.

First, benchexec is primarily intended for cases where you do not have this only 1x, but potentially many times. If you just want to execute a tool once, you would typically just start it directly, or (if you want measurements and containerization) we provide runexec for this use case. But if you just execute the tool once, you typically don't really need a tool-info module to create the command line for you, nor automatic interpretation of the results. (For the use case of executing a single run, benchexec is anyway not so convenient because you for example need to run table-generator and open an HTML file to look at the tool output or manually pull it out of a ZIP archive. runexec serves this use case better.)

So the main use case of benchexec is "(I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result) times 1000".

Second, as explained already, if you exclude the "I expect the following result" part of that requirement, benchexec supports the use case already without task definitions and properties. You only need the benchmark definition. That single benchmark definition can easily define many tasks, via wildcards or "set files" (trivial files containing sets of wildcards).

But how do you define the expected result if you have thousands of tasks? We could add a way to define the expected result in the benchmark definition, but that would be pretty bad. You could not use wildcards anymore; instead you would have a huge list of individual items, each with input file and expected result. But the benchmark definition is tool-specific, so it is not reusable. As soon as you would want to benchmark two tools (and this is of course extremely common) you would have to duplicate these thousands of (input file, expected result) tuples for every single tool. We do not consider this acceptable.

Furthermore, public availability of good and easy to use benchmark sets is good for research. And the mapping "input file -> expected result" is really part of the definition of the benchmark set, and it should be distributed together as part of the benchmark set in a way that is independent of any concrete benchmarking or benchmarked tool. So again, this mapping needs to be outside of the benchmark definition. There is an important conceptual difference between "benchmark set" and "definitions of one concrete benchmarking experiment", and expected results are part of the former.

(Btw. you would end up with exactly the same argument if you start thinking about how to write down the definitions of many tasks, each of which consists of several inputs, i.e., sets of sets of input files.)

So in the end, we absolutely need a way to write down many (input files, expected result) tuples somehow. This resulted in the creation of the existing task-definition format, because we are not aware of another existing format that would cover this.

Now you might say that for a use case of "(I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result) times 1" this creates unnecessary overhead. Yes, in principle, but compared to "(I want to run this command line, I require these files to be present) times 1" and "(I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result) times 1000" we expect this particular case to be relatively rare. And after all it is completely doable with BenchExec, just with a little bit more of input needed.

The overall point I want to make: I do not "understand" the point of, for example, categorically differentiating between property files and input files.

Explained above.

This seems specific to a particular use case. I get where this comes from, but it seems unnecessarily complex for the main purpose (run stuff in a controlled environment)

Again: If you only want to "run stuff in a controlled environment", you don't even need it. You only need it if you want more than that.

I would expect that a standard task definition contains (i) a tool name (ii) a set of (named) files (iii) an optional reference result (+ how to handle it) (ideally this could be anything, i.e. also the path to a file for complex results like a witness strategy / certificate -- it's the job of the tool / result handler to interpret the result) (iv) arbitrary further keys that are passed to the tool info, which is then responsible to assemble the command line (for what it's worth, point (ii) could actually be dropped and the tool info can be responsible for extracting the set of paths). (The tool info could also be required to define a unique key)

For example:

tool: mychecker
instance: # all keys below are "free form" and passed to the tool-info as is
  system: foo.file
  properties: prop.file
  sub_property: unreach-call
expected: false

No! This would lead to a huge amount of duplication of data, prevent easy tool-independent definition of benchmark sets, etc. The benchmarked tool needs to be separate from the task definition.

And if you want BenchExec to check expected results, then it also needs to know about stuff like sub_property (or required precision for quantitative tasks etc.). So this also needs to be in a specific format, not free form.

And once you implement these two changes, you end up with almost exactly what we have with the existing format.

tool info validates the result (no need for result handlers!) and computes a score

No! The logic for classifying results and computing scores can be quite complex, it must not be duplicated for every single tool, but implemented only once. Furthermore, many tools should be benchmarkable in more than one use case, e.g., for participation in two different competitions. Supporting all this in every tool-info module would create a quadratic number of cases.

And it is really important that tool-info modules remain as small and easy to implement as possible, because even as it is, developers sometimes struggle with writing them (plenty of PRs in this repo are proof of that).

From the point of the competition organizer, it is also important that they do not need to trust the implementers of the tool-info modules, otherwise the organizers would need to review (and re-review on every change) all the involved tool-info modules. Right now, this is not necessary, the tool-info module cannot do anything to cheat (all the code inside it is even executed in a container).


So we are always glad to receive feedback and are open for suggestions, and we certainly want to extend BenchExec to cover new use cases and are also open for format changes. But I do not think it is good to add specific formats tailored for every single individual use case if there is already a way to express this use case with a different format with only slight overhead. And we would not want to encourage practices like mixing the abstract task definitions that should be part of a benchmark set with tool-specific definitions of a specific benchmarking experiment. If you have concrete use cases along these lines that are not reasonably covered by the existing stuff, then please let's talk about them!

PhilippWendler commented 6 months ago

Your edits overlapped with my writing of the response, so my response is based on an older version of your post. I am not sure what the differences are, but I can additionally respond to the following paragraph:

Maybe I am also just not understanding the generic <benchmark> definition precisely. But, to me, it "wants" a structure like one tool -> few configurations / options / rundefinition -> many inputs, i.e. have different "configurations" of the tool which are then generically applied to all inputs.

Yes. Because this is really the most common case, for example in competitions or when running benchmarks with huge benchmark sets and many tools for a paper.

This is not how most of my benchmarks look like, for example I would definitely have a unique set of command line options for every instance. In other words, I require one rundefinition for every single input, and this rundefinition cannot be used for any other input.

We expect this use case to be more rare than the above, and in particular exist only with much smaller numbers of both tasks and tools. Because if you have > 1000 tasks and > 10 tools, you would of course not think about manually defining a unique set of command-line options for every instance, right?

But nevertheless, there are ways to do it with current BenchExec (for n tasks and m tools):

  1. Write a benchmark definition file with many <tasks> tags. Each <tasks> tag contains definitions of input files and command-line options, and all options are applied to all input files. So you need one <tasks> tag for every unique set of command-line options. And if you have more than one tool, you need to duplicate this structure for every tool (in a separate file). You actually do not need more than one <rundefinition>. Effort is O(n*m). (If you don't like writing this XML, you could generate it.)
  2. Find a way to abstract from the command-line parameters into tool-independent definitions. Create a task-definition file for every task, and put those definitions into options: (which can have arbitrary content). Extend the tool-info module of every tool that you want to benchmark with logic that derives the concrete command-line options from the tool-independent definitions. The benchmark definition for every tool then just includes the tasks and potentially defines additional command-line arguments. Effort is O(n+m), but you first need to think about how to write down the required information independently from the tool. (See the sketch after this list.)
  3. Create a task-definition file for every task and put something like tool_options: tool_X: "--foo" into options:. Extend the tool-info module of every tool that you want to benchmark with logic that appends these options to the command line. The benchmark definition is the same as for the previous option. Effort is O(n*m). (I consider this option kind of a hack and not the intended use case of task-definition files, but it is possible.)
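To make option 2 more concrete, the tool-specific mapping could live in the tool-info module roughly as follows. This is only a minimal sketch assuming the BaseTool2 interface; the tool name mychecker and the options: key constants are made up for this example:

# mychecker.py -- hypothetical tool-info module (sketch only)
import benchexec.tools.template


class Tool(benchexec.tools.template.BaseTool2):
    """Sketch of option 2: translate tool-independent task options
    into tool-specific command-line arguments."""

    def executable(self, tool_locator):
        return tool_locator.find_executable("mychecker")  # hypothetical binary

    def name(self):
        return "MyChecker (hypothetical)"

    def cmdline(self, executable, options, task, rlimits):
        cmd = [executable] + options + list(task.input_files)
        # task.options holds the free-form "options:" dict from the
        # task-definition file; the "constants" key is an assumed convention.
        for key, value in (task.options or {}).get("constants", {}).items():
            cmd += ["--const", f"{key}={value}"]
        if task.property_file:
            cmd += ["--property", task.property_file]
        return cmd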

Of course I would recommend 2. (and it is the only practical way for a competition), but 1. and 3. are already possible. And I think that 1. and 3. are not significantly worse than the format ideas you have presented, or do you?

incaseoftrouble commented 6 months ago

Thank you for taking your time to answer! I hope my "resistance" doesn't come across in a negative way; I really appreciate the interaction and just want to explain how I think about / perceive this.

I can follow the reasoning, but disagree with some of the points. But I have now also understood several things that weren't clear to me, which leads me to several concrete (and not so drastic) suggestions at the bottom. So bear with me :) I will try to silently skip over most points that are now resolved for me.

Let me start by saying that I think quite a bit of the confusion comes from the fact that you and I interpret the meaning of a property file subtly differently (I comment on this below).

Note that BenchExec only needs to know that the property file is the property file. It does not need to know or interpret its content. The property file can even be empty. And if the benchmarked tool doesn't accept a property file as input, BenchExec can use the property file without passing it on to the tool.

Being devil's advocate: Well, that then really sounds like it is completely superfluous, doesn't it? "We require it, but you can always leave it empty." (I comment on that below)

It is common for formal-method tools to get as input some model and some instruction what to do with the model (which we call the property)

Yes, but very often not in the form of a file (or a separate file). It seems weird to require that this is a file. Concrete example: it is likely that QComp will use the JANI format for inputs; then no tool will have a separate property file. All properties are written in the "model description" file. Yes, we can give every single definition the same empty dummy file, but it just seems weird then.

So the main use case of benchexec is "(I want to run tool name x, I require these files to be present, please pass the following information to the tool-info module to build the invocation, I expect the following result) times 1000".

Yes, my use case would be to run a few thousand invocations several times each (maybe not x1000 but x5 or x10, and >20k executions in total). QComp isn't large by any means, but there are a few hundred model x property combinations and 10ish tools. We also are thinking about running the evaluation in a continuous fashion: Authors can upload a new version any time and then their new version is re-evaluated, so we also would get many repeats of that.

Furthermore, public availability of good and easy to use benchmark sets is good for research. And the mapping "input file -> expected result" is really part of the definition of the benchmark set, and it should be distributed together as part of the benchmark set in a way that is independent of any concrete benchmarking or benchmarked tool. So again, this mapping needs to be outside of the benchmark definition

Yes, absolutely agree!

No! This would lead to a huge amount of duplication of data, prevent easy tool-independent definition of benchmark sets, etc. The benchmarked tool needs to be separate from the task definition.

This is a separate point I wanted to make, sorry if this was not clear. This "format" is, effectively, a "serialization" of all individual runs you want to have. It is not meant to be what you publish, or use as the de-facto definition of your benchmark set. But having such an intermediate representation disentangles "How do I compactly represent my benchmark set" from "What are the things I need to run". It gives you the freedom to write your benchmarks in a domain-specific language and then generate these definitions on the fly, without needing to go to task-definitions etc. After all, this literally is what the property file -> run set conversion is doing. Just that this is not written out as a file in between. Would give you more flexibility, but, now that I understand the things I comment on at the bottom, it definitely is a separate issue.

EDIT: Ignore that, this is better solved by runexec I now realized (also commenting on this below)

No! The logic for classifying results and computing scores can be quite complex, it must not be duplicated for every single tool (...)

Subclassing / composition could take care of this. But it's not my main point; I don't really care whether the tool definition or a separate result handler does this, it was just a side remark.


(sorry for the edits :D replying to the second point.)

Yes. Because this is really the most common case, for example in competitions or when running benchmarks with huge benchmark sets and many tools for a paper.

I really do not think so. Command line options specific to model-property-tool-combinations often are part of the input in my experience (also see below).

Because if you have > 1000 tasks and > 10 tools, you would of course not think about manually defining a unique set of command-line options for every instance, right?

Sure! I did exactly that by effectively implementing the things I discussed above. I have a few thousand instance definitions of the form roughly model x property type x required precision x expected result and ~10 different "tool info"-like implementations that, given this tuple (and further information), generate the set of environment variables + exact invocation. Instead of running these directly, it would be super cool (but, again, I realize now, orthogonal to this issue) if I could dump these into a big JSON and tell benchexec "please just run these things and give me some numbers".

EDIT: I realized this is sort of what the Python API for runexec would give me. Sorry. Feel free to ignore that part. A side comment here: There already are a lot of domain-specific benchmark set definitions. It is much easier to export to such a low-level "front end" than it is to re-write all the definitions (which also might be much more compact and flexible than task-definitions or the benchmark.xml files; e.g. I don't see a simple way to filter out very specific execution tuples). So it might anyway be cool to have a standard format to say "please run these N executions" for runexec. So something like runexec executions.json and then this just does its stuff.

Option 2 / using options

Aha! I misunderstood options :-) That definitely is what I want (and what I did in my own solution)


I think we are converging on something. At least I begin to understand what is happening.

I see what your point with the task definitions is. I definitely agree with the sentiment. Effectively, I have built the same thing in spirit (essentially doing Option 2). Some points I have on this:

  1. Not every tool / every configuration of a tool supports every property I would put in such a task definition file. (The kind of models / properties / guarantees different tools provide in QComp is very heterogeneous. We do not even have one uniform language to specify models.) It would be annoying to split up the task-definitions by type of property. I think a tool info currently cannot gracefully indicate that it simply does not support a particular property; it would not be good to classify this as an ERROR. This I would propose to add. (Related to this: It might be interesting to be able to "classify" or "tag" properties, for example as a "reachability" property or a "mean payoff" property, and then be able to group results by this type or restrict execution to certain tags. But again, something for another issue.)
  2. There might be property-specific options, in the sense that these define the property in question. Concrete example: My "instance specific input" might be model.file --property_type mean_payoff --reward_source XY --coalition 1,2,3 --optimization max. model.file naturally is the model definition, and then I "manually" specify a "players 1-3 maximize mean payoff based on reward XY" property (because the property file language does not support this kind of expression). A different tool might need a different command line. So I would want to have a "property" that I call mp_XY_max and has options coalition: 1,2,3, optimization: maximize, reward: XY. But this should be reasonable to add, right?
  3. (Partly based on 2.) I would still advocate for allowing a generic property: descriptor as an alternative to property_file:. I do realize that you can just dump whatever you write after property: into a file which is passed to property_file: or have an empty input, but that feels silly. If having an empty input is fine, why can't we make the field optional and, if absent, pretend it is empty? Also, this does not work if I need a generic property file and property-specific but tool-independent options, which is normal for my applications, see example below. Also: It might, dually, be the case that one property requires multiple files, e.g. "given the strategy given in file x, and the property in file y, what is the value of y if x is applied to model z?" (the strategies are specific to a property and are typically quite large!).
  4. I am wondering about subproperty. I understand this as follows: In your application, a "property file" is a collection of things that should simultaneously be checked for a system, correct? And specifying a "subproperty" says "Only look at this one property of all the given properties". So, as I understand it, right now the idea behind writing property_file is "check all properties given in this file" and property_file + subproperty is "check this particular one". This interpretation is specific, but the underlying idea is similar to my situation. For, e.g., QComp, a "property file" is simply a collection of different queries you could potentially be interested in. They can be false, true, quantitative, multi-dimensional, whatever, and, in particular not related to each other at all (see here for a random example of a "property file") and always specific to a concrete model (family). These properties can be referred to by name or index (which, I guess, would be sort of like the "subproperty"). But we never run an evaluation like "please check all the properties in this given file" and a "property file" is not interpreted as "these are things that all should hold on the model". Most queries are "I want to know the value of XY". Abstracting both of these views, isn't the properties list just saying "this is a list of things I want to know about the model" (and a tool should answer)? Would it not be a reasonable idea to slightly generalize this point, i.e. to align what can come under properties more with this view?

Thinking about 2+3+4: Isn't all this property_file, property-specific constants, etc all just a "codeword" for "property options"? We have a property with a fixed name, an (optional) expected result, and then whatever needs to be said in order to concretely describe the query / property.

Summarizing: I propose to be able to write task-definitions like the following + allow tools to indicate that they do not support a particular query (I know that this opens some questions for how to generate tables)

format_version: '3.0'
input_files: 'model.prism'

properties:
  - name: bounded_reach
    file: properties.props
    options:
      constants:
        T: 12 # This is a step-bound specific to this *query*, with T: 20 the result would be different
    expected_result: 0.89
  - name: mp_serve
    property:
      type: mean_payoff
      reward_structure: serve
      coalition: [ 1, 2 ]
      optimization: maximize
    expected_result: # A precise fraction
      numerator: 13254
      denominator: 12213344

  # Alternative, more flexible approach
  - name: xy # This now is the globally unique key (or unique within this file and treat (model key, property key) as globally unique)
    expected_result: 1
    definition: # After this is free form with some standardized key names
      file: properties.props
      constants: { T: 12 }
      key: property_xy # the "subproperty"

options:
  constants:
    N: 12
    reset: true

I think that the intention and meaning of the keys should be clear. This would still lead to some duplication (one file for each instantiation of the model), but this is unavoidable I think.

There is a subtle question, in particular for quantitative queries: Is the precision requirement (e.g. exact or approximate) part of the property or the tool configuration? I think the latter is "easier" because then the tool definition can declare a matching result handler.

incaseoftrouble commented 6 months ago

A second comment, because I find it important: Thinking more on this, I really think that runexec should be much more prominent. I daresay it appeals to a much wider audience than benchexec. Runexec is much simpler to use and (sort of) a drop-in replacement for running with time or similar (correct?). Practically any paper could use this. Benchexec feels much more tailored to expert users and definitely requires at least some kind of specific setup (you need to write some configuration files at least) and probably does not pay off for a single evaluation that doesn't compare a large number of functionally similar tools.

If I understand it correctly, runexec already brings most of the things I want, and benchexec is "just" some organizational stuff around it that many people probably just want to hack on their own anyway. But this really is not apparent (at least it wasn't to me). I was under the impression that in order to get the full "how to do a benchmark correctly" thing, you need benchexec. But, at least this is how I understand it, this is not at all true, and benchexec is "just" runexec + a way to concisely write runs + a way to merge results into a table. All the time I was talking about trying to disentangle things, I was basically trying to carve out what I now think runexec is already doing :man_facepalming:

So, the above suggestions for the task format are still valid, but now have a completely different quality for me. I'm very sorry for my misunderstandings that resulted in so much confusion.

PhilippWendler commented 6 months ago

Note that BenchExec only needs to know that the property file is the property file. It does not need to know or interpret its content. The property file can even be empty. And if the benchmarked tool doesn't accept a property file as input, BenchExec can use the property file without passing it on to the tool.

Being devil's advocate: Well, that then really sounds like it is completely superfluous, doesn't it? "We require it, but you can always leave it empty." (I comment on that below)

But that is not the case. You can leave it empty only if your tool doesn't care about it or does not support it.

It is common for formal-method tools to get as input some model and some instruction what to do with the model (which we call the property)

Yes, but very often not in the form of a file (or a separate file).

We know. This is why there is no requirement at all for the tool to be aware of the property file.

It seems weird to require that this is a file.

Consider the alternatives. BenchExec anyway needs to support the concept of property files, because there are use cases that require BenchExec to use them for picking out the expected verdict from the task-definition file and for determining which scoring schema to use. So the alternative would not be to replace property files with some "non-file property", but only to support both property files and something else. Given that not every benchexec user even needs this functionality at all (only if you want result classification) and that every "non-file property" use case can be handled using property files by imposing on the user only the trivial cost of having an additional file lying around, the cost-benefit trade-off of supporting "non-file properties" additionally does not seem convincing so far.

Concrete example: it is likely that QComp will use the JANI format for inputs; then no tool will have a separate property file. All properties are written in the "model description" file. Yes, we can give every single definition the same empty dummy file, but it just seems weird then.

Even if it is weird, you certainly agree that it doesn't hurt or complicate stuff in any relevant manner, right?

Yes, my use case would be to run a few thousand invocations several times each (maybe not x1000 but x5 or x10, and >20k executions in total). QComp isn't large by any means, but there are a few hundred model x property combinations and 10ish tools. We also are thinking about running the evaluation in a continuous fashion: Authors can upload a new version any time and then their new version is re-evaluated, so we also would get many repeats of that.

No problem at all, and exactly the intended use case for benchexec. For example, SV-COMP'23 had 5 million executions, and 39 million executions if you include the pre-runs for testing.

Yes. Because this is really the most common case, for example in competitions or when running benchmarks with huge benchmark sets and many tools for a paper.

I really do not think so. Command line options specific to model-property-tool-combinations often are part of the input in my experience (also see below).

Because if you have > 1000 tasks and > 10 tools, you would of course not think about manually defining a unique set of command-line options for every instance, right?

Sure! I did exactly that by effectively implementing the things I discussed above. I have a few thousand instance definitions of the form roughly model x property type x required precision x expected result and ~10 different "tool info"-like implementations that, given this tuple (and further information), generate the set of environment variables + exact invocation.

You are confirming exactly my point, even though you claim the opposite. You are not manually defining a unique set of command-line options for every single instance. You are generating the command-line options from a few specific facts for each instance. This is exactly what the tool-info module could do based on options: in the task-definition file.

Or in many cases you can do that just with benchmark definitions. Typically you have some command-line options that you always want to pass, some command-line options that you want to pass for a specific (set of) tasks, and some command-line options that you want to vary between (e.g., tool configurations that you want to compare or so). You can specify these three kinds of command-line options in the benchmark definition (<option> can be on the root level, or inside <tasks>, or inside <rundefinition>) and benchexec will generate the correct combinations. An example of this is this benchmark definition of the CPAchecker developers. Doesn't this look like it would support your use case (worst case by having many <tasks> and/or generating this from some other high-level format)?

EDIT: I realized this is sort of what the Python API for runexec would give me. Sorry. Feel free to ignore that part. A side comment here: There already are a lot of domain-specific benchmark set definitions. It is much easier to export to such a low-level "front end" than it is to re-write all the definitions (which also might be much more compact and flexible than task-definitions or the benchmark.xml files; e.g. I don't see a simple way to filter out very specific execution tuples). So it might anyway be cool to have a standard format to say "please run these N executions" for runexec. So something like runexec executions.json and then this just does its stuff.

No, don't use runexec for this. If you use runexec for many runs, you will have to implement a lot of things around it, such as storing results, and you will miss out on all the benchexec features such as the tables. And we also will not implement a way to give N runs to runexec and write out the result data in some file, because we already have that: it is called benchexec.

If you already have benchmark definitions in your own format, then instead just generate a benchmark definition for benchexec. This will be the same effort as generating your hypothetical executions.json. In the worst case it will be a large file with a <tasks> tag for every single task, but for auto-generated files nobody should be bothered.

  1. Not every tool / every configuration of a tool supports every property I would put in such a task definition file. (The kind of models / properties / guarantees different tools provide in QComp is very heterogeneous. We do not even have one uniform language to specify models.) It would be annoying to split up the task-definitions by type of property.

Of course! This is also the case in SV-COMP, for example. BenchExec never takes all properties from a task definition and runs the tool with each of them. It only takes the property that you have explicitly chosen in your benchmark definition. So if you want to benchmark a tool on tasks with two different properties, you need to specify both properties in the benchmark definition. The properties that you don't want to run the tool for should simply not appear in the benchmark definition.

  2. There might be property-specific options, in the sense that these define the property in question. Concrete example: My "instance specific input" might be model.file --property_type mean_payoff --reward_source XY --coalition 1,2,3 --optimization max. model.file naturally is the model definition, and then I "manually" specify a "players 1-3 maximize mean payoff based on reward XY" property (because the property file language does not support this kind of expression). A different tool might need a different command line. So I would want to have a "property" that I call mp_XY_max and has options coalition: 1,2,3, optimization: maximize, reward: XY. But this should be reasonable to add, right?

There are two ways to do this already now: You can put these options into options: of the task definition, even though you might argue that these are more related to the property and not to the model, and you would then need a separate task definition for every (model, options) combination.

Or you simply put all these property options into a file, and then use that as your property file in BenchExec. You are free to choose what this file looks like. This is what I would recommend. The parsing of the file and the derivation of the command-line options could in principle be in the tool-info module, but our experience is that it is often better to move this logic to the tool side. A typical solution for competition participants would be to have a small wrapper script around their tool that takes the property file, parses it, and then calls the actual tool with the desired options. Plenty of SV-COMP participants do that. It simplifies competition organization, because tool developers often want to tweak the logic for deriving parameters from property options. And workflow-wise it is easier if the tool-info module stays stable (also for backwards compatibility across BenchExec versions and competition instances!) as long as possible and only the tool archive is repeatedly updated by the submitters.

So we can discuss further whether such property options should be added to BenchExec (again better in a separate issue for this exact question), but given the last point about keeping often-changing logic out of the tool-info module, I right now think the "property options in property file" solution is preferable.
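To illustrate the "property options in property file" approach, the wrapper script on the tool side could be as small as the following. This is a rough sketch under several assumptions: the property file is assumed to be YAML, and the mychecker binary and its command-line flags are purely hypothetical:

#!/usr/bin/env python3
# run_mychecker.py -- hypothetical wrapper script (sketch only)
# Reads a self-defined property-file format and calls the actual tool
# with the corresponding tool-specific command line.
import subprocess
import sys

import yaml  # assumes the property file is written as YAML (PyYAML)


def main():
    model_file, property_file = sys.argv[1], sys.argv[2]
    with open(property_file) as f:
        # e.g. {type: mean_payoff, reward: XY, coalition: [1, 2, 3], optimization: max}
        prop = yaml.safe_load(f)

    cmd = ["./mychecker", model_file, "--property-type", prop["type"]]
    if "coalition" in prop:
        cmd += ["--coalition", ",".join(map(str, prop["coalition"]))]
    if "optimization" in prop:
        cmd += ["--optimization", prop["optimization"]]

    sys.exit(subprocess.run(cmd).returncode)


if __name__ == "__main__":
    main()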

  3. (Partly based on 2.) I would still advocate for allowing a generic property: descriptor as an alternative to property_file:. I do realize that you can just dump whatever you write after property: into a file which is passed to property_file: or have an empty input, but that feels silly.

This is possible in principle, but I am not convinced yet. Note the minimal benefit and the above discussion on "non-file properties". If you want to discuss further, please open a separate issue for this exact suggestion.

If having an empty input is fine, why can't we make the field optional and, if absent, pretend it is empty?

Supporting something without an explicitly mentioned property at all but with classification of results is a different question. If you want to discuss further, please open a separate issue for this exact question.

Also, this does not work if I need a generic property file and property-specific but tool-independent options, which is normal for my applications, see example below.

Suggestions provided above.

Also: It might, dually, be the case that one property requires multiple files, e.g. "given the strategy given in file x, and the property in file y, what is the value of y if x is applied to model z?" (the strategies are specific to a property and are typically quite large!).

Given that BenchExec uses the property file as some kind of identifier, letting BenchExec know about multiple files wouldn't work well. But you don't need to. You can simply have a file with the tuple (x,y) as your property file for BenchExec.

  4. I am wondering about subproperty. I understand this as follows: In your application, a "property file" is a collection of things that should simultaneously be checked for a system, correct? And specifying a "subproperty" says "Only look at this one property of all the given properties". So, as I understand it, right now the idea behind writing property_file is "check all properties given in this file" and property_file + subproperty is "check this particular one".

No, subproperty is not something like your property options. subproperty is part of the expected result and thus must be kept hidden from the tool. Right now we follow what I recommend above: Everything that the tool should see is in the property file.

This interpretation is specific, but the underlying idea is similar to my situation. For, e.g., QComp, a "property file" is simply a collection of different queries you could potentially be interested in. They can be false, true, quantitative, multi-dimensional, whatever, and, in particular not related to each other at all (see here for a random example of a "property file") and always specific to a concrete model (family). These properties can be referred to by name or index (which, I guess, would be sort of like the "subproperty"). But we never run an evaluation like "please check all the properties in this given file" and a "property file" is not interpreted as "these are things that all should hold on the model". Most queries are "I want to know the value of XY".

Given that the paragraph seems to be based on a misunderstanding of subproperty, I am not sure how to understand it. But the "XY" in your last sentence is exactly what would be the "property" in BenchExec terms, and what you describe about QComp properties is just like for the properties in SV-COMP and Test-Comp, except that each of the properties there is typically interesting for many models and not just a single one.

Thinking about 2+3+4: Isn't all this property_file, property-specific constants, etc all just a "codeword" for "property options"? We have a property with a fixed name, an (optional) expected result, and then whatever needs to be said in order to concretely describe the query / property.

Note the important distinction between those options that (a) define the query (and which should be passed to the tool) and those options that (b) define the expected verdict (and should not be passed to the tool). But yes, apart from this important distinction, there is no further relevant difference between different "property options" inside (a). This is exactly why BenchExec does not look further into the parts of (a) nor does it want to bother with parts of (a), but just lets you define all of (a) in an arbitrary file in a format you choose. By using a file here, BenchExec gives you full flexibility and freedom and is as independent as possible from the shape or contents of the property. This seems like something that you would like. :-) (b) is relevant for BenchExec's result classification, of course, and thus defined in explicitly supported ways in the task definition.

As discussed above, we could consider allowing to specify parts of (a) differently than inside a file, but it seems to mostly complicate stuff.

Summarizing: I propose to be able to write task-definitions like the following + allow tools to indicate that they do not support a particular query (I know that this opens some questions for how to generate tables)

format_version: '3.0'
input_files: 'model.prism'

properties:
  - name: bounded_reach
    file: properties.props
    options:
      constants:
        T: 12 # This is a step-bound specific to this *query*, with T: 20 the result would be different
    expected_result: 0.89
  - name: mp_serve
    property:
      type: mean_payoff
      reward_structure: serve
      coalition: [ 1, 2 ]
      optimization: maximize
    expected_result: # A precise fraction
      numerator: 13254
      denominator: 12213344

My suggestion would be to move all parts of the property except the expected_result into a separate file and reference that as the property file.

  # Alternative, more flexible approach
  - name: xy # This now is the globally unique key (or unique within this file and treat (model key, property key) as globally unique)
    expected_result: 1
    definition: # After this is free form with some standardized key names
      file: properties.props
      constants: { T: 12 }
      key: property_xy # the "subproperty"

Sorry, I don't understand the intention of this part at all. But my suggestion above would likely remain the same.

There is a subtle question, in particular for quantitative queries: Is the precision requirement (e.g. exact or approximate) part of the property or the tool configuration? I think the latter is "easier" because then the tool definition can declare a matching result handler.

The question is whether the tool should get this information or not (cf. (a) and (b) above). This defines the answer. If it is (b), the tool-info module must also not get it.

A second comment, because I find it important: Thinking more on this, I really think that runexec should be much more prominent. I daresay it appeals to a much wider audience than benchexec. Runexec is much simpler to use and (sort of) a drop-in replacement for running with time or similar (correct?). Practically any paper could use this. Benchexec feels much more tailored to expert users and definitely requires at least some kind of specific setup (you need to write some configuration files at least) and probably does not pay off for a single evaluation that doesn't compare a large number of functionally similar tools.

Well, we hope that people use benchexec and not runexec except if they either have only individual runs to execute or if they already have some benchmarking framework that covers the same use case as benchexec and only want to replace the lower-level execution part. It would be sad if people would write wrappers around runexec for collecting and handling many results for every paper.

If I understand it correctly, runexec already brings most of the things I want,

If I understand you correctly, you really want benchexec because you have lots of runs to execute, you want a place where to write tool-specific logic to create command lines, you want result files with results of many runs, etc.

But, at least this is how I understand it, this is not at all true, and benchexec is "just" runexec + a way to concisely write runs + a way to merge results into a table.

This is basically correct.

incaseoftrouble commented 6 months ago

Summary at the bottom

Even if it is weird, you certainly agree that it doesn't hurt or complicate stuff in any relevant manner, right?

It reduces clarity. The insight that "this is required but you can just ignore it" is strange at first glance. But with the points below I am beginning to see the point. Again, it is mostly an insight thing: for me, "property file" has a very specific meaning which is different from what this property_file means. I am trying to collect all of my "mismatches" and will later suggest an extension to the documentation that would have helped me.

You are confirming exactly my point, even though you claim the opposite. You are not manually defining a unique set of command-line options for every single instance. You are generating the command-line options from a few specific facts for each instance. This is exactly what the tool-info module could do based on options: in the task-definition file.

Yes and no. My point about these low-level run specifications is exactly to not manually define them, but to generate them from however I generate my set of runs. In several cases this is something that I fundamentally cannot (reasonably) express in (hand-written) task-definitions (e.g. I need to filter out specific runs).

No, don't use runexec for this. If you use runexec for many runs, you will have to implement a lot of things around it, such as storing results, and you will miss out on all the benchexec features such as the tables. And we also will not implement a way to give N runs to runexec and write out the result data in some file, because we already have that: it is called benchexec.

That is just a few hundred lines of Python code, right? Much less with matplotlib / seaborn. Especially if runexec outputs results in a simple format (e.g. JSON) / I use the API. I could just replace my "run this command in docker" code with "run this command with runexec" in my existing frameworks. I don't really need the table stuff of benchexec; it's a nice bonus, but I would need to do separate data analysis anyway (e.g. run this in (textual) CI or aggregate specific sets in ways that benchexec cannot possibly know about). (This is not a criticism of benchexec or a lack of features; I just believe there is simply no "fits all" solution to data analysis.)

Also, I can't see how I can nicely give an ad-hoc generated set of runs to benchexec. Concrete example, I generate a base-set of runs from my underlying definitions, then filter by tool support, then filter by very specific tool-model-property-constants combinations (e.g. which I know to time out and need to filter to make the evaluation feasible), then filter by certain meta-data I derive for every model-tool-property-tuple, etc. (e.g. I want to say "run all of type reachability with number of states at most 10000, use these five tools and only run the combinations that are supported by all tools")

The only way I see to do that is to then dump all of these specific executions in thousands of ad-hoc generated task-definition files (I also need verification of the results!) or have these task-definitions ready and then generate the benchmark.xml. Indeed, this would be exactly what I need / want from the executions.json. But once I generate these XML files every time, I do not see a difference compared to using RunExec.

(yes this requirement is specific and I don't ask benchexec to support it. I just think that this is more common than you might think, i.e. that people want to have full control over the set of runs instead of defining them as a "product" of tool and property definitions. Which is where runexec is probably the right thing, at least as I see it. For example, I might be interested in the effect an optimization has on the runtime and quickly run the tool on the few dozen models that relate to the change; that is where powerful filtering is helpful.)

Of course! This is also the case in SV-COMP, for example. BenchExec never takes all properties from a task definition and runs the tool with each of them. It only takes the property that you have explicitly chosen in your benchmark definition. So if you want to benchmark a tool on tasks with two different properties, you need to specify both properties in the benchmark definition. The properties that you don't want to run the tool for should simply not appear in the benchmark definition.

So I need to write into the benchmark definition the precise list of all properties I want to test every time? That defeats the conciseness, i.e. the ability to reasonably write these definitions by hand. (Concretely, that would be several thousand entries per tool; each property has a more or less globally unique name.)

(again, as soon as we are in the territory of "generate the set of benchmark definitions" I see no advantage over RunExec - I don't mean this as a bad thing, I am happy with using it. I want to say that there are applications where I think using runexec is justified and should be advertised as such.)

Or you simply put all these property options into a file, and then simply use that as your property file in BenchExec. You are free to choose how this file looks like. This is what I would recommend.

non-file properties

I see your points and I am happy with this.

It will be weird to determine which type of property file I am reading, as there are different formats. But that is doable.

Given that BenchExec uses the property file as some kind of identifier, letting BenchExec know about multiple files wouldn't work well. But you don't need to. You can simply have a file with the tuple (x,y) as your property file for BenchExec.

Wait I am confused. Doesn't benchexec need to know the paths of all files that are needed by the tool-property-combination?

I can see generating (uniquely named) "meta-files" in the style you proposed and including in them all the paths of the actually needed files. The only question now is how I can get these into the isolated container. Can a tool-info have / report additional files that need to be there? And / or could a tool-info specify files to write into the container? That would be great.

The parsing of the file and the derivation of the command-line options could in principle be in the tool-info module, but our experience is that it is often better to move this logic to the tool side

Well, in my case, whether I put the logic into tool_info.py or into a run_tool.py is basically the same code. I would prefer to keep it in tool_info simply because the tools are then as "pristine" as possible (for evaluation I write the tool_infos on my own anyway).

No, subproperty is not something like your property options.

Ok that now really seems like a very (SV-COMP) specific thing :) Is that then sort of like a "certificate" / "witness" for a result?

But the "XY" in your last sentence is exactly what would be the "property" in BenchExec terms, and what you describe about QComp properties is just like for the properties in SV-COMP and Test-Comp, except that each of the properties there is typically interesting for many models and not just a single one.

So I would specify which "XY" I want from the property file by using options? But this would make property_file a non-unique key (I have many XYs in one specification file).

I can see using meta-files for this, as mentioned above.

As discussed above, we could consider allowing to specify parts of (a) differently than inside a file, but it seems to mostly complicate stuff.

I see the point and agree. But I am now worried about the "unique key" thing. Should a tool info maybe be able to report a key (if needed)? (This can be solved by having a uniquely named meta-file, but that again needs a way to specify additional files. It would be good if the tool info could report it.)

The question is whether the tool should get this information or not (cf. (a) and (b) above). This defines the answer. If it is (b), the tool-info module must also not get it.

The tool must know the precision requirements. They are, in a sense, just options. So I guess this should be part of the tool configuration. The question is how the result verifier gets to know that. The option might be named differently for different tool infos. It would, however, be more canonical to have it at the query level and let tools then indicate if they cannot support, e.g., exact results.

Also, note that it is reasonable to want to run the same tool with different precision requirements on different models / properties. This would probably boil down to defining several run-definitions and then splitting up the benchmarks along that dimension, too. (This is what my CI is doing.)

It would be sad if people would write wrappers around runexec for collecting and handling many results for every paper.

... but much better than not using either because they feel it is too much work to adapt to the specific format :) I would see runexec as a way to get "hooked" because it has a (comparatively) very low entry barrier and would drastically improve the quality of results already. (Presentation is something else, but you very likely wouldn't copy-paste the html tables from benchexec into a paper anyway.)


Summary: So I think we are aligning (/ you are convincing me / showing me how to do things)

Points:

PhilippWendler commented 6 months ago

No, don't use runexec for this. If you use runexec for many runs, you will have to implement a lot of things around it, such as storing results, and you will miss out on all the benchexec features such as the tables. And we also will not implement a way to give N runs to runexec and write out the result data in some file, because we already have that: it is called benchexec.

That is just a few hundred lines of Python code, right? Especially if runexec outputs results in a simple format (e.g. JSON).

benchexec consists of a few thousand lines of Python code on top of runexec. And there exists one part of reliable benchmarking that is only part of benchexec and not runexec: the allocation of hardware resources (requirements 3 and 4 in our paper). This is important if you have core limits, i.e., if you want the benchmark to run on fewer cores than your machine has, and especially if you have parallel runs. Core allocation is highly complex; it is already a few hundred lines of code in BenchExec, and even that currently needs to be rewritten because it does not support everything that is required. Don't do your own core allocation!

I could just replace my "run this command in docker" command with "run this command with runexec"?

If you don't need hardware allocation, then yes.
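For example, such a replacement could look roughly like the following (a sketch only; check doc/runexec.md for the exact parameter names and result keys of the RunExecutor API, and note that the command and limits below are made up):

# sketch: execute one command under runexec's measurement and limits
from benchexec.runexecutor import RunExecutor

executor = RunExecutor()
result = executor.execute_run(
    args=["./mychecker", "model.prism"],  # hypothetical tool invocation
    output_filename="run.log",            # the tool output is written here
    memlimit=8 * 1024**3,                 # in bytes
    walltimelimit=900,                    # in seconds
)
# result is a dict with measurement values, e.g. walltime, cputime, memory
print(result.get("walltime"), result.get("cputime"), result.get("memory"))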

Also, I can't see how I can nicely give an ad-hoc generated set of runs to benchexec. Concrete example, I generate a base-set of runs from my underlying definitions, then filter by tool support, then filter by very specific tool-model-property-constants combinations (e.g. which I know to time out and need to filter to make the evaluation feasible), then filter by certain meta-data I derive for every model-tool-property-tuple, etc. (e.g. I want to say "run all of type reachability with number of states at most 10000, use these five tools and only run the combinations that are supported by all tools")

The only way I see to do that is to then dump all of these specific executions in thousands of ad-hoc generated task-definition files (I also need verification of the results!).

Yes, if you want all this including classification of results, that would currently be required. But the task definitions would not need to be ad-hoc. You could for example generate all of them once and then only generate the benchmark definition that selects a subset of the task definitions. And depending on how you structure your task definitions (with systematic file names and directories) and your property files, a lot of the selection and filtering could actually be done by BenchExec without the generator needing to list all tasks individually in the benchmark definition. For example, if you represent some of your important criteria in the file name or directory structure, you can use path patterns with wildcards to include a specific subset of tasks. And if you group these patterns nicely inside <tasks> tags, you can even tell BenchExec on the command line to execute only a subset of the <tasks> tags, so you do not even need to regenerate the benchmark definition for every desired benchmark execution. I would recommend that you try this out, potentially with a subset of the full tasks, in order to get a feeling for how this would look and which parts of the process would need to be in the generator and which features can be used from BenchExec. Maybe we can also extend some of the task filtering that is available.

That leaves the question of adding a way to define tasks with classification of results but without task-definition files. If it is still desired, e.g., because the set of task-definition files is too huge to generate, we can discuss it further. We actually had this in the past, but the expected verdict was encoded in the file name of the input file to the tool, and this got really ugly once we had more than one property and expected verdict per input file. This is the reason why we created task-definition files and we certainly don't want to go back there. But we might decide to have a way to set the expected verdict in the benchmark definition. It seems, however, that this specific question of task definitions with expected verdicts but without task-definition files is separate enough from the rest that a continuation of the discussion should be in a separate issue if desired.

So I need to write into the benchmark definition the precise list of all properties I want to test every time? That defeats the conciseness (that would be several thousand entries per tool, each property has a more or less globally unique name), i.e. reasonably writing these definitions by hand.

If there is a 1:1 correspondence between task definition and property file in your case, you can solve this by encoding this 1:1 correspondence in the file names and the benchmark definition (think along the lines of having taskdef.yml and taskdef.prop and telling BenchExec to use $TASKDEF.prop for each task).

If not, we could add a feature of allowing wildcards resolving to an arbitrary number of property files for a given task, such that you could say "take abc*foo.yml and bar*.prop and create a pair for every tuple of (task definition, property) that results from this and where the property is listed in the task definition file".

What if I need another file in the container? Can I have / report additional files that need to be there? And / or could a tool info specify files to write into the container? That would be great.

First, everything from your host is visible in the container unless you explicitly hide it (with the exception of /tmp). Second, task definitions, benchmark definitions, and the tool-info module all have the option to define "required files", which are input files for the tool that are not explicitly given as command-line arguments. Deriving a configuration of the container from required files is #702 and could be implemented; it is just that nobody has requested this because of the opt-out nature of the default container config.

The parsing of the file and the derivation of the command-line options could in principle be in the tool-info module, but our experience is that it is often better to move this logic to the tool side

Well, in my case, whether I put the logic into tool_info.py or into a run_tool.py is basically the same code. I would prefer to keep it in tool_info simply because the tools are then as "pristine" as possible (for evaluation I write the tool_infos on my own anyway).

As long as you do your own evaluation, there is little difference, yes. It only becomes important if you share tool-info modules. Thus it seems relevant for competitions.

Given that BenchExec uses the property file as some kind of identifier, letting BenchExec know about multiple files wouldn't work well. But you don't need to. You can simply have a file with the tuple (x,y) as your property file for BenchExec.

Wait I am confused. Doesn't benchexec need to know the paths of all files that are needed by the tool-property-combination?

I can see generating (uniquely named) "meta-files" in the style you proposed and including in them all the paths of the actually needed files. The only question now is how I can get these into the isolated container.

Should be resolved by the explanation above.

No, subproperty is not something like your property options.

Ok that now really seems like a very (SV-COMP) specific thing :) Is that then sort of like a "certificate" / "witness" for a result?

No. It is an addition to the expected verdict that is needed to determine whether the tool output is correct or wrong, just like the allowed error interval in #996.

But the "XY" in your last sentence is exactly what would be the "property" in BenchExec terms, and what you describe about QComp properties is just like for the properties in SV-COMP and Test-Comp, except that each of the properties there is typically interesting for many models and not just a single one.

So I would specify which "XY" I want from the property file by using options?

No, each XY is represented as one file.

But this would make property_file a non-unique key (I have many XYs in one specification file).

The problem of property_file being a non-unique key is exactly one of my arguments against having additional property options directly in the task definition.

As discussed above, we could consider allowing to specify parts of (a) differently than inside a file, but it seems to mostly complicate stuff.

I see the point and agree. But I am now worried about the "unique key" thing. Should a tool info maybe be able to report a key (if needed)?

Again, this is the argument for keeping everything inside property files, such that their name is necessarily unique.

The question is whether the tool should get this information or not (cf. (a) and (b) above). This defines the answer. If it is (b), the tool-info module must also not get it.

The tool must know the precision requirements. They are, in a sense, just options. So I guess this should be part of the tool configuration.

Then it seems like it is conceptually part of the property: "give me the value of X with precision Y".

The question is how the result verifier gets to know that.

The result verifier always needs to decide based on the property.

The option might be named differently for different tool infos.

If you want to benchmark different tools, there needs to be one tool-independent way of writing down these inputs/options and a mapping layer that translates it into a tool-specific command-line argument (or whatever). BenchExec would use the tool-independent value and the mapping would either be in the tool-info module or a wrapper script (or you could in theory define the property options independently in the property file and in a tool-specific way in the benchmark definition, but I would recommend that only if you generate everything).

Also, note that it is reasonable to want to run the same tool with different precision requirements on different models / properties. This would probably boil down to defining several run-definitions and then splitting up the benchmarks along that dimension, too. (This is what my CI is doing.)

These precision requirements (if relevant for result classification) are simply part of the property from the point of view of BenchExec, together with what you would call the property. You don't even need several run definitions for that.

It would be sad if people would write wrappers around runexec for collecting and handling many results for every paper.

... but much better than not using either because they feel it is too much work to adapt to the specific format :)

Sure.

I would see runexec as a way to get "hooked" because it has a (comparatively) very low entry barrier and would drastically improve the quality of results already. (Presentation is something else, but you very likely wouldn't copy-paste the html tables from benchexec into a paper anyway.)

No, please don't do that, and neither for the plots. But it is really useful during research to open the HTML table and interactively click around, filter, define plots, etc. And, because we of course all care about artifact evaluation, it is extremely cool if in your artifact readme you have a link to an HTML table that directly opens a specific view of a plot that is the same plot as in the paper, such that reviewers can see that the plot in the paper is re-creatable from the raw data, and that with the same link reviewers can re-create the same plot after re-running the experiment to see if their data matches. We have this for example in this artifact, it is the links to figures and tables in the TL;DR section of the readme.

incaseoftrouble commented 6 months ago

benchexec consists of a few thousand lines of Python code on top of runexec.

Sorry, I didn't mean it like that! What I meant is that writing just the result aggregation needed for one specific evaluation is not too much work, I daresay.

This is important if you have core limits, i.e., if you want the benchmark to run on fewer cores than your machine has, and especially if you have parallel runs.

Oh! Yes, I definitely need that (and practically any evaluation should do this IMO). But the docs say there are --memlimit and --cores for runexec, too? If that isn't possible with runexec / the RunExec API, then it is out of the question.

I have the luxury of not needing parallel runs, since we have few enough that I don't need this level of management. I see that allocating resources for parallel executions is completely different and even more involved. I "just" want single-core + memory limits, clean isolation (e.g. "reserving" one physical core and pinning the process to one logical core), and resource-usage measurements like peak memory.

Don't do your own core allocation!

Absolutely, exactly this is the (sole) reason why I want to use BenchExec. Getting resource isolation right is super hard.

For example, if you represent some of your important criteria in the file name or directory structure, you can use path patterns with wildcards to include a specific subset of tasks. (...) Maybe we can also extend some of the task filtering that is available.

Being able to add a list of "tags" to the properties would be super cool and non-intrusive for this! These would not be visible to anything except the filtering mechanism. This would allow effectively arbitrary filtering on the concrete instances (by putting whatever property you are interested in as a tag) while being in a canonical place. (Proposal) Filtering could then be done either in the tasks definition or additionally on the command line (with --require-tags / --exclude-tags).

If it is still desired, e.g., because the set of task-definition files is too huge to generate.

I think it is fine, given the above. It also really makes sense to have expected verdicts in a tool-independent location.

I see now the two different perspectives generic tasks and task-definition have (i.e. one is not meant as a replacement for the other, they serve different purposes).

First, everything from your host is visible in the container unless you explicitly hide it (with the exception of /tmp)

Oh, I was assuming it is opt-in, sorry! That also makes more sense, given that the "container" doesn't come with system libraries etc. Then forget these points.

One neat point would still be to be able to write files specifically for the execution, but this can then be done by a wrapper script.

No, each XY is represented as one file.

Aha, I understand. This is where a lot of confusion came from. For me, property_file is "a file containing a bunch of properties" (this interpretation of the term "property file" is more or less universal in probabilistic verification, simply because the de-facto standard modelling language calls it that). Here it is "everything needed to fully describe a query". Then I totally agree that everything should be put into that file and that this file is a clean "adapter" to the domain-specific way of writing queries.

No. It is an addition to the expected verdict that is needed to determine whether the tool output is correct or wrong.

Wouldn't it be cleaner to then say expected_verdict: "False(xy)" instead of having a separate key? Given that it is domain-specific, just as a precision requirement. It would fit well with the idea of introducing different result interpreters, which would change the type of expected_verdict from bool to "anything" anyway.

It could even be

expected_verdict:
  value: False
  sub_property: xy

or something like that, if proper types are desired.
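Just to make this concrete, a purely hypothetical sketch of how a result interpreter could accept both the current flat keys (if I recall the key name correctly) and the structured form above:

    # Purely hypothetical sketch; nothing here is existing BenchExec code.
    def parse_expected_verdict(property_entry):
        raw = property_entry["expected_verdict"]
        if isinstance(raw, dict):                             # structured form from above
            return bool(raw["value"]), raw.get("sub_property")
        return bool(raw), property_entry.get("subproperty")   # current flat keys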

But it is really useful during research to open the HTML table and interactively click around, filter, define plots, etc.

Agree, but this feels mainly useful for the intersection of "users that are technically adept enough to set up benchexec" and "users that are not fluent in data analysis" (which is a not-so-small set!). That's why I did not consider it important for me personally.

(I personally usually follow the process of having a script that reads raw results and converts them to CSVs, and then these CSVs are rendered to plots by TikZ / pgfplots. Then even the visuals match precisely with the paper.)


I think it is good if I track what I currently feel is still open, as above:

PhilippWendler commented 6 months ago

This is important if you have core limits, i.e., if you want the benchmark to run on fewer cores than your machine has, and especially if you have parallel runs.

Oh! Yes I definitely need that (and practically any evaluation should do this IMO). But the doc says there is --memlimit and --cores also for runexec? If that isn't possible with runexec / RunExec API then it is out of the question.

Memory limit is possible, and restricting to a given set of cores as well (this is what --cores does). But something needs to decide which cores to use, and that is only implemented in benchexec. So with benchexec you can say "use 8 cores", with runexec you need to say "use cores 0,2,4,6,8,10,12,14". The reason why this is in benchexec is that this feature is especially relevant if more than one run is executed in parallel, which is something only benchexec knows about.

I have the luxury of not needing parallel runs, since we have few enough so I don't need this level of management. I see that allocating resources for parallel executions is completely different and even more involved. I "just" want single-core + memory limits, clean isolation (e.g. "reserving" one physical core and pinning the process to one logical core), and resource usage like peak memory.

If you want a single core, you can pick one manually and pass it to runexec. As long as nothing else on the machine uses core pinning (and ensuring this is your responsibility with both benchexec and runexec, because both know nothing about other processes!), every core is the same (if you don't have one of these Intel/Apple CPUs with different kinds of cores). runexec can do pinning.
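For example, a minimal sketch with the Python API (the parameter names and result keys are taken from doc/runexec.md and should be double-checked there; the tool command is made up):

    # Rough sketch only; parameter names and result keys as documented in
    # doc/runexec.md (memlimit is in bytes), please double-check there.
    from benchexec.runexecutor import RunExecutor

    executor = RunExecutor()
    run_result = executor.execute_run(
        args=["./mytool", "model.prism", "query.props"],  # example tool command
        output_filename="output.log",     # the tool output is redirected here
        cores=[3],                        # pin the run to one manually chosen core
        memlimit=8 * 1000 * 1000 * 1000,  # memory limit in bytes (8 GB)
        hardtimelimit=900,                # CPU-time limit in seconds
    )
    # run_result is a dict of measurements, e.g., walltime, cputime, memory.
    print(run_result.get("walltime"), run_result.get("cputime"), run_result.get("memory"))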

Also note that neither benchexec nor runexec can do "reserving". We rely on the OS scheduler to mostly use the other cores for other processes if it sees some cores being loaded by pinned applications. The reason is that core reservation wasn't feasible at all in the past and has only become somewhat possible with cgroups v2. But first the system administrator would be required to reserve some cores for the respective user, and then BenchExec would pick some of the reserved cores and reserve them for itself. This would be doable but is not implemented.

For example, if you represent some of your important criteria in the file name or directory structure, you can use path patterns with wildcards to include a specific subset of tasks. (...) Maybe we can also extend some of the task filtering that is available.

Being able to add a list of "tags" to the properties would be super cool and non-intrusive for this! These would not be visible to anything except the filtering mechanism. This would allow effectively arbitrary filtering on the concrete instances (by putting whatever property you are interested in as a tag) while being in a canonical place. (Proposal) Filtering can then be done either in the task definition or additionally on the command line (with --require-tags / --exclude-tags).

Would be possible. Still a lot remains unclear, e.g., how to attach labels to property files, or whether this should be a more generic task filter or restricted to properties, and this would also be a good case for a separate issue. Precedent is #529, which got resolved by a filter on the expected verdict.

No. It is an addition to the expected verdict that is needed to determine whether the tool output is correct or wrong.

Wouldn't it be cleaner to then say expected_verdict: "False(xy)" instead of having a separate key? Given that it is domain-specific, just as a precision requirement. It would fit well with the idea of introducing different result interpreters, which would change the type of expected_verdict from bool to "anything" anyway.

Are you aware how funny this is? ;-) Cf. https://github.com/sosy-lab/benchexec/issues/994#issuecomment-1952188814 "Definitely in favor of additional keys (in general, I am against magic inside strings :) )"

Yes, expected_verdict could be defined differently in a number of ways, but of course we only change it to support additional features, not only because of the style.

But it is really useful during research to open the HTML table and interactively click around, filter, define plots, etc.

Agree, but this feels mainly useful for the intersection of "users that are technically adept enough to set up benchexec" and "users that are not fluent in data analysis" (which is a not-so-small set!). That's why I did not consider it important for me personally.

(I personally usually follow the process of having a script that reads raw results and converts them to CSVs, and then these CSVs are rendered to plots by TikZ / pgfplots. Then even the visuals match precisely with the paper.)

Sure, for the plots in the paper we do the same (and everyone should, we even provide scripts and templates). But writing pgfplots code for every plot that you would want to get a quick look at, even before you have started to think about which plots should be in the paper, is something that I definitely don't want to do. We also use the interactive tables in meetings with coauthors and even in internal presentations of ongoing work, and the audience often asks "can you please show XY" during the talk or the questions phase because everyone recognizes BenchExec tables and knows what they can do with a few clicks. :-)

I think it is good if I track what I currently feel is still open, as above:

I don't know what you consider resolved. For me #996 and #997 are open. The current issue is too "full" already and can serve for discussion, not as an action item for implementation. Every concrete request needs to get its own issue.

  • Can / should tool-info be able to indicate that certain concrete instances are not supported? (should come at basically no additional cost and be easily backwards compatible)

Easy, but I am unsure whether it makes sense conceptually. The tool-info module is not intended to be configuration of the benchmarking experiment, its goal is to be a (reusable and sharable) description of how to run the tool.

  • I think sub_property should be changed

Cf. above, doing this purely for style would just be a hassle for users.

  • tags or similar for properties?

Cf. above.

incaseoftrouble commented 6 months ago

this would also be a good case for a separate issue.

yes, I just wanted to ask whether it is a sensible thing to talk about. Details go into a separate issue.

Are you aware how funny this is? ;-)

yes and no, these points are subtly different :)

Yes, expected_verdict could be defined differently in a number of ways, but of course we only change it to support additional features, not only because of the style.

Agree, but since the format of task definitions has to be changed anyway if one wants to support approximate verdicts, I see no reason why this shouldn't be "cleaned up" / unified in that same step. I wouldn't want to change this for this reason alone.

I find it better, style-wise, to have everything related to defining the expected verdict in one place :) Having one domain-specific key on the top level but others on a different level feels weird and is confusing from a user perspective. When I was reading the definition of task-definition.yml, this was by far the most confusing point. It is not apparent from the format that this key is purely related to validating the expected result.

Cf. above, doing this purely for style would just be a hassle for users.

Users don't need to change anything, since files declaring format_version: 2.0 keep behaving the way they do now.

But I don't have too strong an opinion here. I just think that if the format is touched anyway, why not make it nice.

plots

I'm not arguing that the plots of benchexec aren't useful!

I don't know what you consider resolved.

Things where I realize I was wrong or we agree that this is something concrete to be discussed.

Easy, but I am unsure whether it makes sense conceptually. The tool-info module is not intended to be configuration of the benchmarking experiment, its goal is to be a (reusable and sharable) description of how to run the tool.

And, I would argue, a description of the tool's capabilities!

Example: The models I work with can have many different classes (DTMC, CTMC, MDP, CTMDP, SMG, ...) and properties can be of even more types. Some tools only support properties which are written in a specific format (even if they are of the same type) or with a particular configuration. It is possible but annoying to encode this in the filename (since every demand of every tool has to be anticipated when naming the files). Just as an example, this is a literal excerpt of the filtering code I used:

    def is_supported(self, model: Model, query: Query) -> bool:
        if self.engine == "expl": # Tool specific option
            return isinstance(query, Reachability) and model.model_type in {
                ModelType.MDP,
                ModelType.DTMC,
            }
        if self.engine == "hybrid":
            return isinstance(query, Reachability)
        if isinstance(query, MeanPayoff):
            # For this tool, reward reference must be a string
            return isinstance(query.rewards, str) and model.model_type in {
                ModelType.MDP,
                ModelType.DTMC,
            }
        if isinstance(query, Reachability):
            return model.model_type != ModelType.SMG
        return False

This filters out more than half of the executions I would run. Example: some tools are run on only ~12 out of >100 instances.

Basically, what I want is equivalent to running the tool and it giving the result ERROR(unsupported), just without the hassle of actually running the tool. In other words, tools can "give up" early.

EDIT: Also, checking whether the execution actually terminates or fails with an error is not really helpful. Some tools are extensions of another tool for new property / model types, so to evaluate such a tool we only want to run it on the new stuff (even if it supports the old ones by delegating to the base tool's codebase).

One could solve it by tagging (e.g. "toolx-supported" "tooly-supported"), but it would IMO not be nice to have tool-specific information in the task definitions.

PhilippWendler commented 6 months ago

Easy, but I am unsure whether it makes sense conceptually. The tool-info module is not intended to be configuration of the benchmarking experiment, its goal is to be a (reusable and sharable) description of how to run the tool.

And, I would argue, a description of the tool's capabilities! [...] Basically, what I want is equivalent to running the tool and it giving the result ERROR(unsupported), just without the hassle of actually running the tool. In other words, tools can "give up" early.

I understand the use case. And we do have something like this already, but more related to the "shape" of the tasks instead of the content: For example, tool-info modules for tools that do not support more than one input file will throw UnsupportedFeatureException if given a task with > 1 input file, and similarly if a property file is required by the tool but missing. These are cases where the tool-info module really cannot build a command line.
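For illustration, a rough sketch of this pattern in a tool-info module (a sketch only: the exact BaseTool2 names should be checked against benchexec/tools/template.py, and the tool name and command-line layout are made up):

    # Rough sketch of a tool-info module; the tool name and command-line
    # layout are made up, and the BaseTool2 names should be checked against
    # benchexec/tools/template.py.
    import benchexec.tools.template
    from benchexec.tools.template import UnsupportedFeatureException


    class Tool(benchexec.tools.template.BaseTool2):
        def executable(self, tool_locator):
            return tool_locator.find_executable("mytool")

        def name(self):
            return "MyTool"

        def cmdline(self, executable, options, task, rlimits):
            if task.property_file is None:
                # Without a property file no command line can be built at all.
                raise UnsupportedFeatureException("A property file is required")
            if len(task.input_files_or_identifier) != 1:
                # This (made-up) tool supports only a single input file.
                raise UnsupportedFeatureException("Only a single input file is supported")
            # Arbitrary user-specified options are passed through unchanged.
            return [executable, *options, task.property_file, *task.input_files_or_identifier]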

For cases where "just" the property is unsupported (or the input language or whatever) I recommend to not hard-code this in the tool-info module, because it makes it more difficult to experiment with future versions of the tool that might support more features. Consider for example a developer working on a fork aiming at adding one more supported property format. They would be glad to be able to use BenchExec with the existing tool-info module.

Another argument to consider here is that we always have to be careful when we just skip something in unwanted situations, instead of making BenchExec fail (the latter happens right now on UnsupportedFeatureException). The former always has the risk of users accidentally benchmarking something different (e.g., less) than what they think. A requirement to explicitly specify the exact set of desired tasks, and produce an error if the tool does not support all of them, is more along the lines of "robust and reliable benchmarking". This is something that some BenchExec users strongly desire.

So, in principle we could add something like --ignore-unsupported-feature or so that makes BenchExec just skip a run if the tool-info module throws the exception. But given the above two arguments of generality of tool-info modules and robustness, I am not strongly in favor.

EDIT: Also, checking whether the execution actually terminates or fails with an error is not really helpful. Some tools are extensions of another tool for new property / model types, so to evaluate such a tool we only want to run it on the new stuff (even if it supports the old ones by delegating to the base tool's codebase).

But you would certainly have to resolve such a case by structuring your benchmark definitions and tasks appropriately. Refusing some tasks in the tool-info module although the tool would support them is certainly not recommended.

One could solve it by tagging (e.g. "toolx-supported" "tooly-supported"), but it would IMO not be nice to have tool-specific information in the task definitions.

Yes.

incaseoftrouble commented 6 months ago

These are cases where the tool-info module really cannot build a command line.

Yes, and this might already be the case if the property is supported but "just" written in the wrong format. This needs to be hard-coded in some way because I need to translate to the tool-specific way of asking for a property anyway. If someone extended the tool to support the new property, they would fundamentally need to modify the existing tool-info. (I would argue there is, conceptually, not much difference between "shape" and "type".)

They would be glad to be able to use BenchExec with the existing tool-info module.

I see the point, but modifying the existing tool-info would be reasonable in this case.


I see the overall point, but I think for usability either the tool-info has to be able to filter or a powerful filtering mechanism is needed in cases where the "compatibility matrix" is really involved. Filtering by file name / wildcard is not really nice, since, as I understand it, the idea is that all properties related to a system should be in one file (not one system-property combination per task-def), but each tool might only support a subset of the properties.

I can see adding all the filtering-relevant properties into a set of tags and writing benchmark definitions that are something like "everything in this folder that has tags x y z". This, however, shifts the burden onto writing the task-definition files (whenever a new tool comes around that has a new compatibility dimension, every task-def needs to be re-written).

There is one point for explicitly treating this differently than filtering: If I can say "run every tool on every benchmark" and tools can explicitly report that they do not support something, I can see in the result tables how many things each tool did support (different from "I filtered these out beforehand").

PhilippWendler commented 6 months ago

These are cases where the tool-info module really cannot build a command line.

Yes, and this might already be the case if the property is supported but "just" written in the wrong format. This needs to be hard-coded in some way because I need to translate to the tool-specific way of asking for a property anyway. If someone extended the tool to support the new property, they would fundamentally need to modify the existing tool-info. (I would argue there is, conceptually, not much difference between "shape" and "type".)

In the case where the tool-info module cannot know what command line it should produce, it simply needs to throw UnsupportedFeatureException, of course. But there are many cases where it can do so, even if the result is a command line that will just make the tool fail - but a newer version of the tool could potentially succeed. A similar case exists for command-line parameters: Users can specify arbitrary command-line parameters in the benchmark definition. The tool-info module should not check against an internal list whether they are all valid, it should just put them into the command line. And if the tool accepts all formats of inputs and properties in the same way on the command line, then the tool-info module also does not need to bother (and should not) with distinguishing them.

They would be glad to be able to use BenchExec with the existing tool-info module.

I see the point, but modifying the existing tool-info would be reasonable in this case.

I don't think so. Imagine you are a maintainer of tool X and some other developer Y is working on feature Z. Would you be in favor of the tool-info module in BenchExec being changed according to the wishes/needs of Y for Z, even if one still doesn't know whether the code from Y will get merged into X?

So a tool-info module should not impose restrictions that are not technically necessary.

For the rest of your comment, it seems like you are talking mostly about filtering? Let's please keep that discussion in #998.

There is one point for explicitly treating this differently than filtering: If I can say "run every tool on every benchmark" and tools can explicitly report that they do not support something, I can see in the result tables how many things each tool did support (different from "I filtered these out beforehand").

Sure, that sounds like a good thing, right? And tools of course can already explicitly report that they do not support something - they just have to output a proper error message (which they should do anyway for the sake of their users) and the tool-info module has to transform that into a useful status value like ERROR(unsupported property foo) or so.
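Roughly like this, continuing the tool-info sketch from above (run.output and run.exit_code as documented for BaseTool2 in benchexec/tools/template.py; the tool's messages are made up):

    # Continuation of the tool-info sketch from above; the tool's output
    # format and error messages are made up for illustration.
    import benchexec.result as result
    import benchexec.tools.template


    class Tool(benchexec.tools.template.BaseTool2):
        # executable(), name(), and cmdline() as sketched further above ...

        def determine_result(self, run):
            for line in run.output:                 # the tool output, line by line
                if "unsupported property" in line:  # example error message of the tool
                    return "ERROR (unsupported property)"
                if line.startswith("Result:"):      # example result line of the tool
                    return line.split(":", 1)[1].strip()
            if run.exit_code.value != 0:
                return result.RESULT_ERROR
            return result.RESULT_UNKNOWN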

What we discuss is just optimizing away those trivial runs. So it would actually be nice to have numbers: a concrete use case with X runs in total, out of which one would want to filter out Y upfront, and where those Y tasks cost Z amount of resources to execute.

incaseoftrouble commented 6 months ago

A similar case exists for command-line parameters: Users can specify arbitrary command-line parameters in the benchmark definition. The tool-info module should not check against an internal list whether they are all valid, it should just put them into the command line.

Fair point!

Would you be in favor of the tool-info module in BenchExec being changed according to the wishes/needs of Y for Z, even if one still doesn't know whether the code from Y will get merged into X?

Oh, no, absolutely not, but I would also guess that most tool-info modules won't get merged into the benchexec distribution. I would even say they shouldn't for most artifact evaluations. I think it is sensible if tools provide their tool-info module in their repo and only tool-info modules of major, stable tools are in benchexec. I am mainly seeing this from an artifact-evaluation perspective, not a recurring competition. But anyway, I do see the argument.

output a proper error message (which they should do anyway for the sake of their users)

That would be nice, yes :D

What we discuss is just optimizing away those trivial runs.

Exactly! These can be very substantial (potentially close to 80% of the executions in particular setups)

However, I see the argument in the context of a wrong tool_command (or however it is called) and things that are unsupported by the tool. It probably is best to handle this via filtering, so I'll move what remains to #998.


I think that concludes the broad discussion. Thanks a lot for taking the time!

incaseoftrouble commented 6 months ago

Oh, one more point: I just realized that one task is one model with several properties, so one task-definition is exactly one task. This seems counter-intuitive to me. Shouldn't different model-property combinations be considered independent? (At least this is how it would be for me; the different properties we are interested in for a given model are often completely unrelated.)

I'm mainly curious about the reasoning why this is the case. This feels unnatural.

PhilippWendler commented 6 months ago

Would you be in favor of the tool-info module in BenchExec being changed according to the wishes/needs of Y for Z, even if one still doesn't know whether the code from Y will get merged into X?

Oh, no, absolutely not, but I would also guess that most tool-info modules won't get merged into the benchexec distribution. I would even say they shouldn't for most artifact evaluations. I think it is sensible if tools provide their tool-info module in their repo and only tool-info modules of major, stable tools are in benchexec. I am mainly seeing this from an artifact-evaluation perspective, not a recurring competition. But anyway, I do see the argument.

It is explicitly a feature of BenchExec to work with tool-info modules that are defined outside of its own code. But we do want to encourage people to submit tool-info modules to BenchExec for inclusion, and the bar for this is low. The reason for this is that we are just as much in favor of people using BenchExec for all kinds of experiments, and the easier it is, the more people will do it. We also have tool-info modules written by third parties (not the original tool developers), so if you perform a comparison against the state of the art for some paper, we invite you to submit your tool-info modules, as long as they do not hard-code specifics of your experiment.

Competitions are of course a strong argument for having a public tool-info module, and btw. we like it very much if competitions make having a merged tool-info module a participation requirement. We also strive to help competitions in this regard by timely reviewing of PRs with such modules (which often catches errors that would otherwise only have been found during pre-runs) and by coordinating releases with the competition schedule (such that a competition can point to an official release of BenchExec and declare that this is the state with which the competition was executed).

I think that concludes the broad discussion. Thanks a lot for taking the time!

You are welcome! We are always glad when more people adopt BenchExec and try to help! Especially for competitions we are interested in them using BenchExec, because they are a great way to spread this best practice.

Oh, one more point: I just realized that one task is one model with several properties, so one task-definition is exactly one task. This seems counter-intuitive to me. Shouldn't different model-property combinations be considered independent? (At least this is how it would be for me; the different properties we are interested in for a given model are often completely unrelated.)

We consider one "task" to be the tuple "(input files, property, expected verdict)" (the latter two parts are optional). For the combination of one task and one tool one "run" is created, i.e., one concrete execution of the tool. One task-definition file can thus define several tasks.

I realized that our glossary did not precisely define this, but I added it now.