omnibenchmark / omni_workflow

GNU General Public License v3.0

Initial Snakemake workflow definition from benchmark yaml file #6

Closed DanInci closed 5 months ago

DanInci commented 6 months ago

Mandatory:

Nice to have:

imallona commented 6 months ago

Params are hashed now, e.g. out/data/D1/process/P1/param_0. Where can we see the actual parameter mapping (["-a 0", "-b 0.1"]?) to translate param_0? Shouldn't that be in a plain text file inside out/data/D1/process/P1/? @DanInci

DanInci commented 6 months ago

> Params are hashed now, e.g. out/data/D1/process/P1/param_0. Where can we see the actual parameter mapping (["-a 0", "-b 0.1"]?) to translate param_0? Shouldn't that be in a plain text file inside out/data/D1/process/P1/? @DanInci

@imallona I think that would be harder to implement. But now, all parameters are managed by snakemake and are accessible inside each rule. For instance:

print('  params are', snakemake.params)

An easy thing for browsability would be to output them in a parameters.txt file inside the out/data/D1/process/P1/param_0 folder.
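The idea could be sketched roughly as below. This is only an illustration, not the project's actual implementation: the `write_parameters` helper name is hypothetical, and in a real Snakemake rule the values would come from `snakemake.params` rather than being passed in by hand.

```python
import tempfile
from pathlib import Path

def write_parameters(param_dir, params):
    """Hypothetical helper: dump the parameter values used for one hashed
    combination into a parameters.txt inside that param folder, one flag
    per line, so the hash can be translated by just reading the file."""
    out = Path(param_dir) / "parameters.txt"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(params) + "\n")
    return out

# In a real rule these values would be snakemake.params; here we reuse
# the example values from the discussion above.
base = Path(tempfile.mkdtemp())
path = write_parameters(base / "out/data/D1/process/P1/param_0",
                        ["-a 0", "-b 0.1"])
print(path.read_text(), end="")
```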

imallona commented 5 months ago

I like the "output them in a parameters.txt file inside the out/data/D1/process/P1/param_0" idea a lot, that would help browsing, indeed

imallona commented 5 months ago

Can you please add a get_module_repository() function (via the converter?) returning the YAML repo string for a given module_id? I'm afraid I don't find the converter logic intuitive and failed to do it myself (spent 10 min). Thanks
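The requested lookup might look like the sketch below. The benchmark YAML schema shown here is invented for illustration (the real schema and converter API are not shown in this thread); only the shape of the `get_module_repository` search is the point.

```python
# Hypothetical parsed benchmark YAML: field names are assumptions,
# chosen only to make the lookup concrete.
benchmark = {
    "stages": [
        {"id": "process",
         "modules": [
             {"id": "P1",
              "repository": "https://github.com/example/P1"},
             {"id": "P2",
              "repository": "https://github.com/example/P2"},
         ]},
    ]
}

def get_module_repository(benchmark, module_id):
    """Return the repository string of the module with the given id,
    searching across all stages; None if the id is unknown."""
    for stage in benchmark.get("stages", []):
        for module in stage.get("modules", []):
            if module.get("id") == module_id:
                return module.get("repository")
    return None

print(get_module_repository(benchmark, "P1"))
```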

DanInci commented 5 months ago

@imallona done

imallona commented 5 months ago

Three suggestions/requests:

One. Could we remove run folders and run_ids completely from the codebase? They're bound to break the "implicit" provenance, as discussed earlier this week. We discussed the advantages of keeping the "stage, module, params" triples (nested folders) as "benchmarking units", without run folders.

Two (linked to the run_id removal). Could you then modify the pre and post parsing so the post is always the most nested, deepest pre? This would distinguish (params removed for the sake of clarity) out/data/D1/process/P1/method/M1/M1.res from out/data/D1/method/M1/M1.res when some stages can be skipped.
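One possible reading of "the post is always the deepest pre" is a tie-break over candidate predecessor paths: when a stage can be skipped, several candidate chains exist, and the most nested one carries the fullest provenance. A minimal sketch (the `deepest_path` name is hypothetical):

```python
from pathlib import PurePosixPath

def deepest_path(candidates):
    """Among candidate 'pre' paths, pick the most nested one: the
    longest folder chain is the fullest provenance trail."""
    return max(candidates, key=lambda p: len(PurePosixPath(p).parts))

pre_candidates = [
    "out/data/D1/method/M1/M1.res",             # chain with 'process' skipped
    "out/data/D1/process/P1/method/M1/M1.res",  # full chain
]
print(deepest_path(pre_candidates))
```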

Three. Parameter combinations are hashed into param_0, param_1 folder names. Currently (f4bd14c7e9823393e61d741dcfe2ebff38a36d07) we are generating a parameters.txt (hash translation) file under each hashed param folder, e.g. at

cat out/data/D1/default/run/process/P2/param_1/run/parameters.txt
-a 1
-c 0.1

Hence we end up with as many parameters.txt files as there are hashes. I wonder whether you @DanInci and @almutlue would also find it helpful to generate an aggregated parameter file directly inside the module folder, e.g. at data/D1/default/run/process/P1/parameter_mapping.txt, reporting (tab-separated) all tested param_hash:meaning pairs:

folder_name parameter_meaning
param_0 "-d1 -e 1"
param_1 "-d1 -e 2"

I certainly would prefer the aggregated parameter_mapping.txt file directly inside the module folder.
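A minimal sketch of the aggregated file, assuming a hypothetical `write_parameter_mapping` helper and illustrative parameter strings (the real values would come from the benchmark's parameter matrix):

```python
import tempfile
from pathlib import Path

def write_parameter_mapping(module_dir, param_strings):
    """Hypothetical helper: collect all hash -> parameter translations
    into a single tab-separated parameter_mapping.txt at module level."""
    module_dir = Path(module_dir)
    module_dir.mkdir(parents=True, exist_ok=True)
    out = module_dir / "parameter_mapping.txt"
    lines = ["folder_name\tparameter_meaning"]
    for i, params in enumerate(param_strings):
        # Folder names follow the param_<index> convention from the thread
        lines.append(f'param_{i}\t"{params}"')
    out.write_text("\n".join(lines) + "\n")
    return out

# Illustrative parameter combinations only
base = Path(tempfile.mkdtemp())
path = write_parameter_mapping(base / "data/D1/default/run/process/P1",
                               ["-d 1 -e 1", "-d 1 -e 2"])
print(path.read_text(), end="")
```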

Thanks in advance!

DanInci commented 5 months ago

@imallona @almutlue All previous concerns, and the ones discussed on Friday should be addressed. How should we proceed with this PR? Leave it open until we get some test feedback?

DanInci commented 5 months ago

Found a corner case in Peiying's tests that breaks the implicit provenance rule.

To exemplify, let's say there are 4 stages named data, normalization, select_lognorm and dimred. They are all supposed to run successively, one after the other: data -> normalization -> select_lognorm -> dimred.

They output the following files:

And they require the following input files:

The problem is that successive stages require the initial dataset, while a later stage (dimred) requires inputs from both of the previous stages. In terms of provenance, two different paths are merging.

An easy solution would be to make benchmarkers require inputs from previous stages when they know those will be needed downstream, e.g. the required input for select_lognorm would be data, data.normalized.

Another solution would be to implement some lookahead functionality when compiling the Snakemake workflow and implicitly link the required inputs if they are needed in downstream stages.
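The lookahead idea could be sketched roughly as follows. This is not the implemented behavior (the edit below notes contributors are expected to handle it themselves); the stage specs and the `with_lookahead` helper are illustrative, built from the data/normalization/select_lognorm/dimred example above.

```python
def with_lookahead(stages):
    """Extend each stage's inputs with any file produced upstream that a
    later stage still needs, so every required input is threaded through
    the linear chain instead of merging two provenance paths."""
    produced, result = set(), {}
    for i, stage in enumerate(stages):
        future_needs = set()
        for later in stages[i + 1:]:
            future_needs |= set(later["inputs"])
        # Carry forward only what already exists upstream and is not
        # (re)produced by this stage itself
        carry = (future_needs & produced) - set(stage["outputs"])
        result[stage["name"]] = sorted(set(stage["inputs"]) | carry)
        produced |= set(stage["outputs"])
    return result

# Hypothetical stage specs mirroring the corner case described above
stages = [
    {"name": "data",           "inputs": [],                        "outputs": ["data"]},
    {"name": "normalization",  "inputs": ["data"],                  "outputs": ["data.normalized"]},
    {"name": "select_lognorm", "inputs": ["data.normalized"],       "outputs": ["data.selected"]},
    {"name": "dimred",         "inputs": ["data", "data.selected"], "outputs": ["data.dimred"]},
]
print(with_lookahead(stages))
```

Under these assumptions, select_lognorm ends up requiring both data and data.normalized, matching the manual fix suggested above.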

How should we deal with situations like this? EDIT: Not an actual problem. Contributors / benchmarkers are expected to deal with this themselves.

imallona commented 5 months ago

> All previous concerns, and the ones discussed on Friday should be addressed. How should we proceed with this PR? Leave it open until we get some test feedback?

I'd suggest merging to main already, @DanInci, and starting to clean up the old repos / discontinued omnibenchmark