omnibenchmark / omni_workflow

GNU General Public License v3.0

Initial Snakemake workflow definition from benchmark yaml file #6

Closed DanInci closed 5 months ago

DanInci commented 6 months ago

Mandatory:

Nice to have:

imallona commented 6 months ago

Params are hashed now, e.g. out/data/D1/process/P1/param_0. Where can we see the actual parameter mapping (["-a 0", "-b 0.1"]?) to translate param_0? Shouldn't that be in a plain text file inside out/data/D1/process/P1/? @DanInci

DanInci commented 6 months ago

> Params are hashed now, e.g. out/data/D1/process/P1/param_0. Where can we see the actual parameter mapping (["-a 0", "-b 0.1"]?) to translate param_0? Shouldn't that be in a plain text file inside out/data/D1/process/P1/? @DanInci

@imallona I think that would be harder to implement. But now, all parameters are managed by snakemake and are accessible inside each rule. For instance:

print('  params are', snakemake.params)

An easy thing for browsability would be to output them in a parameters.txt file inside the out/data/D1/process/P1/param_0 folder.
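The idea could be sketched roughly as below. This is only an illustration, not the project's actual implementation: the `write_parameters` helper name is hypothetical, and in a real Snakemake rule the values would come from `snakemake.params` rather than being passed in by hand.

```python
import tempfile
from pathlib import Path

def write_parameters(param_dir, params):
    """Hypothetical helper: dump the parameter values used for one hashed
    combination into a parameters.txt inside that param folder, one flag
    per line, so the hash can be translated by just reading the file."""
    out = Path(param_dir) / "parameters.txt"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(params) + "\n")
    return out

# In a real rule these values would be snakemake.params; here we reuse
# the example values from the discussion above.
base = Path(tempfile.mkdtemp())
path = write_parameters(base / "out/data/D1/process/P1/param_0",
                        ["-a 0", "-b 0.1"])
print(path.read_text(), end="")
```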

imallona commented 5 months ago

I like the "output them in a parameters.txt file inside the out/data/D1/process/P1/param_0" idea a lot, that would help browsing, indeed

imallona commented 5 months ago

Can you please add a get_module_repository() function (via the converter?) returning the YAML repo string for a given module_id? I'm afraid I don't find the converter logic intuitive and failed to do it myself (spent 10 min). Thanks
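The requested lookup might look like the sketch below. The benchmark YAML schema shown here is invented for illustration (the real schema and converter API are not shown in this thread); only the shape of the `get_module_repository` search is the point.

```python
# Hypothetical parsed benchmark YAML: field names are assumptions,
# chosen only to make the lookup concrete.
benchmark = {
    "stages": [
        {"id": "process",
         "modules": [
             {"id": "P1",
              "repository": "https://github.com/example/P1"},
             {"id": "P2",
              "repository": "https://github.com/example/P2"},
         ]},
    ]
}

def get_module_repository(benchmark, module_id):
    """Return the repository string of the module with the given id,
    searching across all stages; None if the id is unknown."""
    for stage in benchmark.get("stages", []):
        for module in stage.get("modules", []):
            if module.get("id") == module_id:
                return module.get("repository")
    return None

print(get_module_repository(benchmark, "P1"))
```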

DanInci commented 5 months ago

@imallona done

imallona commented 5 months ago

Three suggestions/requests:

One. Could we remove run folders and run_ids completely from the codebase? They're bound to break the "implicit" provenance, as discussed earlier this week. We discussed the advantages of keeping the "stage, module, params" triples (nested folders) as "benchmarking units", without run folders.

Two (linked to the run_id removal). Could you then modify the pre and post parsing so the post is always the most nested, deepest pre? This would distinguish (params removed for the sake of clarity) out/data/D1/process/P1/method/M1/M1.res from out/data/D1/method/M1/M1.res when some stages can be skipped.
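One possible reading of "the post is always the deepest pre" is a tie-break over candidate predecessor paths: when a stage can be skipped, several candidate chains exist, and the most nested one carries the fullest provenance. A minimal sketch (the `deepest_path` name is hypothetical):

```python
from pathlib import PurePosixPath

def deepest_path(candidates):
    """Among candidate 'pre' paths, pick the most nested one: the
    longest folder chain is the fullest provenance trail."""
    return max(candidates, key=lambda p: len(PurePosixPath(p).parts))

pre_candidates = [
    "out/data/D1/method/M1/M1.res",             # chain with 'process' skipped
    "out/data/D1/process/P1/method/M1/M1.res",  # full chain
]
print(deepest_path(pre_candidates))
```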

Three. Parameter combinations are hashed into param_0, param_1 folder names. Currently (f4bd14c7e9823393e61d741dcfe2ebff38a36d07) we are generating a parameters.txt (hash translation) file under each hashed param folder, e.g. at

cat out/data/D1/default/run/process/P2/param_1/run/parameters.txt
-a 1
-c 0.1

Hence we end up with as many parameters.txt files as there are hashes. I wonder whether you @DanInci and @almutlue would also find it helpful to generate an aggregated parameter file directly inside the module folder, e.g. at data/D1/default/run/process/P1/parameter_mapping.txt, reporting (tab-separated) all tested param_hash:meaning pairs:

folder_name parameter_meaning
param_0 "-d1 -e 1"
param_1 "-d1 -e 2"

I certainly would prefer the aggregated parameter_mapping.txt file directly inside the module folder.
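A minimal sketch of the aggregated file, assuming a hypothetical `write_parameter_mapping` helper and illustrative parameter strings (the real values would come from the benchmark's parameter matrix):

```python
import tempfile
from pathlib import Path

def write_parameter_mapping(module_dir, param_strings):
    """Hypothetical helper: collect all hash -> parameter translations
    into a single tab-separated parameter_mapping.txt at module level."""
    module_dir = Path(module_dir)
    module_dir.mkdir(parents=True, exist_ok=True)
    out = module_dir / "parameter_mapping.txt"
    lines = ["folder_name\tparameter_meaning"]
    for i, params in enumerate(param_strings):
        # Folder names follow the param_<index> convention from the thread
        lines.append(f'param_{i}\t"{params}"')
    out.write_text("\n".join(lines) + "\n")
    return out

# Illustrative parameter combinations only
base = Path(tempfile.mkdtemp())
path = write_parameter_mapping(base / "data/D1/default/run/process/P1",
                               ["-d 1 -e 1", "-d 1 -e 2"])
print(path.read_text(), end="")
```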

Thanks in advance!

DanInci commented 5 months ago

@imallona @almutlue All previous concerns, and the ones discussed on Friday should be addressed. How should we proceed with this PR? Leave it open until we get some test feedback?

DanInci commented 5 months ago

Found a corner case in Peiying's tests that breaks the implicit provenance rule.

To exemplify, let's say there are 4 stages named data, normalization, select_lognorm and dimred. They are all supposed to run successively, one after the other: data -> normalization -> select_lognorm -> dimred.

They output the following files:

And they require the following input files:

The problem is that successive stages require the initial dataset, while a later stage (dimred) requires inputs from both of the previous stages. In terms of provenance, two different paths are merging.

An easy solution would be to make benchmarkers require inputs from previous stages when they know those will be needed downstream, e.g. the required input for select_lognorm would be data, data.normalized.

Another solution would be to implement some lookahead functionality when compiling the Snakemake workflow and implicitly link the required inputs if they are needed in downstream stages.
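The lookahead idea could be sketched roughly as follows. This is not the implemented behavior (the edit below notes contributors are expected to handle it themselves); the stage specs and the `with_lookahead` helper are illustrative, built from the data/normalization/select_lognorm/dimred example above.

```python
def with_lookahead(stages):
    """Extend each stage's inputs with any file produced upstream that a
    later stage still needs, so every required input is threaded through
    the linear chain instead of merging two provenance paths."""
    produced, result = set(), {}
    for i, stage in enumerate(stages):
        future_needs = set()
        for later in stages[i + 1:]:
            future_needs |= set(later["inputs"])
        # Carry forward only what already exists upstream and is not
        # (re)produced by this stage itself
        carry = (future_needs & produced) - set(stage["outputs"])
        result[stage["name"]] = sorted(set(stage["inputs"]) | carry)
        produced |= set(stage["outputs"])
    return result

# Hypothetical stage specs mirroring the corner case described above
stages = [
    {"name": "data",           "inputs": [],                        "outputs": ["data"]},
    {"name": "normalization",  "inputs": ["data"],                  "outputs": ["data.normalized"]},
    {"name": "select_lognorm", "inputs": ["data.normalized"],       "outputs": ["data.selected"]},
    {"name": "dimred",         "inputs": ["data", "data.selected"], "outputs": ["data.dimred"]},
]
print(with_lookahead(stages))
```

Under these assumptions, select_lognorm ends up requiring both data and data.normalized, matching the manual fix suggested above.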

How should we deal with situations like this? EDIT: Not an actual problem. Contributors / benchmarkers are expected to deal with this themselves.

imallona commented 5 months ago

> All previous concerns, and the ones discussed on Friday should be addressed. How should we proceed with this PR? Leave it open until we get some test feedback?

I'd suggest merging to main already, @DanInci, and starting to clean up the old repos / discontinued omnibenchmark