Closed DanInci closed 5 months ago
Params are hashed now, e.g. `out/data/D1/process/P1/param_0`. Where can we see the actual parameter mapping (`["-a 0", "-b 0.1"]`?) to translate `param_0`? Shouldn't that be in a plain text file inside `out/data/D1/process/P1/`? @DanInci
@imallona I think that would be harder to implement. But now all parameters are managed by Snakemake and are accessible inside each rule. For instance:

```python
print('params are', snakemake.params)
```

An easy thing for browsability would be to output them in a `parameters.txt` file inside the `out/data/D1/process/P1/param_0` folder.
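As a sketch of that idea (the helper name and output layout are assumptions, not the actual implementation), a rule could serialize its resolved parameters into a `parameters.txt` next to its outputs; inside a real rule one would pass `snakemake.params` and the hashed folder path:

```python
import os

def dump_parameters(params, out_dir):
    """Write each resolved parameter on its own line in out_dir/parameters.txt,
    so a hashed folder like param_0 stays human-readable.

    `params` stands in for snakemake.params here; the helper itself only
    needs an iterable of strings."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "parameters.txt")
    with open(path, "w") as fh:
        for p in params:
            fh.write(f"{p}\n")
    return path
```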
I like the "output them in a `parameters.txt` file inside the `out/data/D1/process/P1/param_0` folder" idea a lot; that would help browsing, indeed.
Can you please add a `get_module_repository()` functionality (via the converter?) returning the YAML repo string for a given `module_id`? I'm afraid I don't find the converter logic intuitive and failed to do it myself (spent 10 min). Thanks
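A minimal sketch of what such a helper could look like; the nesting (stages containing modules, with `id` and `repository` keys) is an assumption about the benchmark YAML layout, not the actual converter schema:

```python
def get_module_repository(benchmark: dict, module_id: str) -> str:
    """Hypothetical lookup: return the repository string declared for
    module_id in a parsed benchmark YAML (assumed stages -> modules nesting)."""
    for stage in benchmark.get("stages", []):
        for module in stage.get("modules", []):
            if module.get("id") == module_id:
                return module["repository"]
    raise KeyError(f"unknown module_id: {module_id}")
```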
@imallona done
Three suggestions/requests:

One. Could we remove `run` folders and `run_id`s completely from the codebase? They're bound to break the "implicit" provenance, as discussed earlier this week. We discussed the advantages of keeping the "stage, module, params" triples (nested folders) as "benchmarking units", without run folders.
Two (linked to the `run_id` removal). Could you then modify the `pre` and `post` parsing so the `post` is always the most nested, deepest `pre`? To distinguish (params removed for the sake of clarity) `out/data/D1/process/P1/method/M1/M1.res` from `out/data/D1/method/M1/M1.res` when some stages can be skipped?
Three. Parameter combinations are hashed into `params_0`, `params_1` folder names. Currently (f4bd14c7e9823393e61d741dcfe2ebff38a36d07) we are generating a `parameters.txt` (hash translation) file under each hashed `param` folder, e.g. at:

```
$ cat out/data/D1/default/run/process/P2/param_1/run/parameters.txt
-a 1
-c 0.1
```

Hence we have as many `parameters.txt` files as there are hashes. I wonder whether you @DanInci and @almutlue would also consider it helpful to generate an aggregated parameter file just inside the `module` folder, e.g. at `data/D1/default/run/process/P1/parameter_mapping.txt`, reporting (tab-separated) all tested `param_hash:meaning` pairs:
```
folder_name	parameter_meaning
param_0	"-d1 -e 1"
param_1	"-d1 -e 2"
```
I certainly would prefer the aggregated `parameter_mapping.txt` file directly inside the `module` folder.

Thanks in advance!
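The aggregation could be sketched like this (the function name and the exact folder conventions are assumptions; it only relies on the per-hash `parameters.txt` files already being generated):

```python
import os

def write_parameter_mapping(module_dir: str) -> str:
    """Hypothetical sketch: collect every param_*/parameters.txt under a
    module folder into one tab-separated parameter_mapping.txt, mapping
    each hashed folder name to its human-readable flags."""
    mapping_path = os.path.join(module_dir, "parameter_mapping.txt")
    with open(mapping_path, "w") as out:
        out.write("folder_name\tparameter_meaning\n")
        for name in sorted(os.listdir(module_dir)):
            param_file = os.path.join(module_dir, name, "parameters.txt")
            if name.startswith("param_") and os.path.isfile(param_file):
                with open(param_file) as fh:
                    flags = " ".join(line.strip() for line in fh if line.strip())
                out.write(f'{name}\t"{flags}"\n')
    return mapping_path
```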
@imallona @almutlue All previous concerns, and the ones discussed on Friday should be addressed. How should we proceed with this PR? Leave it open until we get some test feedback?
Found a corner case in Peiying's tests that breaks the implicit provenance rule.
To exemplify, let's say there are four stages named `data`, `normalization`, `select_lognorm` and `dimred`. They are all supposed to be successive, one after the other: `data -> normalization -> select_lognorm -> dimred`.
They output the following files:

- `data`: data
- `normalization`: data.normalized
- `select_lognorm`: feature.selected
- `dimred`: pca.matrix

And they require the following input files:

- `normalization`: data
- `select_lognorm`: data
- `dimred`: feature.selected, data.normalized

The problem is that successive stages require the initial dataset, while a later stage (`dimred`) requires inputs from both previous stages. In terms of provenance, two different paths are merging.
An easy solution would be to make the benchmarker require inputs from previous stages when they know those will be needed in downstream stages; e.g. the required input for `select_lognorm` would be `data, data.normalized`.
Another solution would be to implement some lookahead functionality when compiling the Snakemake workflow and implicitly link the required inputs if they are needed in downstream stages.
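The lookahead idea could be sketched like this, using the four stages above (the data structures and function are illustrative, not the actual workflow compiler): each stage's effective inputs are everything it or any downstream stage needs, restricted to what upstream stages have already produced, so every required file is threaded through the single nested provenance path.

```python
# Declared stage order, outputs and inputs, as in the example above.
stages = ["data", "normalization", "select_lognorm", "dimred"]
outputs = {
    "data": {"data"},
    "normalization": {"data.normalized"},
    "select_lognorm": {"feature.selected"},
    "dimred": {"pca.matrix"},
}
inputs = {
    "normalization": {"data"},
    "select_lognorm": {"data"},
    "dimred": {"feature.selected", "data.normalized"},
}

def effective_inputs(stage: str) -> set:
    """Files the stage must receive so downstream requirements survive:
    the union of inputs of this and all later stages, intersected with
    the files produced by earlier stages."""
    i = stages.index(stage)
    needed = set().union(*(inputs.get(s, set()) for s in stages[i:]))
    available = set().union(*(outputs[s] for s in stages[:i]), set())
    return needed & available
```

With these declarations, `select_lognorm` is forced to carry both `data` and `data.normalized` forward, which resolves the merging-paths case.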
How should we deal with situations like this?

EDIT: Not an actual problem. Contributors / benchmarkers are expected to deal with this.
I'd suggest merging to main already, @DanInci, and starting to clean up old repos / the discontinued omnibenchmark.
Mandatory:
Nice to have: