Describe the bug
Running the L2G training step and logging the results to the W&B dashboard takes ~2 hours.
Observed behaviour
The changes in https://github.com/opentargets/gentropy/pull/544 removed some data caching steps to avoid memory issues.
This has had an impact on experiment logging. I ran L2G on the development machine (single node) and it took over 2h.
I didn't follow the process in the Spark UI, but I did notice:
Training took ~1h
After training was complete, evaluating the model triggered another training step because the process was not checkpointed.
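The recomputation described above can be illustrated with a toy sketch in plain Python. This is not gentropy or Spark code: `LazyPipeline`, `expensive_training`, `action`, and `persist` are hypothetical names used only to mimic a lazily evaluated plan, where every action re-runs the upstream work unless the result has been materialised first.

```python
class LazyPipeline:
    """Toy stand-in for a lazy execution plan (hypothetical, not a Spark API)."""

    def __init__(self, train_fn):
        self._train_fn = train_fn
        self._materialised = None  # holds the result once persisted
        self.train_calls = 0       # counts how often training actually runs

    def _compute(self):
        self.train_calls += 1
        return self._train_fn()

    def persist(self):
        # Materialise the result once so later actions can reuse it.
        self._materialised = self._compute()
        return self

    def action(self):
        # Without a materialised result, every action recomputes from scratch.
        if self._materialised is not None:
            return self._materialised
        return self._compute()


def expensive_training():
    # Placeholder for the ~1h training step.
    return {"weights": [0.1, 0.2]}


# No caching: "fit" and "evaluate" each trigger training.
uncached = LazyPipeline(expensive_training)
uncached.action()
uncached.action()

# Persisted: training runs once and evaluation reuses the result.
cached = LazyPipeline(expensive_training).persist()
cached.action()
cached.action()
```

In PySpark the analogous mechanisms would be `DataFrame.persist()`/`cache()` or `DataFrame.checkpoint()`. Since the caching was removed to avoid memory issues, a disk-backed option may be the safer candidate, but that is an assumption to verify against the actual step.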
Expected behaviour
If training takes ~30 minutes when we run the step from Airflow (and without model evaluation), the process with the evaluation part should take a similar amount of time.
To Reproduce
Steps to reproduce the behaviour:
Create the dev environment: make create-dev-cluster
Tweak the configuration in ot_locus_to_gene_train.yaml and set wandb_run_name
Run the step: gentropy --config-dir="/config" --config-name="ot_config.yaml" step=ot_locus_to_gene_train