pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

[Roadmap] Replace GraphGym with Snakemake workflow #8328

Open Sann5 opened 1 year ago

Sann5 commented 1 year ago

🛠 Proposed Refactor

GraphGym's implementation is unnecessarily complicated and poorly documented. I'd like to replace it with a Snakemake workflow. This will allow users more flexibility in training and comparing different models for different datasets while increasing transparency.

Additional benefits would include:

Related issues: #5132 #6475 #6464 #6416

Suggest a potential alternative/fix

Vision

The idea is to create a workflow that can be run like any other workflow from the Snakemake workflow catalog. The usage would therefore look something like this:

1. Create env

mamba create -c conda-forge -c bioconda --name pyg snakemake snakedeploy
conda activate pyg

2. Make a directory for the project

mkdir -p path/to/project-workdir
cd path/to/project-workdir
snakedeploy deploy-workflow https://github.com/pyg-team/pyg_model_selection . --tag v2.0.0

Snakedeploy will create two folders, workflow and config. The former contains the deployment of the chosen workflow as a Snakemake module; the latter contains configuration files, which you will modify in the next step to adapt the workflow to your needs.

3. Configure workflow

This is where you modify the configuration files to suit your specific use case.
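To make the configuration step more concrete, here is a minimal sketch of how a config listing datasets and models could be expanded into one run per combination. The config keys (datasets, models, hidden_channels), the helper names, and the results/ path layout are all assumptions for illustration, not the actual schema of the proposed workflow.

```python
from itertools import product

# Hypothetical configuration, e.g. as it might be loaded from config/config.yaml.
# The keys and values below are illustrative assumptions only.
config = {
    "datasets": ["Cora", "CiteSeer"],
    "models": ["GCN", "GAT"],
    "hidden_channels": [16, 64],
}


def expand_runs(config):
    """Return one dict per (dataset, model, hidden_channels) combination."""
    config_keys = ["datasets", "models", "hidden_channels"]
    run_keys = ["dataset", "model", "hidden_channels"]
    combos = product(*(config[key] for key in config_keys))
    return [dict(zip(run_keys, combo)) for combo in combos]


def output_path(run):
    """Build a hypothetical output file path that uniquely identifies a run."""
    return (
        f"results/{run['dataset']}/{run['model']}/"
        f"hidden_{run['hidden_channels']}/metrics.json"
    )


runs = expand_runs(config)
targets = [output_path(run) for run in runs]

for target in targets:
    print(target)
```

In a Snakefile, the built-in expand() helper plays a similar role: the workflow's target rule requests one output file per combination, and Snakemake works out which training jobs have to run to produce them.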

4. Run workflow

snakemake --cores all --use-conda 

5. Visualize results

Generate a report.

Roadmap:

akihironitta commented 11 months ago

I completely agree that GraphGym is poorly documented and may look complicated. I am unfamiliar with Snakemake and am not against it at all, but I just wanted to note that PyTorch Lightning already supports all the points mentioned above.

If this can be done completely independently of the PyG repo, then, if I were you, I'd work on it on my own for now and showcase it in the PyG discussions (e.g., #7935) :)

Sann5 commented 11 months ago

It can be done completely independently of the PyG repo, so let's do it that way. Because I'm rather busy at the moment, I probably won't have anything to show until the end of the year.

From quickly skimming the PyTorch Lightning documentation, it appears that you are right: PyTorch Lightning seems to offer functionality that facilitates integration with MLflow, YAML config files, and cluster execution. That said, and correct me if I'm wrong, it does not offer workflow management functionality. So what we could aim to create is a workflow that leverages the functionality offered by PyTorch Lightning but, in addition, allows users to:

I'm just not familiar with PyTorch Lightning, so it would take some learning. What I have implemented so far uses plain PyG, PyTorch, and Snakemake. I'm also open to using a different workflow management system instead of Snakemake, such as Nextflow, if there is a reason to do so.