madiapgar / diet_mouse_cdiff


Workflow code review #7

Open sterrettJD opened 2 months ago

sterrettJD commented 2 months ago

Hey Madi,

Nice work on this. Some things I'd recommend:

  1. When creating the output paths in your snakefile, I'd recommend using os.path.join() rather than + on the strings, as that would be more robust (especially on different systems).
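A quick sketch of the difference (the `DATASET_DIR` value and filename here are made up for illustration):

```python
import os

DATASET_DIR = "cecal_data"  # hypothetical value

# String concatenation silently produces a bad path if the separator is missing:
concat_path = DATASET_DIR + "stats/metab_linear_model.tsv"  # "cecal_datastats/..."

# os.path.join inserts the correct separator for the current OS:
joined_path = os.path.join(DATASET_DIR, "stats/metab_linear_model.tsv")
```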

  2. This is me trying to think of how you could avoid the global variables that take up >100 lines: you could wrap that path construction up in a function. For example, rather than doing this many times:

```python
metab_lm = DATASET_DIR + "stats/metab_linear_model.tsv"
metab_dunn = DATASET_DIR + "stats/metab_dunn_test.tsv"
metab_kruskal = DATASET_DIR + "stats/metab_kruskal_test.tsv"
```

you could instead use a list comprehension with that function, which would look like this:

```python
import os

def comb_filepaths(DATASET_DIR, extra_path):
    # very simple example of how this could work
    return os.path.join(DATASET_DIR, extra_path)

outputs = ["stats/metab_linear_model.tsv", "stats/metab_dunn_test.tsv", *others_read_in_from_file]
outputs_comb = [comb_filepaths(DATASET_DIR, filepath) for filepath in outputs]
```
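The `others_read_in_from_file` piece could be implemented by reading extra paths from a text file, for instance (a sketch; the filename `extra_outputs.txt`, its contents, and the `"cecal_data"` directory are made up):

```python
import os

def comb_filepaths(DATASET_DIR, extra_path):
    return os.path.join(DATASET_DIR, extra_path)

# Create a tiny example list file just for this demo.
with open("extra_outputs.txt", "w") as f:
    f.write("stats/metab_kruskal_test.tsv\n")

def read_output_paths(list_file):
    # One relative output path per line; ignore blank lines.
    with open(list_file) as f:
        return [line.strip() for line in f if line.strip()]

outputs = ["stats/metab_linear_model.tsv", "stats/metab_dunn_test.tsv"]
outputs += read_output_paths("extra_outputs.txt")
outputs_comb = [comb_filepaths("cecal_data", p) for p in outputs]
```

This keeps the snakefile short while letting the output list grow without touching code.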
  3. Low priority, but if you're thinking about generalizability: can you set up a parameter in your config file for the QIIME2 conda environment? That way the conda environment could be updated if need be. Also I know there are weird things with
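For example (a sketch; the key name `qiime2_env`, the env file path, and the rule name are assumptions, not what's currently in the repo):

```
# in config.yml:
#   qiime2_env: "envs/qiime2.yml"

# in the snakefile, rules reference the config value
# instead of a hard-coded environment path:
rule classify_taxonomy:
    conda:
        config["qiime2_env"]
```

Swapping QIIME2 versions would then only require editing the config file.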

  4. In envs/, could you give run.sh and run_alpine.sh more informative names, or create a README.md in that subdirectory explaining their use/function?

  5. Could you add an example in the main README.md file of all the steps someone needs to take to rerun your workflow? Try to walk them through it in fewer than 5 steps. For example: "download the data with this command, create your conda environments with this command, change anything needed in the config file, and run the workflow with this command". You could also put this in its own separate markdown file.
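Such a README section could look something like this (all commands and filenames below are placeholders; adjust to the actual repo layout):

```markdown
## Quickstart

1. Clone the repository: `git clone <repo-url> && cd diet_mouse_cdiff`
2. Create the conda environments: `conda env create -f envs/<env-file>.yml`
3. Edit `config.yml` to point at your data and environments.
4. Run the workflow: `snakemake --cores 4 --use-conda`
```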

  6. Right now it looks like things are set up exclusively for the cecal dataset. For the paper, I would recommend keeping an example config.yml as it is now, plus a directory that contains all of the config files used for the paper's analyses. Rather than needing to modify the config file, a user should be able to run the workflow with the existing config files to recreate the paper's results.
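Something along these lines (directory and file names are illustrative):

```
config.yml              # example config for new users
configs/
├── cecal_config.yml    # exact configs used for the paper's analyses
└── ...                 # one per dataset/analysis
```

A user could then reproduce a paper result without editing anything, e.g. `snakemake --configfile configs/cecal_config.yml --use-conda --cores 4`.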

I think what you have here is great, so really nice job. My recommendations are mostly focused on cleaning up the code a little, and making sure there are clearly explained steps to reproduce your analysis.

madiapgar commented 3 weeks ago

Thank you! As of right now, I've gone through and addressed bullet points 1-3 and part of 6 (I'm mostly doing this so I can keep track of what I've done so far).