populationgenomics / saige-tenk10k

Hail batch pipeline to run SAIGE on CPG's GCP

Conditional analysis newfile with in-job looping #167

Open MattWellie opened 3 days ago

MattWellie commented 3 days ago

Method ooh_its_a_common_conditional_analysis_loop tries to implement the common variant version of what you want to do:

  • creates a python job in the batch
  • for each gene-celltype combination it makes a new call inside that python job (analogous to creating one bash job and running multiple commands inside)
  • each python job call then (see the sketch after this list):
  1. creates the command string for the Step 2 analysis, and runs it using subprocess
  2. if that succeeds, it copies the result into GCP
  3. it also feeds that result into the Step 3 command
  4. if that succeeds, it copies the result into GCP
  5. I haven't nailed this bit yet... here it would read the result as a dataframe and decide whether it needs to add a new condition
    • if there are more significant SNPs, it adds a new entry to conditions, increments the round number, and restarts
    • if there are not, it takes the latest results and copies them into GCP as <path>..._final_conditional_results
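
Roughly, as a sketch only - the R script names, flags, bucket path, p-value column and significance threshold below are placeholders I've made up here, not whatever the branch actually uses:

```python
import subprocess

import pandas as pd

# Everything named here is a placeholder: the real script names, flags, bucket
# paths, column names and threshold live in the repo, not in this sketch.
SIGNIFICANCE_CUTOFF = 5e-8   # assumed significance threshold
MAX_ROUNDS = 10              # safety valve so the loop can't run forever
BUCKET = 'gs://my-bucket'    # placeholder output location


def run_conditional_loop(gene, celltype, output_prefix, conditions=None):
    """Run SAIGE Step 2 then Step 3 for one gene/celltype, extending the
    condition list and re-running until no new significant SNPs remain."""
    conditions = list(conditions or [])
    round_number = 1

    while round_number <= MAX_ROUNDS:
        step2_out = f'{output_prefix}_round{round_number}_step2.txt'
        step3_out = f'{output_prefix}_round{round_number}_step3.txt'

        # 1. build the Step 2 command and run it with subprocess
        step2_cmd = ['Rscript', 'step2_tests_qtl.R', '--gene', gene, '--celltype', celltype, '--output', step2_out]
        if conditions:
            step2_cmd += ['--condition', ','.join(conditions)]
        subprocess.run(step2_cmd, check=True)

        # 2. if that succeeds, copy the result into GCP (gsutil as a stand-in copy mechanism)
        subprocess.run(['gsutil', 'cp', step2_out, f'{BUCKET}/{step2_out}'], check=True)

        # 3. feed that result into the Step 3 command
        subprocess.run(['Rscript', 'step3_gene_pvalue_qtl.R', '--input', step2_out, '--output', step3_out], check=True)

        # 4. if that succeeds, copy the Step 3 result into GCP
        subprocess.run(['gsutil', 'cp', step3_out, f'{BUCKET}/{step3_out}'], check=True)

        # 5. read the (Step 2 single-variant) results and decide whether to go again;
        #    'p.value' and 'MarkerID' are assumed column names
        results = pd.read_csv(step2_out, sep='\t')
        significant = results.loc[results['p.value'] < SIGNIFICANCE_CUTOFF, 'MarkerID']
        new_snps = [snp for snp in significant if snp not in conditions]

        if not new_snps:
            # no new signal: promote the latest results as the final conditional output
            subprocess.run(
                ['gsutil', 'cp', step3_out, f'{BUCKET}/{output_prefix}_final_conditional_results.txt'],
                check=True,
            )
            return

        # otherwise add a new entry to conditions, increment the round number, and restart
        conditions.append(new_snps[0])
        round_number += 1
```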

I've assumed that you want to start the analysis unconditioned? Either way, the python_job call will accept a list of conditions, or None.
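
For the batch-level wiring, something like this sketch (the job name, gene/celltype pairs and output prefixes are all made-up placeholders):

```python
import hailtop.batch as hb

b = hb.Batch(name='saige conditional analysis')

# one python job, with one call per gene-celltype combination inside it
loop_job = b.new_python_job(name='conditional_analysis_loop')

for gene, celltype in [('GeneA', 'CD4_T'), ('GeneB', 'B_naive')]:   # placeholder combinations
    loop_job.call(
        run_conditional_loop,            # the looping function sketched above
        gene=gene,
        celltype=celltype,
        output_prefix=f'{celltype}_{gene}',
        conditions=None,                 # or a list of SNP IDs to start conditioned
    )

b.run()
```

AFAIK repeated .call()s on the same python job run sequentially inside that one job, which is the python-job analogue of appending commands to a single bash job.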

You might want to change all the naming conventions. IDK.

I've no idea if this model works - it's what I was thinking of, and AFAIK this is the only way to run two separate R scripts, then make a decision in code about whether to re-run them an unknown number of times.


I've left your original methods for Step 2 and 3, just commented out. I'm hoping that the one method contains all that functionality.

annacuomo commented 3 days ago


I haven't looked at the code yet, but just a couple of comments: