populationgenomics / saige-tenk10k

Hail batch pipeline to run SAIGE on CPG's GCP

Conditional analysis newfile with in-job looping #167

Open MattWellie opened 3 days ago

MattWellie commented 3 days ago

Method ooh_its_a_common_conditional_analysis_loop tries to implement the common variant version of what you want to do:

  • creates a python job in the batch
  • for each gene-celltype combination it makes a new call inside that python job (analogous to creating one bash job and running multiple commands inside)
  • each python job call then (see the sketch after this list):
  1. creates the command string for the Step 2 analysis, and runs it using subprocess
  2. if that succeeds, it copies the result into GCP
  3. it also feeds that result into the Step 3 command
  4. if that succeeds, it copies the result into GCP
  5. I haven't nailed this bit yet... here it would read the result as a dataframe and decide whether it needs to add a new condition
    • if there are more significant SNPs, it adds a new entry to conditions, increments the round number, and restarts
    • if there are not, it takes the latest results and copies them into GCP as <path>..._final_conditional_results
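
Roughly, as a sketch only - the R script names, flags, bucket path, p-value column and significance threshold below are placeholders I've made up here, not whatever the branch actually uses:

```python
import subprocess

import pandas as pd

# Everything named here is a placeholder: the real script names, flags, bucket
# paths, column names and threshold live in the repo, not in this sketch.
SIGNIFICANCE_CUTOFF = 5e-8   # assumed significance threshold
MAX_ROUNDS = 10              # safety valve so the loop can't run forever
BUCKET = 'gs://my-bucket'    # placeholder output location


def run_conditional_loop(gene, celltype, output_prefix, conditions=None):
    """Run SAIGE Step 2 then Step 3 for one gene/celltype, extending the
    condition list and re-running until no new significant SNPs remain."""
    conditions = list(conditions or [])
    round_number = 1

    while round_number <= MAX_ROUNDS:
        step2_out = f'{output_prefix}_round{round_number}_step2.txt'
        step3_out = f'{output_prefix}_round{round_number}_step3.txt'

        # 1. build the Step 2 command and run it with subprocess
        step2_cmd = ['Rscript', 'step2_tests_qtl.R', '--gene', gene, '--celltype', celltype, '--output', step2_out]
        if conditions:
            step2_cmd += ['--condition', ','.join(conditions)]
        subprocess.run(step2_cmd, check=True)

        # 2. if that succeeds, copy the result into GCP (gsutil as a stand-in copy mechanism)
        subprocess.run(['gsutil', 'cp', step2_out, f'{BUCKET}/{step2_out}'], check=True)

        # 3. feed that result into the Step 3 command
        subprocess.run(['Rscript', 'step3_gene_pvalue_qtl.R', '--input', step2_out, '--output', step3_out], check=True)

        # 4. if that succeeds, copy the Step 3 result into GCP
        subprocess.run(['gsutil', 'cp', step3_out, f'{BUCKET}/{step3_out}'], check=True)

        # 5. read the (Step 2 single-variant) results and decide whether to go again;
        #    'p.value' and 'MarkerID' are assumed column names
        results = pd.read_csv(step2_out, sep='\t')
        significant = results.loc[results['p.value'] < SIGNIFICANCE_CUTOFF, 'MarkerID']
        new_snps = [snp for snp in significant if snp not in conditions]

        if not new_snps:
            # no new signal: promote the latest results as the final conditional output
            subprocess.run(
                ['gsutil', 'cp', step3_out, f'{BUCKET}/{output_prefix}_final_conditional_results.txt'],
                check=True,
            )
            return

        # otherwise add a new entry to conditions, increment the round number, and restart
        conditions.append(new_snps[0])
        round_number += 1
```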

I've assumed that you want to start the analysis unconditioned? Either way, the python_job call will accept a list of conditions, or None.
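
For the batch-level wiring, something like this sketch (the job name, gene/celltype pairs and output prefixes are all made-up placeholders):

```python
import hailtop.batch as hb

b = hb.Batch(name='saige conditional analysis')

# one python job, with one call per gene-celltype combination inside it
loop_job = b.new_python_job(name='conditional_analysis_loop')

for gene, celltype in [('GeneA', 'CD4_T'), ('GeneB', 'B_naive')]:   # placeholder combinations
    loop_job.call(
        run_conditional_loop,            # the looping function sketched above
        gene=gene,
        celltype=celltype,
        output_prefix=f'{celltype}_{gene}',
        conditions=None,                 # or a list of SNP IDs to start conditioned
    )

b.run()
```

AFAIK repeated .call()s on the same python job run sequentially inside that one job, which is the python-job analogue of appending commands to a single bash job.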

You might want to change all the naming conventions. IDK.

I've no idea if this model works - it's what I was thinking of, and AFAIK this is the only way to run two separate R scripts, then make a decision in code about whether to re-run them an unknown number of times.


I've left your original methods for Step 2 and 3, just commented out. I'm hoping that the one method contains all that functionality.

annacuomo commented 3 days ago


I haven't looked at the code yet, but just a couple of comments: