Thoughts on the following directory structure:
```
APAEVAL
|-- EVENT (Identification_01)
|   |-- TOOL (tool1)
|   |   |-- FILE1 (tool1_challengecode1_01.ext)
|   |   |-- FILE2 (tool1_challengecode2_01.ext)
|   |   ...
|   |-- TOOL (tool2)
|   |   |-- FILE1 (tool2_challengecode1_01.ext)
|   |   ...
|-- EVENT (AbsQuant_02)
|   |-- TOOL (tool1)
|   |   |-- FILE1 (tool1_challengecode1_02.ext)
|   |   |-- FILE2 (tool1_challengecode2_02.ext)
|   |   ...
|   |-- TOOL (tool6)
|   |   |-- FILE1 (tool6_challengecode1_02.ext)
|   |   ...
```
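To make the nesting concrete, here is a minimal Python sketch that builds this layout; all event, tool, and challenge names below are hypothetical placeholders, not the actual codes:

```python
from pathlib import Path

# Sketch of the proposed EVENT / TOOL / FILE nesting.
# All names below are hypothetical placeholders.
base = Path("APAEVAL")
events = ["Identification_01", "AbsQuant_02"]
tools = ["tool1", "tool2"]
challenges = ["challengecode1", "challengecode2"]

for event in events:
    event_suffix = event.split("_")[-1]  # e.g. "01"
    for tool in tools:
        for challenge in challenges:
            out = base / event / tool / f"{tool}_{challenge}_{event_suffix}.ext"
            out.parent.mkdir(parents=True, exist_ok=True)
            print(out)  # e.g. APAEVAL/Identification_01/tool1/tool1_challengecode1_01.ext
```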
@mrgazzara what do you think? This is pretty much what we have in our specifications, except that we would group by event instead of challenge on a higher level, and have the challengecode only in the filenames.
The reason I'm suggesting this order is that we have one SWF per event. For each run we then have to specify tool name, input file, and challenge id (which has to be the same as the ground truth file name without extension). If we solve #230, we can specify a list of challenges within one run (still per tool, that's why TOOL is my second level directory).
If this structure is not feasible for you, I guess any permutation of event-tool-challenge would do for the directory nesting, as long as it is consistent across all events, tools, and challenges.
It would also be good to name the ground truth files as `CHALLENGECODE_GENOME.bed`. The `GENOME` part is needed in the filename of the ground truth in order to select the correct annotation to work with. We just need @dominikburri to confirm that `_` works, or whether it has to be a `.` separator. If a ground truth file is used in multiple challenges, we could maybe use a softlink solution to nevertheless have distinct filenames with all distinct challenge codes?
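Just to illustrate the separator question, a small sketch (the file name is a made-up example, not a real challenge code):

```python
from pathlib import Path

# Hypothetical ground truth file named CHALLENGECODE_GENOME.bed.
# The challenge code itself contains underscores, so the genome has to be
# taken as the last "_"-separated field.
gt_file = Path("dataset_condition_1_GRCh38.bed")

challenge_code, genome = gt_file.stem.rsplit("_", 1)
print(challenge_code)  # dataset_condition_1
print(genome)          # GRCh38

# With a "." separator (CHALLENGECODE.GENOME.bed) the genome would simply be
# the second-to-last "."-separated field, independent of underscores.
```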
I believe with this rigorous naming scheme we should be set up nicely for automation.
Possibility for downloading from GDrive: https://github.com/wkentaro/gdown (first answer here worked for me).
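A minimal sketch of how the download could look with gdown (assuming a reasonably recent gdown version that provides `download_folder`; the output directory is an arbitrary example):

```python
import gdown

# The folder URL is the APAeval results folder linked further down in this
# thread; the output directory name is just an example.
url = "https://drive.google.com/drive/folders/1kc2YjN2lljKDw-DR0TyHCZHctLMUm-KH"
gdown.download_folder(url, output="apaeval_results", quiet=False)
```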
@ninsch3000 Thanks for the nice suggestions here. I think your suggestions for the file names and directory structure make a lot of sense. I will have to try out the link for downloading from GDrive, but we can also host the resulting bed files at https://majiq.biociphers.org/data/apaeval.
For the challengecode names, I am thinking of just keeping them as the suggested sample file names, as we mentioned in the meeting, to make them more human readable / interpretable (e.g. those listed here: https://docs.google.com/spreadsheets/d/1iZ-4RknIfLsfBJ8ank6w9JrkeVq0PEvoeLbtKUwghmA/edit#gid=899136187&range=B1:B55). These contain underscores that separate the same 3 fields (dataset/publication, experimental condition, replicate). Would you like them to be something else to make parsing easier?
The last thing I was thinking about with this naming structure is how to potentially group / name things that amount to a different configuration of the same tool. That's not a big issue now, but it could be. Say, for example, we run PAQR with replicates grouped together in the same experimental condition, or we run QAPA with their prebuilt and filtered annotation vs. a new build using a different annotation. I think it makes the most sense to have a variation of the TOOL names above to handle these variations, but we can decide that later if/when we decide to test different configurations of the same tools.
I will work on renaming everything and putting it in the directory structure outlined above!
Hey @mrgazzara, sample file names as challenge codes are in principle fine with me, but what do you do for the challenges that consist of multiple samples? Currently there are none, I guess, but originally you had planned to do some, didn't you?
As for the parsing, good point: maybe we should have `TOOL.CHALLCODE.EVENTCODE.ext`, with `CHALLCODE` being `DATASET_CONDITION_REPLICATE`. We can then split on `.`, which also nicely fits in with the `GENOME` mentioned above (with a `.` split), and you can still figure out the group cases within your `_`-separated part later, if necessary.
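A quick parsing sketch for that scheme (the file name and field values are hypothetical):

```python
# Proposed scheme: TOOL.CHALLCODE.EVENTCODE.ext
fname = "tool1.dataset_condition_1.AbsQuant_02.bed"

tool, challcode, eventcode, ext = fname.split(".")
dataset, condition, replicate = challcode.split("_")

print(tool)       # tool1
print(challcode)  # dataset_condition_1
print(eventcode)  # AbsQuant_02
```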
Different configurations, I agree, should be encoded in the tool name. :+1:
No valid ideas for solving #230 yet. Ideally we could specify an input directory instead of an input file in the nextflow config. Hoping to get help from @abredondo.
Yes, we had considered running groups of samples. I think this can be handled in the 'replicate' field of the challenge code, where it's called "group" or something similar. Matching group-level ground truth files would also have to be made, so they can be run through the SWFs in the exact same way if/when we do these comparisons.
Of course we’ll make / update a Gsheet to keep track of every challenge code ;)
@ninsch3000 The results we have so far have all been renamed and given the suggested directory structure above in the Barash lab GDrive for APAeval data here: https://drive.google.com/drive/folders/1kc2YjN2lljKDw-DR0TyHCZHctLMUm-KH?usp=sharing
I will work on re-doing the links to point to this new structure on the EWF progress GSheet over the next day.
Currently, calculating the benchmarking metrics requires one summary workflow run per participant per challenge. For each of those runs, the nextflow config has to be adapted for input/output file names and the challenge name, and the individual input files have to be copied manually into the summary workflow directory structure.
This procedure is not feasible for (n participants) * (m challenges) = (many many) runs.
Figure out concepts for (partial) automation.
Also revisit #230. Solving #400 might help as well.
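One possible direction for the (partial) automation, as a rough sketch: generate one config per participant/challenge pair instead of editing the nextflow config by hand. All paths, parameter names, and the config format below are assumptions, not the actual summary workflow interface:

```python
from itertools import product
from pathlib import Path

# Hypothetical participants, challenges, and event; not the real codes.
participants = ["tool1", "tool2"]
challenges = ["challengecode1", "challengecode2"]
event = "AbsQuant_02"

for tool, challenge in product(participants, challenges):
    # Hypothetical parameter names; the real SWF config keys may differ.
    params = {
        "input": f"APAEVAL/{event}/{tool}/{tool}_{challenge}_02.ext",
        "challenge_id": challenge,
        "participant": tool,
        "outdir": f"results/{event}/{tool}/{challenge}",
    }
    cfg = Path("configs") / f"{event}.{tool}.{challenge}.config"
    cfg.parent.mkdir(parents=True, exist_ok=True)
    cfg.write_text("".join(f"params.{k} = '{v}'\n" for k, v in params.items()))
```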