pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License
20 stars 7 forks source link

Rethinking looper testing #464

Closed nsheff closed 3 weeks ago

nsheff commented 4 months ago

We seem to have lots of pytests for looper, and even a 'smoketests' subfolder.

What is the distinction between the regular tests and the smoketests? at first glance, they just look like other unit tests.

Anyway, it's clear to me after tying to get looper to work today that our tests are not effective for actually testing looper functionality. I ran into so many little issues that it's clear we're not actually testing the way looper is actually used.

I think we need to rethink the way we are testing this package. I think we should redesign tests around some kind of more holistic use cases, and maybe step away from testing small modular units. It just doesn't seem to be working.

A possible test would be: a few basic dummy pipelines that execute something. Run these on some demo datasets, with some expected failures, some re-runs, etc. Basically, the kind of use that a person would actually use looper to do.

nsheff commented 4 months ago

One idea would be to have a test that clones the hello_looper repository and then runs these example pipelines.

Or something like that. I think we did this for peppy, having it use the example_peps repo.

donaldcampbelljr commented 4 months ago

Ok, I've started a branch to work on this. I have a skeleton to have two large tests for using looper with and without pipestat integration. I'll look at the peppy example and the pypiper unit test for inspiration as I continue.

donaldcampbelljr commented 4 months ago

I have a proof of concept going where I can clone the hello_looper repository into a tempdir and execute the pipestat configured run successfully.

donaldcampbelljr commented 4 months ago

One issue I've run into when using Looper with Pipestat is that I would clone the hello_looper repo and run the command: looper run --looper-config /tmp/tmp5h3_if__/pipestat/.looper_pipestat_shell.yaml

Pytest is running the command from a directory that is not the pipestat folder. I can manually reproduce this.

Issue: Pointing to the looper config file works for everything except finding the samples. Looper complains it cannot find:

wc: data/frog2_data.txt: No such file or directory

This is because the samples.csv file tells looper to look in the data folder relative to the location of the .looper.yaml file.

One option is to change the pipeline interface such that (see asterisks below):

command_template: >
  {looper.piface_dir}/count_lines_pipestat.sh **{sample.file}** {sample.sample_name} {pipestat.config_file}

becomes something like:

command_template: >
  {looper.piface_dir}/count_lines_pipestat.sh **{looper.data_directory}{sample.file}** {sample.sample_name} {pipestat.config_file}

In reality I know the data directory is relative to to the config file, so I could just use the config file directory when writing the interface. However, I don't see this as part of the looper namespace that is built for this purpose.

It seems like the proper solution is that Looper should get the absolute path to the data file which is does not. Here is an example submission script that exhibits the issue.

{
/tmp/tmp5h3_if__/pipestat/./pipeline_pipestat/count_lines_pipestat.sh 
data/frog1_data.txt 
frog_1 
/tmp/tmp5h3_if__/pipestat/looper_pipestat_config.yaml 
} | tee /tmp/tmp5h3_if__/pipestat/./results/submission/example_pipestat_pipeline_frog_1.log
donaldcampbelljr commented 4 months ago

Official documentation on PEP suggests derived columns to eliminate paths from the sample table: https://pep.databio.org/spec/howto-eliminate-paths/

So my above solution is probably not actually what we want.

donaldcampbelljr commented 4 months ago

Ok, I refactored the pipestat hello_looper example (branch dev_derive) to use derived attributes: https://github.com/pepkit/hello_looper/commit/0927058cca0e779294f2844a20971aa61d3171b3

Then, in looper pytest, I open the project config and replace the source1 path with the new temp_dir path: https://github.com/pepkit/looper/commit/d4cfe2362ea2244382f8201aae342acf16b80651

This (sort of) simulates the user using an environment variable to point to the data.

This now appears to be working in pytest as desired AND the hello_looper examples works exactly the same as before from a user perspective.

donaldcampbelljr commented 3 months ago
donaldcampbelljr commented 3 months ago

Refactoring of all tests is now complete and PEPs pull from the hello-looper dev branch. I've also added comprehensive tests, but, as discussed, these will continue to be a WIP as we add complexity to these comprehensive tests.