rachelss / SISRS

Site Identification from Short Read Sequences.

Add unit testing/continuous integration #37

Open anderspitman opened 6 years ago

anderspitman commented 6 years ago

It would be good to implement at least some basic testing and CI. The first step would be to come up with a standard small dataset to run against.

BobLiterman commented 6 years ago

Some considerations with respect to the sample dataset, an idea which I wholly agree with:

1) As you stated, small genomes would be great. I'm currently in hour 60 of a primate analysis, so we should be looking at things that are orders of magnitude smaller.

2) Ideally, at least one taxon from the dataset should have a well-annotated genome to facilitate downstream analyses.

3) Data should be a mix of SE and PE reads in order to properly test different configurations. If SE reads are hard to come by, this could of course be accomplished artificially by partially splitting up a PE dataset.

anderspitman commented 6 years ago

@BobLiterman do you have (or would possibly be able to create) any small datasets that would be appropriate for unit testing (< 5 seconds to run) and/or integration testing (< 5 minutes)?

BobLiterman commented 6 years ago

Hi @anderspitman, I wonder if the best way to go with that would be to use a preassembled bacterial genome dataset and a small number of real reads? That way, reads are mapping to something with the biological complexity of a real dataset, but at a size that is fast to run. Using the premade flag, one can skip the redundant step of re-assembling the genome each time a test is run, cutting down on time. Even with genome assembly happening each time, I would anticipate that bacterial datasets would run relatively quickly...

What about using 5-6 strains of E. coli for the composite genome, and then for reads, you could just put a small number in each taxon folder? 10,000? 1,000,000? If the goal is just to benchmark and unit test, I don't see the harm in going to the low end, since you just want the reads to map and sites to be called.

Thoughts all?

anderspitman commented 6 years ago

Unfortunately my input may not be very useful here. I know how assembly and alignment work, but barely understand what makes SISRS unique.

That said, using bacterial data seems like a good idea. Also, using premade would be a great way to start. Since my primary goal for the testing stuff is to provide a baseline for a Python port, I'm planning to start with the minimum amount of functionality. This will probably look like doing as much of the pipeline as possible with the old script, and only the final stages with the new code. As I verify each part of the pipeline I can work my way backwards, adding more test data and earlier pipeline stages as I go.

Does that make sense?

BobLiterman commented 6 years ago

It makes sense to me. Keep in mind I have added options to SISRS to run specific subroutines alone. That way you could make benchmark datasets for each stage in the pipeline and save them, so that you can run any step you want in a standalone fashion.

You'd basically have a raw data folder, and then you would do a SISRS run calling each function, starting from the top and moving down. Each output can be set to go to a new directory, which can act as the starting point for the next function, e.g.:

Raw Data --> Run Genome Assembly into Dir1 (-c 0 option set so SISRS doesn't continue past Step 1)
Running from Dir1 --> Run Step 2 into Dir2
Running from Dir2 --> Run Step 3 into Dir3
etc.
Running from Dir(n-1) --> Run final step into DirN
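
Roughly, in shell terms (a sketch only; the stage names after alignContigs are placeholders for the real subcommands, and the flags follow the commands further down in this thread):

```bash
# Sketch of generating the staged reference directories: -f is the input
# directory, -z the output directory, and -c 0 stops SISRS after the
# current step instead of continuing down the pipeline.
sisrs alignContigs -p 4 -a premade -c 0 -f 0_RawData_PremadeGenome -z 1_alignContigs
sisrs stepTwo   -p 4 -c 0 -f 1_alignContigs -z 2_stepTwo     # placeholder name
sisrs stepThree -p 4 -c 0 -f 2_stepTwo      -z 3_stepThree   # placeholder name
# ... and so on, until the final step writes into DirN.
```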

Then, as you port bash to Python from the bottom up, you can use the directory containing the previous step's pipeline output as your starting data, e.g.:

1) Convert the last function to Python
2) Call the last function of SISRS using Dir(n-1) as starting data
3) If it works, convert the next function up in the pipeline to Python
4) Call the second-to-last SISRS function (-c 1 so it runs through to the end) using the appropriate data
etc.

That's a lot, and it would be really easy for me to describe over Skype or some such thing if it's not clear.

BobLiterman commented 6 years ago

If you can give me a few days to get this grant done, I could probably create the data directories relatively quickly, and then you could just clone them all to have a clean version.

anderspitman commented 6 years ago

@BobLiterman, I'm pretty sure I understand, and I really like this idea! From a testing point of view, there are several advantages. For one, when making changes to a specific part of the pipeline, the developer only has to worry about running unit tests for that stage, which will be a lot faster than running the entire pipeline. Entire pipeline tests would only need to be run when submitting a PR.

Also, when writing tests, you typically have your expected data and your observed data, and you check to verify they are the same. With your proposal the expected data would always be available as the next stage of the pipeline, with the only exception being the final stage.
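
Concretely, a per-stage check could look something like this sketch (the directory layout and the compare_bams.sh helper are hypothetical):

```bash
#!/usr/bin/env bash
# Sketch: re-run one stage from the saved input directory and compare
# the observed BAMs against the saved reference (expected) output.
set -euo pipefail

STAGE_IN=0_RawData_PremadeGenome   # saved input for this stage
STAGE_REF=1_alignContigs           # saved expected output
OBS=$(mktemp -d)                   # observed output goes here

sisrs alignContigs -p 1 -a premade -c 0 -f "$STAGE_IN" -z "$OBS"

# compare_bams.sh is a hypothetical helper that checks two BAMs for
# semantic equality and exits non-zero on mismatch; with set -e, any
# mismatch aborts the test run.
find "$STAGE_REF" -name '*.bam' | while read -r ref; do
    ./compare_bams.sh "$ref" "$OBS/${ref#"$STAGE_REF/"}"
done
echo "Stage output matches reference"
```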

That would be awesome if you can get the test data put together, separated by directories, with a different directory for each pipeline stage. Just let me know when you're able to do it. I should be able to get quite a bit of the architecture in place to support this.

Also let me know if I seem to be misunderstanding what you're suggesting.

BobLiterman commented 6 years ago

I think we're on the same page. I'll look into this more next week.

anderspitman commented 6 years ago

@BobLiterman one other thing we need to keep in mind is that we're going to want to have the example data use as many features as possible, in order to test most of the code paths. That may necessitate multiple versions of the same pipeline stage for runs with different options.

BobLiterman commented 6 years ago

I gave this a first crack, with some caveats.

https://github.com/BobLiterman/SISRS_Small_Data

Basics:

1) I took 49 contigs from a primate SISRS run. (Damn header row)
2) Extracted associated reads for 8 species to recreate raw data
3) Created some special-case data (2 individuals in a folder, all SE reads, SE + PE reads)
4) Ran SISRS step-by-step, outputting to a new folder each time.

Notes:

1) I have not used the reference genome option in my personal work, and don't have a great sense of that side of the program. I didn't run it here because I don't know what I would run it against. Rachel mentioned that it's possible we may ditch it, but we can talk about that if/when need be.

2) I didn't do anything with the loci half of SISRS yet. I figured we can tackle sites, and then move on from there.

Ran fast for me with 20 CPUs. Let me know if it suits your needs.

anderspitman commented 6 years ago

This looks great @BobLiterman! I'll start playing with it and let you know if I have any feedback. Sounds like reference and loci aren't used as much so ignoring them for now is just fine by me. The python port I'm working on will be a great opportunity to drop features we don't need. I'm starting to guess we might actually just have the Python version become SISRS 2.0. That will make it easier for people to continue using any legacy features of 1.x while allowing development to move forward.

anderspitman commented 6 years ago

Hey @BobLiterman, a couple questions.

Is it necessary to run against all 8 species, or can we get meaningful results from just one or two? I see the SISRS documentation says a site only requires 2 species to be included in the final alignment, but I'm not sure whether the species you used mostly cover the same sites.

I'm just trying to minimize the amount of processing needed for each test run. Keep in mind Travis only provides 1 CPU as far as I know, and I believe the run time is limited to 5 minutes or so.

BobLiterman commented 6 years ago

Well, in order to identify biallelic sites you need at least 4 taxa. So in that sense you could reduce it that far and still get all the output. That would be as simple as just removing four taxa and re-running the sub-steps, which I could do in a few minutes.

anderspitman commented 6 years ago

Sweet. I don't think it's necessary to do just yet. I should be able to do all the plumbing with what we have now. But we'll leave that as an optimization in case we run into trouble meeting Travis's constraints.

anderspitman commented 6 years ago

What are some good tools for comparing BAM files to determine if they're semantically the same? I.e., I need to make sure the data I'm generating matches the output in your repo for each stage of the pipeline.

anderspitman commented 6 years ago

Another question: For the commands listed in the Commands file, what directory did you run each of them from? I'm guessing from within 0*, 1*, etc?

BobLiterman commented 6 years ago

Make raw data (d0). Copy the raw folder to dir1 and run command 1 in dir1. Copy the new dir1 to dir2 and run command 2 in dir2, etc.

rachelss commented 6 years ago

Comparing BAMs: keep in mind that mapping won't be identical every time, so the files might not be the same, and that might be OK.

http://seqanswers.com/forums/showthread.php?t=17644 http://deeptools.readthedocs.io/en/latest/content/tools/bamCompare.html

anderspitman commented 6 years ago

Thanks @rachelss. We'll definitely need deterministic behavior at each pipeline stage in order to properly do CI. That might not be too difficult; bowtie2, for example, appears to support this (see this page).

I'll look into this more. Off the top of your heads, what sources of non-determinism should I be looking out for? Basically, anywhere a random number is used, we'll need a way to set the seed consistently when running tests.
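
For bowtie2, that might look like pinning --seed (a sketch; --seed is documented in the bowtie2 manual, and the paths and other flags mirror the commands later in this thread):

```bash
# Pin the PRNG seed so repeated runs break ties between equally good
# alignments the same way every time (--seed defaults to 0).
bowtie2 --seed 42 -p 1 -N 1 --local -x ./contigs \
    -U GorGor_A_R1.fastq,GorGor_A_R2.fastq > GorGor.sam
```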

anderspitman commented 6 years ago

FYI, the bamUtil diff command seems to do exactly what we need for comparing BAM files. I've tested it a bit, and even if the files don't match at the binary level (i.e., using diff or cmp), bamUtil does the job.
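
The invocation is simple; something like this (a sketch based on bamUtil's documented interface, with illustrative file names):

```bash
# bamUtil's diff subcommand reports the records that differ between two
# BAMs, rather than requiring byte-identical files.
bam diff --in1 run1/GorGor.bam --in2 run2/GorGor.bam
```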

anderspitman commented 6 years ago

@rachelss @BobLiterman I'm noticing some strange behavior. When running the command

sisrs alignContigs -p X -a premade -c 0 -f 0_RawData_PremadeGenome -z output

With different values of X for number of processors, I'm getting different output. Specifically, I noticed that the number of reported reads for GorGor (first line after "==== Aligning GorGor as Single-Ended ====") is different for lower numbers of CPUs. Here's a table of my results:

| Num Processors | Reported # Reads |
| --- | --- |
| 1 | 14690 |
| 2 | 29380 |
| 3 | 46012 |
| >= 4 | 62644 |

It appears to saturate at 4+ CPUs. I don't really know what, if anything, this indicates. Any ideas?

anderspitman commented 6 years ago

Also noticed 29380 is 2*14690. The other values increase on the same order but are not exact multiples.

anderspitman commented 6 years ago

@BobLiterman additionally, when running alignContigs I haven't been able to get GorGor.bam to match what's in 0_RawData_PremadeGenome, even if I used -p 20 as you did. The output from SISRS is also different, though it does report 62644 reads. Are you sure the commands you ran match what's in the Commands file?

I am able to get identical data between 2 runs on my machine.

Thanks!

anderspitman commented 6 years ago

**Haven't been able to get it to match what's in 1_alignContigs

BobLiterman commented 6 years ago

Definitely those are the commands. I don't run anything species-specific. Everything was run in batch with the commands as given. I ran the following command using the same data:

sisrs alignContigs -p 5 -a premade -c 0 -f DIR/0_RawData_PremadeGenome/ -z DIR/1_alignContigs_5CPU &> 1_alignContigs_5CPU_Log

I cannot recreate your processor-dependent mapping issue, as my Bowtie logs are identical (see attached file)

bowtieprocessors

BobLiterman commented 6 years ago

Identical with 1CPU as well.

BobLiterman commented 6 years ago

Running the entire SISRS sites program with 5CPU starting from the raw data yielded identical counts of variable, biallelic, and PI sites as well, on my end.

anderspitman commented 6 years ago

Strange. Yeah when I run the following command:

~/code/sisrs/SISRS/bin/sisrs alignContigs -p8 -a premade -c 0 -f ~/code/sisrs/SISRS/test_data/0_RawData_PremadeGenome/ -z output1 &> alignContigs_Log1

Snippets from the results are below. You can see that the counts are the same (i.e., 62644 GorGor unpaired reads in both cases), but the percentage breakdowns are quite different. I don't think this is a randomness problem, because the results always match between 2 runs on my machine (other than the weird CPU <= 4 thing). I'm not sure what the problem is. Any ideas? For now I think I'll just regenerate the data for the various stages from your source data, since that seems to work as long as it's all on the same machine. It might be a problem in the long run, though, when we try to deploy to the continuous integration server.

Here's mine:

==== Aligning GorGor as Single-Ended ====
62644 reads; of these:
  62644 (100.00%) were unpaired; of these:
    9316 (14.87%) aligned 0 times
    53258 (85.02%) aligned exactly 1 time
    70 (0.11%) aligned >1 times
85.13% overall alignment rate
==== Aligning HomSap as Single-Ended ====
28712 reads; of these:
  28712 (100.00%) were unpaired; of these:
    2568 (8.94%) aligned 0 times
    26073 (90.81%) aligned exactly 1 time
    71 (0.25%) aligned >1 times
91.06% overall alignment rate
==== Aligning HylMol as Single-Ended ====
24872 reads; of these:
  24872 (100.00%) were unpaired; of these:
    6883 (27.67%) aligned 0 times
    17935 (72.11%) aligned exactly 1 time
    54 (0.22%) aligned >1 times
72.33% overall alignment rate
==== Aligning MacFas as Single-Ended ====
14760 reads; of these:
  14760 (100.00%) were unpaired; of these:
    3690 (25.00%) aligned 0 times
    11028 (74.72%) aligned exactly 1 time
    42 (0.28%) aligned >1 times
75.00% overall alignment rate
==== Aligning MacMul as Single-Ended ====
20392 reads; of these:
  20392 (100.00%) were unpaired; of these:
    6249 (30.64%) aligned 0 times
    14109 (69.19%) aligned exactly 1 time
    34 (0.17%) aligned >1 times
69.36% overall alignment rate
==== Aligning PanPan as Single-Ended ====
86790 reads; of these:
  86790 (100.00%) were unpaired; of these:
    13165 (15.17%) aligned 0 times
    73573 (84.77%) aligned exactly 1 time
    52 (0.06%) aligned >1 times
84.83% overall alignment rate
==== Aligning PanTro as Single-Ended ====
86272 reads; of these:
  86272 (100.00%) were unpaired; of these:
    13462 (15.60%) aligned 0 times
    72735 (84.31%) aligned exactly 1 time
    75 (0.09%) aligned >1 times
84.40% overall alignment rate
==== Aligning PonPyg as Single-Ended ====
29114 reads; of these:
  29114 (100.00%) were unpaired; of these:
    9069 (31.15%) aligned 0 times
    20019 (68.76%) aligned exactly 1 time
    26 (0.09%) aligned >1 times
68.85% overall alignment rate

and yours:

==== Aligning GorGor as Single-Ended ====
62644 reads; of these:
  62644 (100.00%) were unpaired; of these:
    7611 (12.15%) aligned 0 times
    54898 (87.63%) aligned exactly 1 time
    135 (0.22%) aligned >1 times
87.85% overall alignment rate
==== Aligning HomSap as Single-Ended ====
28712 reads; of these:
  28712 (100.00%) were unpaired; of these:
    2062 (7.18%) aligned 0 times
    26589 (92.61%) aligned exactly 1 time
    61 (0.21%) aligned >1 times
92.82% overall alignment rate
==== Aligning HylMol as Single-Ended ====
24872 reads; of these:
  24872 (100.00%) were unpaired; of these:
    5207 (20.94%) aligned 0 times
    19588 (78.76%) aligned exactly 1 time
    77 (0.31%) aligned >1 times
79.06% overall alignment rate
==== Aligning MacFas as Single-Ended ====
14760 reads; of these:
  14760 (100.00%) were unpaired; of these:
    3155 (21.38%) aligned 0 times
    11547 (78.23%) aligned exactly 1 time
    58 (0.39%) aligned >1 times
78.62% overall alignment rate
==== Aligning MacMul as Single-Ended ====
20392 reads; of these:
  20392 (100.00%) were unpaired; of these:
    4972 (24.38%) aligned 0 times
    15373 (75.39%) aligned exactly 1 time
    47 (0.23%) aligned >1 times
75.62% overall alignment rate
==== Aligning PanPan as Single-Ended ====
86790 reads; of these:
  86790 (100.00%) were unpaired; of these:
    10831 (12.48%) aligned 0 times
    75868 (87.42%) aligned exactly 1 time
    91 (0.10%) aligned >1 times
87.52% overall alignment rate
==== Aligning PanTro as Single-Ended ====
86272 reads; of these:
  86272 (100.00%) were unpaired; of these:
    11040 (12.80%) aligned 0 times
    75110 (87.06%) aligned exactly 1 time
    122 (0.14%) aligned >1 times
87.20% overall alignment rate
==== Aligning PonPyg as Single-Ended ====
29114 reads; of these:
  29114 (100.00%) were unpaired; of these:
    7024 (24.13%) aligned 0 times
    22059 (75.77%) aligned exactly 1 time
    31 (0.11%) aligned >1 times
75.87% overall alignment rate

anderspitman commented 6 years ago

@BobLiterman I've been trying to get consistent behavior without too much success. I think I've narrowed a good chunk of it down to a specific problem. Can you run the following commands from within the GorGor directory and share your output?

bowtie2-build ../premadeoutput/contigs.fa ./contigs
bowtie2 -p 1 -N 1 --local -x ./contigs -U GorGor_A_R1.fastq,GorGor_A_R2.fastq,GorGor_B_R1.fastq,GorGor_B_R2.fastq > out.sam

anderspitman commented 6 years ago

When varying -p X I am getting different answers for anything less than 4. Tried on 2 different machines. Could this be a bug in bowtie or am I just completely misunderstanding what the output means?

anderspitman commented 6 years ago

Sorry, that wasn't specific enough. Can you run the commands from 0_RawData_PremadeGenome/GorGor?

BobLiterman commented 6 years ago

Ran from the 0_RawData_PremadeGenome folder. Same results each time. You may want to consider reinstalling Bowtie2?

bowtie2-build premadeoutput/contigs.fa premadeoutput/contigs

1_Processor

bowtie2 -p 1 -N 1 --local -x ./premadeoutput/contigs -U ./GorGor/GorGor_A_R1.fastq,./GorGor/GorGor_A_R2.fastq,./GorGor/GorGor_B_R1.fastq,./GorGor/GorGor_B_R2.fastq > out.sam

62644 reads; of these:
  62644 (100.00%) were unpaired; of these:
    7611 (12.15%) aligned 0 times
    54898 (87.63%) aligned exactly 1 time
    135 (0.22%) aligned >1 times
87.85% overall alignment rate

5_Processors

bowtie2 -p 5 -N 1 --local -x ./premadeoutput/contigs -U ./GorGor/GorGor_A_R1.fastq,./GorGor/GorGor_A_R2.fastq,./GorGor/GorGor_B_R1.fastq,./GorGor/GorGor_B_R2.fastq > out.sam

62644 reads; of these:
  62644 (100.00%) were unpaired; of these:
    7611 (12.15%) aligned 0 times
    54898 (87.63%) aligned exactly 1 time
    135 (0.22%) aligned >1 times
87.85% overall alignment rate

20_Processors

bowtie2 -p 20 -N 1 --local -x ./premadeoutput/contigs -U ./GorGor/GorGor_A_R1.fastq,./GorGor/GorGor_A_R2.fastq,./GorGor/GorGor_B_R1.fastq,./GorGor/GorGor_B_R2.fastq > out.sam

62644 reads; of these:
  62644 (100.00%) were unpaired; of these:
    7611 (12.15%) aligned 0 times
    54898 (87.63%) aligned exactly 1 time
    135 (0.22%) aligned >1 times
87.85% overall alignment rate

anderspitman commented 6 years ago

Very strange. Maybe it is bowtie.

anderspitman commented 6 years ago

Latest version of bowtie2 (2.3.3.1) seems to fix it! Thanks for the suggestion. Just out of curiosity, which version are you using? 2.3.2 is the one that wasn't working for me.

BobLiterman commented 6 years ago

The current version on my cluster is 2.2.4

anderspitman commented 6 years ago

@rachelss @BobLiterman, I ended up settling on bamHash for comparing BAMs. bamUtil didn't work out, because I noticed that bowtie2 would occasionally swap a couple of lines in its output; even though the files were semantically the same, bamUtil called them different. The way bamHash works is that it goes through a BAM file and makes a hash of all the read names and sequences. So if two files have the same read names and sequences they will have the same hash, regardless of the order of the reads. This is simple and quite fast. My concern is whether this is sufficient to guarantee that two runs of SISRS are producing the same output. Keep in mind my goal is to detect whether a change to the SISRS code causes its output to change in a meaningful way. So my question is: if we can guarantee that 2 BAM files contain the same set of read names and sequences, is that enough to guarantee the files are the same for our purposes? Or do we also need to verify the mapping locations, etc.?
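
Usage looks something like this (a sketch; binary name per the BamHash repo, file names illustrative):

```bash
# Hash each BAM's read names + sequences; matching hashes mean the same
# read set regardless of record order.
bamhash_checksum_bam run1/GorGor.bam
bamhash_checksum_bam run2/GorGor.bam
# Then compare the two printed hashes to decide equality.
```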

BobLiterman commented 6 years ago

Given a genome and a read, issuing the same Bowtie command should give you the same mapping (from our tests here). For your purposes (converting Bash to Python starting from the same data), this shouldn't be an issue (as opposed to if you were tweaking the fundamentals of how Bowtie maps or how the genome is assembled).

rachelss commented 6 years ago

A BAM with the same read names and sequences does not necessarily have the same mapping. If you map data to two different genomes, the reads and sequences will be the same, but obviously they will map differently. For a simple dataset (reads and genome) I would expect reads to all map to the same place, although for complex data reads could map differently and still be accurate. If possible I would verify mapping for the test data.

BobLiterman commented 6 years ago

I believe, and correct me if I'm wrong, that this is just for the Python port process, not to integrate into SISRS proper.

rachelss commented 6 years ago

@anderspitman - This is for testing regardless of whether we port or not? I'm not sure a change in mapping indicates a detrimental change in the code. Assembly could vary, and mapping will vary. I'll have to think about this.

anderspitman commented 6 years ago

Some clarification and good news:

To @BobLiterman's point, it is true that Bowtie should be deterministic. What I have observed, though, is that if two reads have the same name, samtools seems to decide randomly how to sort them; this is because samtools sorts by name, regardless of other content. In the test data we're using there are some cases of this.

That said, after thinking about it a bit more, I realized I was overcomplicating things. I created a simple script that takes the two BAMs you want to compare, strips the headers and converts to SAM, sorts them with GNU sort, then compares them with GNU diff. This is a deterministic sort that solves the problem for all the test data I'm currently using. So as of right now the tests compare all the content of the BAMs coming out of SISRS, including mapping.
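
The core of the script is basically this (a sketch, assuming samtools is on the PATH):

```bash
#!/usr/bin/env bash
# Compare two BAMs ignoring header and record order: convert to SAM
# without headers (samtools view omits the header by default), sort
# byte-wise with GNU sort, then diff. Exits non-zero on any difference.
set -euo pipefail
diff <(samtools view "$1" | LC_ALL=C sort) \
     <(samtools view "$2" | LC_ALL=C sort)
```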

You can see my biostars question here for more details.

anderspitman commented 6 years ago

If we end up having issues with other data or a non-bowtie2 aligner down the line, we might need to look into this more, but it seems pretty good for now.

rachelss commented 6 years ago

Basic sort and compare seems very reasonable