harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Develop Test Datasets #12

tsackton closed this issue 2 years ago

tsackton commented 2 years ago

At the moment, we have only ad hoc testing for this pipeline. To facilitate development and distribution, we need both unit tests and one or more example datasets packaged with the repository, so users can run them and confirm that the pipeline works.

For unit tests, given the right input dataset (small and fast), we can use Snakemake's automatic unit test generation, described in the Snakemake documentation.
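As a rough sketch of what that could look like, the snippet below wraps Snakemake's `--generate-unit-tests` option in a small Python helper. The Snakefile path and the helper name `unit_test_command` are assumptions for illustration, not something already in this repository. Snakemake builds the tests from the inputs and outputs of jobs that have already run, so the workflow needs to have completed once on the test data first.

```python
import shutil
import subprocess


def unit_test_command(snakefile="Snakefile", test_dir=".tests/unit"):
    """Build the Snakemake command line that generates per-rule unit tests.

    Both default paths are assumptions; adjust them to the repository layout.
    """
    return ["snakemake", "--snakefile", snakefile, "--generate-unit-tests", test_dir]


if __name__ == "__main__":
    cmd = unit_test_command()
    if shutil.which("snakemake") is None:
        # Only report the command when Snakemake is not installed.
        print("snakemake not found; would run:", " ".join(cmd))
    else:
        # Fails loudly if test generation does not succeed.
        subprocess.run(cmd, check=True)
```

The generated tests should then be runnable with `pytest` pointed at the test directory.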

For a sample dataset, ideally we want something that runs fast, but also hits as many rules as possible. E.g., the current E. coli data does not run the interval creation rules. It may not be possible to come up with a single dataset that is sufficient for all testing purposes.
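One way to check how many rules a candidate dataset hits is to compare the rules declared in the Snakefile against the rules that actually ran. The sketch below does this with a simple regular expression. The rule names and the list of executed rules are hypothetical and do not reflect the actual snpArcher rules; in practice the executed rules would come from a Snakemake log or summary.

```python
import re

# Matches top-level rule declarations such as "rule bwa_map:".
RULE_PATTERN = re.compile(r"^rule\s+(\w+)\s*:", re.MULTILINE)


def declared_rules(snakefile_text):
    """Return the names of all rules declared in a Snakefile's text."""
    return set(RULE_PATTERN.findall(snakefile_text))


def missed_rules(snakefile_text, executed):
    """Return declared rules that a test run never executed, sorted by name."""
    return sorted(declared_rules(snakefile_text) - set(executed))


# Hypothetical workflow text; the rule names are illustrative only.
EXAMPLE = """
rule all:
    input: "final.vcf"

rule download_reads:
    output: "reads.fastq"

rule make_intervals:
    output: "intervals.list"

rule call_variants:
    input: "reads.fastq"
    output: "final.vcf"
"""

if __name__ == "__main__":
    # Rules that a hypothetical small dataset actually ran.
    ran = ["all", "call_variants"]
    print(missed_rules(EXAMPLE, ran))
```

A dataset that leaves this list empty would cover every rule; otherwise the remaining names show which rules need a second dataset.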

Let's use this issue to manage discussion of test data.

cademirch commented 2 years ago

I was thinking the Bhduck test data in main could be a good starting point? I will work on implementing the automatic unit tests. For the downloading rules, we should find a super small SRR and genome that would be quick to download so the test doesn't take too long.

Also, it would be cool to implement GitHub Actions for automatic integration testing. Snakemake already has resources for this: https://github.com/snakemake/snakemake-github-action. Thoughts?

tsackton commented 2 years ago

Yeah, the BHduck data should be a decent starting place. There are a few rules that data doesn't hit by default: I think the interval creation stuff (because the 'genome' is too small) and, obviously, the download stuff.

Agree completely on GitHub Actions. Would be awesome to get that working too.

For a super small SRR, here is the smallest I could turn up with some quick Googling: SRR13660059. File size is 4.1 Mb, which should be manageable.

tsackton commented 2 years ago

Closed (for now) by #19