peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License

Update prepare & pipeline to handle multiple plates #212

Closed peterjc closed 4 years ago

peterjc commented 4 years ago

Currently the thapbi_pict pipeline and thapbi_pict prepare commands are expected to be called on an entire plate at once, so that the minimum abundance threshold can be set automatically from the negative control samples.

It would be nice to be able to run the pipeline on a collection of plates (e.g. a list of folders), and probably the easiest way to do this is for the prepare step to set the abundance threshold on a folder-by-folder basis?
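To make the folder-by-folder idea concrete, here is a minimal sketch, not the actual thapbi_pict code. It assumes the highest sequence abundance seen in each negative control file has already been computed elsewhere. The function name `thresholds_by_folder`, the input mapping, and the default minimum of 100 are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def thresholds_by_folder(control_max_abundance, default_min=100):
    """Return {folder: threshold} from {control file path: max abundance}.

    Each folder (assumed to be one plate) gets its own threshold: the
    largest abundance seen in any of its negative controls, but never
    lower than default_min (a hypothetical default).
    """
    thresholds = defaultdict(lambda: default_min)
    for path, abundance in control_max_abundance.items():
        # Group controls by the folder they sit in
        folder = str(PurePosixPath(path).parent)
        thresholds[folder] = max(thresholds[folder], abundance)
    return dict(thresholds)


controls = {
    "raw_data/plate1/control_A.fastq.gz": 250,
    "raw_data/plate1/control_B.fastq.gz": 80,
    "raw_data/plate2/control_A.fastq.gz": 40,
}
print(thresholds_by_folder(controls))
```

With this grouping, a noisy control on one plate raises the threshold for that plate only, rather than for every plate in the run.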

peterjc commented 4 years ago

Using #213 this seems to work, but for optimal cluster usage we would want to split the jobs (and doing it by plate/pool seems natural).

However, pooling all the input raw data into a shared intermediate/ folder lends itself to incremental updates, e.g. something like

$ thapbi_pict pipeline -i raw_data/ -n raw_data/*/control*.fastq.gz -s intermediate/ -o reports/

where raw_data/ would contain a folder for each sequencing run, each of which would have its minimum abundance threshold set as a pool, or:

$ thapbi_pict pipeline -i thapbi20*/raw_data -n thapbi20*/raw_data/{GC,gBlock}*R[12]*.fastq.gz -s intermediate/ -o reports

where the shell glob pattern (with brace expansion) allows for two control naming conventions.
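For reference, the two control naming conventions in the second command can also be checked from Python. This is only a sketch: the shell expands `{GC,gBlock}` into two separate patterns before the command runs, so the two patterns are listed explicitly here. The helper name `is_control` and the example filenames are hypothetical.

```python
from fnmatch import fnmatchcase

# The two patterns the shell would produce from {GC,gBlock}*R[12]*.fastq.gz
CONTROL_PATTERNS = ("GC*R[12]*.fastq.gz", "gBlock*R[12]*.fastq.gz")


def is_control(filename):
    """Return True if filename follows either control naming convention."""
    # fnmatchcase is case-sensitive on every platform, like a POSIX shell glob
    return any(fnmatchcase(filename, pattern) for pattern in CONTROL_PATTERNS)


for name in ("GC-1_R1_001.fastq.gz", "gBlock_R2.fastq.gz", "sample_R1.fastq.gz"):
    print(name, is_control(name))
```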

peterjc commented 4 years ago

Works OK now, but duplicated sample names are a problem (thus far only happened on control samples). Might need to incorporate folder names as a prefix for intermediate files?

Or, could be resolved via #219 by using a DB rather than intermediate files on disk.
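As a rough illustration of the folder-name prefix idea for intermediate files, the sketch below builds an output filename from the relative folder path plus the sample name, so that two controls with the same name in different folders no longer collide. It is not how thapbi_pict names its files: the function `intermediate_name`, the `__` separator, and the `.fasta` suffix are all assumptions, and relative input paths are assumed.

```python
from pathlib import PurePosixPath

# FASTQ extensions to strip from the sample name (an assumed list)
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def intermediate_name(fastq_path, suffix=".fasta"):
    """Return a folder-prefixed intermediate filename for a relative path."""
    path = PurePosixPath(fastq_path)
    stem = path.name
    for ext in FASTQ_EXTENSIONS:
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
            break
    # Use every folder component, since the last one alone (e.g. raw_data)
    # may be identical across sequencing runs
    return "__".join(path.parent.parts + (stem,)) + suffix


print(intermediate_name("thapbi2019/raw_data/GC_R1.fastq.gz"))
print(intermediate_name("thapbi2020/raw_data/GC_R1.fastq.gz"))
```

The trade-off is longer filenames in the shared intermediate folder, and the sample name seen in reports would need to be recovered by stripping the prefix again.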