wasade / exhaustive

Add Python script for ease of use #3

Closed hunglin59638 closed 6 months ago

hunglin59638 commented 7 months ago

I have reviewed the sbatch scripts and merged them into a Python script (exhaustive.py) for ease of use. Additionally, I have addressed some bugs in filter_fna.py and trim-to-max-length.py.

Bug fix

filter_fna.py

Fixed a bug where providing a fasta file containing header descriptions would raise a KeyError from coordinates[id_]. This occurred, for example, when encountering headers like >NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0.
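
A minimal sketch of the idea behind the fix (hypothetical names, not the actual filter_fna.py code): the lookup key should be the bare sequence ID, i.e. the first whitespace-delimited token of the header, so that any trailing description is ignored.

# Sketch only -- assumes coordinates is keyed by the bare sequence ID,
# not the full header line; all names here are illustrative.
def header_to_id(header_line):
    """'>NC_060925.1 Homo sapiens ... chromosome 1' -> 'NC_060925.1'"""
    return header_line.lstrip(">").split()[0]

coordinates = {"NC_060925.1": [(0, 150)]}  # hypothetical lookup table

header = ">NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0"
id_ = header_to_id(header)   # 'NC_060925.1'
regions = coordinates[id_]   # no KeyError, even with a description present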

trim-to-max-length.py

Resolved a NameError issue where the variable noclue was referenced but not defined. As it appeared unnecessary, it has been removed from the script.

Python script

Help message

python exhaustive.py --help

Usage: exhaustive.py [OPTIONS]

  Run exhaustive mapping of kmer substrings to a database.

Options:
  -l, --base_label PATH  base label for output files  [default: exhaustive]
  -d, --db_fna FILE      the database in fasta format to map against
                         [required]
  -b, --db_bt2 TEXT      the bowtie2 index of the database to map against; it
                         will be built if not provided
  -f, --files FILE       the paths of genomes to generate substrings from
  -k, --kmer INTEGER     kmer length  [default: 150]
  -j, --jump INTEGER     jump length  [default: 75]
  -t, --threads INTEGER  number of threads to use  [default: 8]
  -c, --clean            remove temporary files
  --help                 Show this message and exit.

Demo

mamba env create -n exhaustive -f /path/to/environment.yml
mamba activate exhaustive
cd /path/to/work
# Download a human mitochondrion sequence to simulate as a contaminated database
wget -q -O mito.fna "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=GU170821.1&rettype=fasta"
bowtie2-build mito.fna mito-bt2

wget -q -O - https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz | 
gunzip -c > human.fna
realpath human.fna > genomes.txt

/path/to/exhaustive.py --base_label demo --db_fna  mito.fna --db_bt2 mito-bt2 --files genomes.txt

Output files

.
├── demo_150_75 # temp directory
│   ├── awk.aggregated
│   ├── awk.aggregated.sorted
│   ├── concat.fna
│   ├── concat.fna.fai
│   ├── exhaustive.bed
│   ├── exhaustive.sam
│   └── exhaustive.trimmed.bed
├── demo-bt2 # masked bowtie2 index
│   ├── demo-bt2.1.bt2
│   ├── demo-bt2.2.bt2
│   ├── demo-bt2.3.bt2
│   ├── demo-bt2.4.bt2
│   ├── demo-bt2.rev.1.bt2
│   └── demo-bt2.rev.2.bt2
├── demo.contaminated.fna # contaminated regions
├── demo.masked.fna  # masked database

wasade commented 7 months ago

Thanks!! Good catch on the noclue... I made a minor change without actually looking close. I just pushed in a fix, would it be possible to pull main?

I like this approach a lot, and the code is quite clean. My current hesitations are that the commands themselves are replicated in both the sbatch scripts and the Python code, and that execution of the Python code is bound to a single job, which complicates the resource request, particularly for large jobs. Of those two, I'm more concerned about the duplication, as it would be very easy to, for example, change parameters in one place but not the other. That said, I'm not opposed to something generally like what's done here.

Do you have any thoughts (or desire) on how to at least address the duplication?

hunglin59638 commented 7 months ago

The duplication issue can be resolved by splitting the Python code into subcommands, which can then be invoked by the sbatch scripts.

Usage: exhaustive.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  export-db        Export staged database fasta
  process-kmers    Mapping kmer substrings to database
  process-regions  Process regions
  filter-db        Mask database fasta with contaminated regions and...
  submit           Run exhaustive cleaning (export-db -> process-kmers ->...

Additionally, I want to be able to execute exhaustive as a plain Python script because some users (like me) may not have Slurm installed, so the sbatch command isn't available.
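
To make the split concrete, here is a minimal sketch of how the subcommand layout could look, assuming the CLI is built with click (the "Usage: exhaustive.py [OPTIONS] COMMAND [ARGS]..." output suggests this). The command names mirror the help output above; the bodies are placeholders, not the real implementation.

import click

@click.group()
def cli():
    """Run exhaustive mapping of kmer substrings to a database."""

@cli.command("process-kmers")
@click.option("-k", "--kmer", default=150, show_default=True, help="kmer length")
@click.option("-j", "--jump", default=75, show_default=True, help="jump length")
def process_kmers(kmer, jump):
    """Mapping kmer substrings to database"""
    # one step of the pipeline; an sbatch script can call just this command

@cli.command("submit")
def submit():
    """Run exhaustive cleaning (export-db -> process-kmers -> ...)"""
    # chains the individual steps for local, non-Slurm execution

if __name__ == "__main__":
    cli()

An sbatch script would then call a single step, e.g. python exhaustive.py process-kmers -k 150 -j 75, so each command and its parameters live in one place.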

wasade commented 7 months ago

Hi @hunglin59638, this is really cool and, I agree, quite helpful -- thank you so much for your interest here. I'd like to test it out locally before merging; it may be a few days before I can do so.

rohitfarmer commented 7 months ago

Hi @hunglin59638 @wasade, I am trying to implement this method. I locally merged this pull request and followed the demo above, but I don't see all the options mentioned in the first section, and when I run exhaustive.py I get the following error.

python3 exhaustive.py --base_label demo --db_fna  mito.fna --db_bt2 mito-bt2 --files genomes.txt
Usage: exhaustive.py [OPTIONS] COMMAND [ARGS]...
Try 'exhaustive.py --help' for help.

Error: No such option: --base_label

When I execute python3 exhaustive.py --help, this is what I get.

Usage: exhaustive.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  export-db        Export staged database fasta
  process-kmers    Mapping kmer substrings to database
  process-regions  Process regions
  filter-db        Mask database fasta with contaminated regions and...
  submit           Run exhaustive cleaning (export-db -> process-kmers ->...
hunglin59638 commented 7 months ago

@rohitfarmer To address the duplication issue, the Python code has been divided into subcommands. Please type python3 exhaustive.py submit --help to view the available options.
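
For example, the demo command above would become something like the following (assuming the option names are unchanged from the earlier flat CLI):

python3 exhaustive.py submit --base_label demo --db_fna mito.fna --db_bt2 mito-bt2 --files genomes.txt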

suchetagodbole commented 7 months ago

Hi! I have a basic question about using this pipeline: do I need to create the contaminant database bt2 files for each sample (read1 and read2) I am trying to run through the pipeline? Thanks in advance.

wasade commented 6 months ago

Sorry for such a delay, thanks @hunglin59638!