nhoffman / bioy

Tools for NGS sequence analysis and bacterial classification
GNU General Public License v3.0
0 stars 0 forks source link

========================================== bioy: a collection of bioinformatics tools

Bio-y (pronounced "bio-ee") The adjective form of the noun "Bio"

.. contents:: Table of Contents

authors

dependencies

Some functions require the following python packages:

And other require external programs, including:

installation

To install bioy and python dependencies, run setup.py or pip from the project directory::

% cd bioy % python setup.py install

or

% pip install -U .

If you don't want to install the dependencies (numpy and pandas take a while to compile), use::

% pip install --no-deps -U .

Numpy and pandas require many dependencies to compile (and you'll likely need to compile them because versions in package managers are typically out of date). Fortunately, these can pretty easily be installed on Ubuntu 12.04 by running::

% sudo apt-get build-dep python-numpy python-pandas

A virtualenv containing a complete python execution environment can be created using dev/bootstrap.sh::

% dev/bootstrap.sh -h Create a virtualenv and install all pipeline dependencies Options: --venv - path of virtualenv [bioy-env] --python - path to the python interpreter [/usr/local/bin/python] --wheelstreet - path to directory containing python wheels; wheel files will be in a subdirectory named according to the python interpreter version; uses WHEELSTREET if defined. (a suggested location is ~/wheelstreet) [] --requirements - a file listing python packages to install [requirements.txt]

execution

The bioy script provides the user interface, and uses standard UNIX command line syntax. Note that for development, it is convenient to run bioy from within the project directory by specifying the relative path to the script::

% ./bioy

Commands are constructed as follows. Every command starts with the name of the script, followed by an "action" followed by a series of required or optional "arguments". The name of the script, the action, and options and their arguments are entered on the command line separated by spaces. Help text is available for both the bioy script and individual actions using the -h or --help options::

usage: bioy [-h] [-V] [-v] [-q]

      {help,align_clusters,all_pairwise,blast,children,classifier,classify,cmscores,consensus,csv2fasta,csv2hdf5,csvmod,dedup,denoise,errors,fasta,fasta2csv,fastq_stats,gb2fa,index,map_clusters,primer_trim,pull_reads,repl,reshape,reverse_complement,rldecode,rlencode,split_barcodes,split_reads,ssearch,ssearch2csv,ssearch_count,tree_edit,tsv2csv,usearch}
      ...

Tools for microbial sequence analysis and classification.

positional arguments: {help,align_clusters,all_pairwise,blast,children,classifier,classify,cmscores,consensus,csv2fasta,csv2hdf5,csvmod,dedup,denoise,errors,fasta,fasta2csv,fastq_stats,gb2fa,index,map_clusters,primer_trim,pull_reads,repl,reshape,reverse_complement,rldecode,rlencode,split_barcodes,split_reads,ssearch,ssearch2csv,ssearch_count,tree_edit,tsv2csv,usearch} help Detailed help for actions using help <action> align_clusters Align reads contributing to a denoised cluster. all_pairwise Calculate all Smith-Waterman pairwise distances among sequences. blast Run blastn and produce classify friendly output children Return the children of a taxtable given a list of taxids classifier Classify sequences by grouping blast output by matching taxonomic names classify Classify sequences by grouping blast output by matching taxonomic names cmscores Convert raw cmalign alignment scores to csv format. consensus Calculate the consensus for a multiple aignment csv2fasta Turn a csv file into a fasta file specifying two columns csv2hdf5 Convert a csv file to HDF5 csvmod Add or rename columns in a csv file. dedup Fast deduplicate sequences by coalescing identical substrings denoise Denoise a fasta file of clustered sequences errors Tally and classify errors given ./ion rlaligns reference and query sequences fasta Run the fasta pairwise aligment tool and output in csv format. fasta2csv Turn a fasta file into a csv fastq_stats Describe distributions of sequencing quality scores gb2fa Outputs a standard Genbank Record File into fasta file format and optional seqinfo file in format ['seqname', 'tax_id','accession','description','length','ambig_cou nt','is_type','rdp_lineage'] index Add simple indices to an sqlite database map_clusters Create a readmap and specimenmap and/or weights file from a ncbi_fetch Fetch sequences from NCBI's nucleotide database using sequence identifiers (gi or gb) primer_trim Parse region between primers from fasta file pull_reads Parse barcode, primer, and read from a fastq file repl Replace strings in one or more files. reshape convert a tsv file to a csv with an optional split/add columns feature reverse_complement reverse complement rle and non-rle sequences rldecode Run-length decode a fasta file rlencode Run-length encode a fasta file split_barcodes Partition reads in a fastq file by barcode and write an annotated fasta file split_reads Parse reads from a fasta file by read to specimen csv map file ssearch Run the ssearch (Smith-Waterman) pairwise aligment tool and output in csv format. ssearch2csv Parse ssearch36 -m10 output and print specified contents ssearch_count Tally ssearch base count by position tree_edit Tree leaf name editor that wraps BioPython. tsv2csv convert a tsv file to a csv with an optional split/add columns feature usearch Run usearch global and produce classify friendly output

optional arguments: -h, --help show this help message and exit -V, --version Print the version number and exit -v, --verbose Increase verbosity of screen output (eg, -v is verbose, -vv more so) -q, --quiet Suppress output

versions

We use abbrevited git sha hashes to identify the software version::

% ./bioy --version 0128.9790c13

The version information is saved in bioy_pkg/data when setup.py is run (on installation, or even by executing python setup.py -h).

unit tests

Unit tests are implemented using the unittest module in the Python standard library. The tests subdirectory is itself a Python package that imports the local version (ie, the version in the project directory, not the version installed to the system) of the package. All unit tests can be run like this::

% ./testall
...........
----------------------------------------------------------------------
Ran 11 tests in 0.059s

OK

A single unit test can be run by referring to a specific module, class, or method within the tests package using dot notation::

% ./testone -v tests.test_utils

documentation

To build the Sphinx docs::

(cd docs && make html)

And to publish to GitHub pages::

ghp-import -p docs/_build/html

(ghp-import and Sphinx are both included in the requirements.txt)

license

Copyright (c) 2012 Noah Hoffman

Released under the GPLv3 <http://www.gnu.org/copyleft/gpl.html>_ License