makovalab-psu / DiscoverY

K-mer based classifier for Y-contig identification from Whole Genome Assemblies
MIT License
11 stars 5 forks source link

DiscoverY

DiscoverY is a tool to shortlist Y-specific contigs from an assembly of male whole genome sequencing data, based on exact k-mer matches with a female. This tool is platform agnostic and has been tested on male assemblies from Illumina, PacBio and 10x Genomics. The female can be a reference assembly or a low coverage raw reads dataset. DiscoverY can be ran in two different modes: female_only or female+male. In the female_only mode, the proportion shared between each contig with a female reference is computed; female+male mode uses both proportion of k-mers of each contig shared with a reference female and the kmer counts from the male raw reads to estimate each contig's depth-of-coverage.

Usage

Before running DiscoverY, the following input files are required in ./data folder. Note that currently the names of the "data" folder and of the files are hardcoded into DiscoverY.

male_contigs.fasta : contigs from WGS assembly, which will be annotated by running DiscoverY
kmers_from_male_reads : kmers from raw male reads used for assembly, used for computing coverage
[optional] female.bloom : Bloom Filter of kmers from female if available
[optional] female_kmers : kmers from female if Bloom Filter has not been constructed
[optional] female.fasta : Female reference in FASTA format if BF and kmer set both unavailable

See below for more info on generating these files.

A typical run of DiscoverY looks like this.

python discoverY.py --female_bloom --mode female+male

DiscoverY accepts the following parameters.

python discoverY.py [--help] 

The output of DiscoverY is an annotated file: proportion_annotated_contigs.fasta in which the fasta headers have information about the proportion shared with female and estimated depth of coverage.

The fasta file will be annotated like the following in the female_only mode:

>record_id length_of_contig proportion_shared_with_female

And in female+male mode:

>record_id length_of_contig proportion_shared_with_female median_k-mer_abundance

DiscoverY in 'best' mode

If the user has a labeled dataset, Discover Y can be run in 'best' mode after obtaining the annotated male contigs. The jupyter notebook classifier.ipynb in the classifier folder may be used, as follows:

    cd classifier
    jupyter notebook classifier.ipynb

This notebook uses labeled data to pick the best threshold of male_coverage and female_proportion to retrieve Y-contigs, by using a Linear SVM classifier. The precision and recall for the best combination is reported in the notebook, and a plot is generated with Y-contigs and the chosen threshold.

Installation

To download,

git clone https://github.com/makovalab-psu/DiscoverY

DiscoverY is written in Python 3 and requires some dependencies to be installed.

For the female_only and female+male modes used to annotate the contigs, the following packages can installed as follows:

pip install numpy
pip install biopython
pip install cython
pip install pybloomfiltermmap3

If the user also choses to run DiscoverY in best mode, the following dependencies should also be installed:

pip install sklearn
pip install matplotlib
pip install seaborn

DiscoverY also uses the k-mer counter DSK. The latest DSK binaries (v2.2.0 for Linux 64 bit and v2.2.0 for Mac OSX) are provided in the dependency folder. Thus, if you are using either of these operating systems, DSK need not be installed, and you may use the binaries as provided. For other operating systems, or if alternate versions or functionality of DSK is desired, see https://gatb.inria.fr/software/dsk/.

Example

The /data folder contains example data.

Input files

Generating kmers_from_male_reads

The kmers_from_male_reads is a file which includes a line for every distinct kmer in the male reads dataset, and each line contains the kmer sequence followed by whitespace followed by the number of times it occurs. For example, the first three lines could look like this.

AAAAAAAAAAAAAAAAGAAAAACAA 5
AAAAAAAAAAAAAAACAAGCTGAAT 8
AAAAAAAAAAAAAAAGAAAAACAAA 3

The kmers_from_male_reads file can be generated by the k-mer counter DSK. The ./dependency folder contains DSK binaries for Linux 64 bit and Mac OSX. Usage is as follows (example shown below is for a Linux system) :

cd dependency
./run_dsk_Linux.sh <FASTQ_file> <kmer_size>

If the k-mer counts file for male is not already provided, the user may need to generate k-mer counts manually using DSK. To generate k-mer counts with DSK, the following steps are needed :

cd dependency 
ln -s ../data/female.fasta  # make sure the correct reads file is provided to DSK
./run_dsk_Linux.sh r1.fastq 25  

The kmer_counts table will be generated in :

dependency/dsk_output/kmers_from_female

This file can be copied or linked to the data folder so that DiscoverY can use it :

cd ../data
ln -s ../dependency/kmers_from_female 

Miscellaneous

The following scripts are included with this distribution of DiscoverY, and are automatically run by discovery.py as part of the pipeline. Users may consider them separately for custom needs if required.

kmers.py

a set of general purpose functions to work with kmers

classify_ctgs.py

input : all contigs from WGS assembly, male read kmers and female kmers.
output : annotated contigs with proportions shared with female

License

This program is released under the MIT License. Please see LICENSE.md for details

Citation

If you use DiscoverY in your research, please cite :

Rangavittal, S., Stopa, N., Tomaszkiewicz, M. et al. BMC Genomics (2019) 20: 641. https://doi.org/10.1186/s12864-019-5996-3