raunaq-m / MLEHaplo

Maximum Likelihood Estimation for Viral Populations
http://arxiv.org/abs/1502.04239
BSD 2-Clause "Simplified" License
5 stars 6 forks source link

ViPRA-Haplo

ViPRA-Haplo: de novo reconstruction of viral populations using paired end sequencing data

Pre-requisites:

  1. Trimmomatic: trimming
  2. Karect: error correction
  3. multi-dsk: k-mer counting software ( Extension of dsk [http://minia.genouest.org/dsk/] version 1.5655)
    • Counts k-mers for multiple values of k simultaneously. The software doesn't combines the counts of forward and reverse complement k-mers, as is performed in traditional k-mer counting softwares
  4. perl with modules Bio::Perl, Getopt::Long, Graph.

Dockerfile

A dockerfile containing all the dependencies is now available. Find more documenation on Docker here.

To build a docker image, you would need Docker installed in your HPC or cloud environment. Create an empty folder with just the Dockerfile in it and type

docker build --tag mlehaplo:1.0 . 

This will create the docker container containing all relevant scripts and libraries for you to run MLEHaplo. You can now start a live container by typing:

docker run -it mlehaplo:1.0

ViPRA-Haplo workflow

Fig. 1. Reconstruction of viral haplotypes using paired-end data

The following is a list of steps/commands you'd have to follow to run MLEHaplo.

Preliminaries

Step 1: Trimming and Error Correction for real data

Example:

java -jar Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 SRR13744684_1.fastq SRR13744684_2.fastq
SRR13744684_forward_paired.fq SRR13744684_forward_unpaired.fq SRR13744684_reverse_paired.fq SRR13744684_reverse_unpaired.fq
ILLUMINACLIP:Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15
MINLEN:60
karect -correct -inputfile=SRR13744684_1.fa -inputfile=SRR13744684_2.fa -celltype=haploid
-matchtype=hamming -aggressive=5.0 -numstages=2 -errorrate=0.25 -errorratesec=0.25

Step 2: Generate k-mer counts file

Command: multi-k fasta/fastq_file list_of_kvalues -d diskspace_limit -m memory_limit

Example:

./multi-dsk/multi-dsk Example/paired-reads.fasta  Example/listofkmers.txt  -m 8192 -d 10000

Step 3: De-compress the output of multi-dsk

Command: parse_results prefix.solid_kmers_binary.kvalue_file > prefix_file.kvalue

Example:

./multi-dsk/parse_results Example/paired-reads.solid_kmers_binary.60 > paired-reads.60

Step 4: Generate the De Bruijn graph

(Fig. 1D)

Generating the graph needs two files and a parameter. This will combine paired files into a single file.

  1. fasta_file containing all the reads.
  2. kmer_file generated above.
  3. threshold value for ignoring erroneous k-mers.

Command: perl construct_graph.pl fasta_file kmer_file threshold graph_file "s"

Example:

perl construct_graph.pl  Example/paired-reads.fasta paired-reads.60 0 paired-reads.60.graph "s"

Step 5: Create the paired set file

(Fig. 1D)

Create the paired set using the paired reads. It takes as input the two paired files, file1.fasta and file2.fasta, the k-mer counts file file1.kvalue, and a threshold for ignoring erroneous k-mers

Command: perl construct_paired_without_bloom.pl -file1 file1.fasta -file2 file2.fasta -paired -kmerfile file1.kvalue -thresh number -wr output_paired_set_file

Example:

perl construct_paired_without_bloom.pl -fasta Example/paired-reads.fasta -kmerfile paired-reads.60
  -thresh 0 -wr paired-reads.60.pk.txt

VIPRA

(Fig. 1E)

Step A: Running VIPRA

Running the VIPRA algorithm takes inputs generated above and a parameter for the average insert size, threshold parameter and a value for M (factor) which decides the number of paths to generate per vertex

Command: perl dg_cover.pl -graph graph_file_from_step3 -kmer kmer_file_from_step2 -paired paired_set_from_step4 -fact M -thresh threshold_value -IS insert_size > vipra_output_file

Example:

perl dg_cover.pl -graph paired-reads.60.graph -kmer paired-reads.60 -paired paired-reads.60.pk.txt -fact 5
  -thresh 0 -IS 400 > paired-reads.60.fact5.txt

Step B: Generate fasta file

Extracting fasta file from outputfile

Command: perl process_dg.pl vipra_output_file > paths_fasta_file

Example:

perl process_dg.pl paired-reads.60.fact5.txt > paired-reads.60.fact5.fasta

Step C: Generate paths file for maximum likelihood estimation

Extracting just the paths in terms of nodes in the graph

Command: perl get_paths_dgcover.pl -f vipra_output_file -w paths_write_file

Example:

perl get_paths_dgcover.pl -f paired-reads.60.fact5.txt -w paired-reads.60.fact5.paths.txt

VSEARCH clustering

(Fig. 1F)

centroid-based clustering by similarity. A centroid in a cluster is a contig.

Command: vsearch --cluster_fast paths_fasta_file --id --centroids cluster_centroid_sequences --consout cluster_consensus_sequences

Example

vsearch --cluster_fast paired-reads.60.fact5.fasta --id 0.995
 --centroids paired-reads.60.fact5.centroids.0.995.fa
 --consout paired-reads.60.fact5.consout.0.995.fa

MLEHaplo

(Fig. 1G)

(For larger datasets, this step needs parallelism to reduce time cost.)

Running MLEHaplo takes as input intermediate files generated by VIPRA Step A: Running VIPRA and paths generated by VSEARCH VSEARCH clustering, which is a subset of paths_write_file generated in Step C: Generate paths file for maximum likelihood estimation

Command: perl dg_subpath.pl paths_write_file cluster_centroid_sequences subset_paths_file

Example

perl dg_subpath.pl paired-reads.60.fact5.paths.txt
 paired-reads.60.fact5.centroids.0.995.fa paired-reads.60.fact5.0.995.paths.txt 

use this new paths file for likelihood_singles_wrapper.pl

Non Parallel Version Command:

perl likelihood_singles_wrapper.pl -condgraph prefix.cond.graph -compset prefix.comp.txt -pathsfile subset_paths_file -back -gl approximate_genome_size -slow > MLE_textfile_output

Example

perl likelihood_singles_wrapper.pl -condgraph paired-reads.60.cond.graph -compset paired-reads.60.comp.txt
  -pathsfile paired-reads.60.fact5.0.995.paths.txt  -back -gl 1200 -slow  > paired-reads.60.smxlik.txt

Required input files & Parameters

  1. prefix.cond.graph: Condensed Graph file generated by VIPRA .
  2. prefix.comp.txt: Compatible Set generated by VIPRA.
  3. paths_write_file: Paths file from ViPRA.
  4. approximate_genome_size: Approximate genome Length.

Final viral population generation using MLE_textfile_output

Command: perl extract_MLE.pl -f paths_fasta_file -l MLE_textfile_output > MLE_Haplo_fasta_OUTPUT

Example

perl extract_MLE.pl -f paired-reads.60.fact5.fasta
  -l paired-reads.60.smxlik.txt > paired-reads.60.MLE.fasta

paired-reads.60.MLE.fasta is the final output after ViPRA, VSEARCH, and MLEHaplo.

~~ update for the ViPRA-Haplo paper (submitted in 2022) ~~