isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository, which is no longer officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

running racon in parallel on split fasta files #92

Open pjm43 opened 6 years ago

pjm43 commented 6 years ago

I have a moderately sized (1 Gb) plant genome that I'm trying to polish with 60X Illumina reads using racon. I know I can use racon_wrapper to split the genome to avoid excessive memory requirements. My understanding is that racon runs those split fasta files serially, so I'm wondering if I can speed up the process by running them in parallel. Do you have any suggestions on how to do this? Can I use the original SAM file (Illumina reads mapped to the whole draft genome) for each subset of the target fasta sequences? Are there any "other" concerns with doing this that I'm not recognizing?

Thanks in advance.

rvaser commented 6 years ago

Hi Jeff, if your data won't fit in memory then you should use the wrapper. Although it runs the split files sequentially, each run is parallelized by feeding the threads single windows rather than whole contigs. The only reason it is a bit slower is that it must parse all input files several times (once per split file). If you ran multiple runs in parallel, you would get the behaviour of default racon (but might see higher memory consumption, as each run would load all the Illumina reads into memory).

Best regards, Robert
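Robert's point about feeding threads single windows could be sketched as follows (a hypothetical Python illustration, not racon's actual C++ internals; the 500 bp window matches racon's default -w, and the per-window consensus step is stubbed out):

```python
from concurrent.futures import ThreadPoolExecutor

WINDOW = 500  # racon's default window length in bp (-w)

def windows(contigs, window=WINDOW):
    """Yield one (contig_id, start, end) job per window across all contigs."""
    for cid, seq in contigs.items():
        for start in range(0, len(seq), window):
            yield cid, start, min(start + window, len(seq))

def polish_window(job):
    cid, start, end = job
    # stub: the real work is a partial-order-alignment consensus of the
    # reads overlapping this window
    return (cid, start, end)

def polish(contigs, threads=4):
    # threads pull individual windows, so even a single huge contig
    # keeps every thread busy
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(polish_window, windows(contigs)))
```

Because the unit of work is a window rather than a contig, one run on one split file already keeps all threads busy; running splits concurrently mainly trades memory for wall-clock time.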

pjm43 commented 6 years ago

Robert,

Thanks for the quick reply! You'll have to forgive my ignorance - I'm not sure I really understand "feeding the threads with single windows rather than contigs". I'm running this on our cluster and I guess I assumed racon would be quicker: after 24 hours it only appears to have processed one of the 10 splits made by racon_wrapper.

It seems like I could run all the split fasta in parallel by launching individual jobs for each of the 10 split fastas - thus cutting my time from what seems like 10 days to run to ~1 day. I was going to run each of the split fastas with racon as follows:

#!/bin/bash
#SBATCH --time=72:00:00        # walltime
#SBATCH --ntasks=24            # number of processor cores (i.e. tasks)
#SBATCH --nodes=1              # number of nodes
#SBATCH --mem-per-cpu=21G      # memory per CPU core

module purge
module load racon

racon -u -t 24 illumina_reads.fq.gz reads.sam canu_assembly_0.fasta > canu_assembly_0_racon_polished.fasta

Any thoughts on this strategy?

Thanks, Jeff

rvaser commented 6 years ago

Which version of racon are you running? I suspect there might be a problem, because it took quite long to process one file.

If you have multiple nodes, you could run each split file on separate nodes in parallel. You can as well manually split the target sequence file and use the original SAM file as you suggested in your first post.
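The manual route Robert describes could look something like this (a hypothetical sketch; the two printf lines just build a tiny demo split and a tiny whole-genome SAM, where in practice these would be your real canu_assembly_*.fasta splits and the original reads.sam):

```shell
# demo inputs (stand-ins for a real split FASTA and whole-genome SAM)
cd "$(mktemp -d)"
printf '>c1 len=4\nAAAA\n>c2\nCCCC\n' > canu_assembly_0.fasta
printf '@HD\tVN:1.6\nr1\t0\tc1\t1\t60\t4M\t*\t0\t0\tAAAA\tIIII\nr2\t0\tc3\t1\t60\t4M\t*\t0\t0\tGGGG\tIIII\n' > reads.sam

for part in canu_assembly_*.fasta; do
    # contig names in this split (drop '>' and any description after a space)
    grep '^>' "$part" | sed 's/^>//; s/ .*//' > "${part%.fasta}.contigs"
    # keep SAM header lines plus records whose reference (column 3) is in the split
    awk 'NR==FNR {keep[$1]=1; next}
         /^@/ || ($3 in keep)' "${part%.fasta}.contigs" reads.sam \
        > "${part%.fasta}.sam"
done

grep -v '^@' canu_assembly_0.sam   # r1 (on c1) survives; r2 (on c3) is dropped
```

Each split can then be polished independently with its own SAM, e.g. racon -u -t 24 illumina_reads.fq.gz canu_assembly_0.sam canu_assembly_0.fasta.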

pjm43 commented 6 years ago

I'm running v1.3.1.

I have access to several nodes - I'll give that a go.

Thanks,

Jeff

rvaser commented 6 years ago

Which commit? If you don't have the latest one then you might have the error causing an infinite loop.

pjm43 commented 6 years ago

The commit I have is f5af1c7 - I think this is the latest.

Jeff

jelber2 commented 5 years ago

Here is how one might do your task, @pjm43:

# assumes you have the newest versions of BBMap, minimap2, racon, and GNU Parallel in your PATH
# assumes the genome you want to polish is genome.fasta and the paired-end Illumina reads are
# illumina-reads-1.fastq and illumina-reads-2.fastq (interleaved into illumina-reads.fastq in Step 1)

## Step 1 Interleave FASTQ files with reformat.sh from BBTools
reformat.sh in1=illumina-reads-1.fastq addslash=t spaceslash=f \
trimreaddescription=t in2=illumina-reads-2.fastq out=illumina-reads.fastq 2> interleave-fastq.log

## Step 2 split assembly into 99 parts
wget http://kirill-kryukov.com/study/tools/fasta-splitter/files/fasta-splitter-0.2.6.zip
unzip fasta-splitter-0.2.6.zip
perl fasta-splitter.pl --n-parts 99 genome.fasta --out-dir racon-split/

## Step 3 map the illumina reads to the genome.fasta assembly
### (Step 4 needs the finished PAF, so wait for this to complete before continuing)
minimap2 -t 75 -x sr genome.fasta illumina-reads.fastq \
> illumina-reads-mapped-to-genome.fasta.paf \
2> illumina-reads-mapped-to-genome.fasta.log

## Step 4 split the PAF file into 99 parts (the number of splits has to match Step 2 above) by grabbing only
### the contigs present in each split FASTA file

### a make a list of the samples (01, 02, 03, ..., 99)
cd racon-split/
seq -w 1 99 > samples

### b get the scaffold/contig names in a separate file
while read i;do
grep ">" genome.part-${i}.fasta |perl -pe "s/>(\S+).*/\1/g" > genome.part-${i}.contigs
done < samples

### c make a list of the files that have the contig names
ls *.contigs > partcontiglist

### d split illumina-reads-mapped-to-genome.fasta.paf
###   into 99 parts (l/99 splits on line boundaries, so no PAF record is cut in half)
split -d -n l/99 ../illumina-reads-mapped-to-genome.fasta.paf

### e make a list of those parts
ls x* > illumina-reads-mapped-to-genome.fasta.paf-parts

### f read through the file containing a list of files with contig names for each split FASTA file, one line at a time
###   the part LC_ALL=C forces string comparison to be bytewise and thus faster
###   then filter the PAF file (which contains all contigs), keeping only the alignments to contigs
###   that are present in the given split FASTA file
###   The part with parallel uses GNU Parallel to spread the job across all cores of the machine.
###   This in essence makes a PAF file for each split made to genome.fasta,
###   but unlike mapping to each split of genome.fasta separately, it does not allow
###   reads to map to multiple contigs.
while read part;do
  export LC_ALL=C
  part2="$(echo $part |perl -pe "s/genome.part-(\d+).contigs/\1/g")"
  export part2
  while read i;do
    export i
    # note ">>" (append): each contig's alignments are added to the part's temporary PAF files;
    # ">" would keep only the last contig's alignments (delete stale part-*.paf files before re-running)
    parallel --no-notice 'grep -P "\t${i}\t" {} >> illumina-reads-mapped-to-genome.part-${part2}-{}.fasta.paf' < illumina-reads-mapped-to-genome.fasta.paf-parts
  done < $part
done < partcontiglist

### g concatenate the temporary files, then delete them
while read i;do
   cat illumina-reads-mapped-to-genome.part-${i}-???.fasta.paf > illumina-reads-mapped-to-genome.part-${i}.fasta.paf
   rm illumina-reads-mapped-to-genome.part-${i}-???.fasta.paf
done < samples

### h switch back to the default locale
export LC_ALL=en_GB.UTF-8

## Step 5 error correct/polish each split made to genome.fasta

#####beginning of run-racon-on-multiple-nodes.py#####
#! /usr/bin/env python

# SLURM cluster job submission in Python2
# "run-racon-on-multiple-nodes.py"
# By Jean P. Elbers
# jean.elbers@gmail.com
# Last modified 13 November 2018
###############################################################################
Usage = """

run-racon-on-multiple-nodes.py - version 1.0

Usage (execute following code in InDir):

python run-racon-on-multiple-nodes.py --samples samples

"""
###############################################################################
import os, sys, subprocess, re , argparse

class FullPaths(argparse.Action):
    """Expand user- and relative-paths"""
    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, os.path.abspath(os.path.expanduser(values)))

def is_dir(dirname):
    if not os.path.isdir(dirname):
        msg = "{0} is not a directory".format(dirname)
        raise argparse.ArgumentTypeError(msg)
    else:
        return dirname

def get_args():
    """Get arguments from CLI"""
    parser = argparse.ArgumentParser(
            description="""\nRuns racon on each sample in samples""")
    parser.add_argument(
            "--samples",
            required=True,
            action=FullPaths,
            help="""sample file in form sample1\nsample2\nsample3"""
        )
    return parser.parse_args()

def main():
    args = get_args()
    samples = args.samples
    InDir = os.getcwd()
    OutDir = InDir
    InFile = open(samples, 'r')
    rawdatalist = []
    for line in InFile:
        x = line.strip("\n").split("\t")
        rawdatalist.append(x)
    InFile.close()
    for item in rawdatalist:
        Sample = item[0]
        # Customize your options here
        JobName  = "racon-%s" % (Sample)
        Output   = "racon-stdout-%s" % (Sample)
        Error    = "racon-stderr-%s" % (Sample)
        Walltime = "72:00:00"
        Tasks = "1"
        Nodes = "1"
        CPUs = "24"
        Queue = "serial"
        Mem = "21G"
        Command = """
        module purge
        module load racon
        racon -t 24 ../illumina-reads.fastq \
        illumina-reads-mapped-to-genome.part-%s.fasta.paf \
        genome.part-%s.fasta > racon.illumina.part-%s.fasta""" % (Sample,Sample,Sample)

        JobString = """#!/bin/bash -l
#SBATCH -J %s
#SBATCH -o %s
#SBATCH -e %s
#SBATCH -t %s
#SBATCH -n %s
#SBATCH --nodes=%s
#SBATCH --mem-per-cpu=%s
#SBATCH --mem=%s

        cd %s
        %s\n""" % (JobName, Output, Error, Walltime, Tasks, Nodes, CPUs, Queue, Mem, InDir, Command)

        #Create pipe to sbatch
        proc = subprocess.Popen(['sbatch'], shell=True,
          stdin=subprocess.PIPE, stdout=subprocess.PIPE, close_fds=True)
        (child_stdout, child_stdin) = (proc.stdout, proc.stdin)

        #Print JobString
        jobname = proc.communicate(JobString)[0]
        print JobString
        print jobname

if __name__ == '__main__':
    main()
#####end of run-racon-on-multiple-nodes.py#####

# run in the directory with samples
python run-racon-on-multiple-nodes.py --samples samples
# or
python2 run-racon-on-multiple-nodes.py --samples samples

## Step 6 Combine the polished genome.fasta parts into a genome.fasta.polished.by.racon.fasta file
# in the directory with samples
cat racon.illumina.part-*.fasta > ../genome.fasta.polished.by.racon.fasta
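A quick sanity check after merging (a hypothetical addition; the demo files below stand in for ../genome.fasta and ../genome.fasta.polished.by.racon.fasta from the steps above) is to compare FASTA record counts before and after:

```shell
# demo stand-ins for the original assembly and the merged polished output
cd "$(mktemp -d)"
printf '>a\nAA\n>b\nCC\n' > genome.fasta
printf '>a\nAA\n' > genome.fasta.polished.by.racon.fasta

# the actual check: count FASTA records before and after polishing
before=$(grep -c '^>' genome.fasta)
after=$(grep -c '^>' genome.fasta.polished.by.racon.fasta)
echo "contigs before: $before, after polishing: $after"
```

Note that racon drops contigs it could not polish unless run with -u/--include-unpolished, so a somewhat lower count after polishing can be normal.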

pjm43 commented 5 years ago

Thanks! Very much appreciated.