isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License
271 stars 49 forks source link

racon pipeline #69

Open mictadlo opened 6 years ago

mictadlo commented 6 years ago

Hi, last year I ran the below pipeline. In the meantime, minimap2 came out, racon supports FASTA files now and racon_wrapper.py has been developed.

For my new dataset, I have only FASTA files and just wondering whether there is a better way to run Racon pipeline?

minimap -Sw5 -L100 -m0 \
        -t 20 \
        /work/waterhouse_team/plant/contamination-free-pacbio/RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fastq \
        /work/waterhouse_team/plant/contamination-free-pacbio/RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fastq \
        | gzip -1 > plant-gt-10k/plant.paf.gz

miniasm -Rc2 -f /work/waterhouse_team/plant/contamination-free-pacbio/RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fastq \
        plant-gt-10k/plant.paf.gz \
        > plant-gt-10k/plant.gfa

awk '/^S/{print ">"$2"\n"$3}' plant-gt-10k/plant.gfa | fold >  plant-gt-10k/plant.gfa.fasta

# Correction 1

minimap -t 20 plant-gt-10k/plant.gfa.fasta \
        /work/waterhouse_team/plant/contamination-free-pacbio/RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fastq \
        > plant-gt-10k/plant.gfa1.paf

./bin/racon --bq -1 -t 20 \
        /work/waterhouse_team/plant/contamination-free-pacbio/RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fastq \
        plant-gt-10k/plant.gfa1.paf \
        plant-gt-10k/plant.gfa \
        plant-gt-10k/plant.racon1.fasta

# Correction 2

minimap -t 20 plant-gt-10k/plant.racon1.fasta \
        /work/waterhouse_team/plant/contamination-free-pacbio/RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fastq \
        > plant-gt-10k/plant.gfa2.paf

./bin/racon --bq -1 -t 20 \
        /work/waterhouse_team/plant/contamination-free-pacbio/RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fastq \
        plant-gt-10k/plant.gfa2.paf \
        plant-gt-10k/plant.racon1.fasta \
        plant-gt-10k/plant.racon2.fasta

Thank you in advance,

Michal

rvaser commented 6 years ago

Hi Michal, as I can see you are using reads in FASTQ format? If you created them with artificial qualities, you can avoid this step and pass them in FASTA format without the option --bq -1. The rest is fine, although you can use minimap2 with just parameter -x ava-pb when overlapping and only -x map-pb when mapping to reference.

Best regards, Robert

rvaser commented 6 years ago

You can as well add an Illumina polishing step (if you have the data). Use minimap2 with -x sr option and racon with default arguments.

mictadlo commented 6 years ago

Thank you. Do you think the new pipeline make sense:

minimap2 -t 20 -x ava-pb RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta | gzip -1 > plant-gt-10k/plant.paf.gz

miniasm -Rc2 -f RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta \
        plant-gt-10k/plant.paf.gz \
        > plant-gt-10k/plant.gfa

awk '/^S/{print ">"$2"\n"$3}' plant-gt-10k/plant.gfa | fold >  plant-gt-10k/plant.gfa.fasta

# Correction 1

minimap2 -t 20 -ax map-pb plant-gt-10k/plant.gfa.fasta \
        RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta \
        > plant-gt-10k/plant.gfa1.sam

./bin/racon -t 20 \
        RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta \
        plant-gt-10k/plant.gfa1.sam \
        plant-gt-10k/plant.gfa \
        plant-gt-10k/plant.racon1.fasta

# Correction 2

minimap2 -t 20 -ax map-pb plant-gt-10k/plant.racon1.fasta \
        RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta\
        > plant-gt-10k/plant.gfa2.sam

./bin/racon -t 20 \
        RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta \
        plant-gt-10k/plant.gfa2.sam \
        plant-gt-10k/plant.racon1.fasta \
        plant-gt-10k/plant.racon2.fasta

When should I use racon_wrapper.py?

Thank you in advance.

Michal

rvaser commented 6 years ago

Use the wrapper when your data won't fit into memory. The interface is the same plus two new options: --split and --subsample (check the usage).

Edits to script: pass the reads to both arguments for all-vs-all overlaps in minimap2; pass the backbone in FASTA/FASTQ format to racon; as the fourth argument was dropped from racon, the consensus is written to stdout. The pipeline now looks like this:

minimap2 -t 20 -x ava-pb \
    RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta \
    RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta | gzip -1 > plant-gt-10k/plant.paf.gz

miniasm -Rc2 -f \
    RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta \
    plant-gt-10k/plant.paf.gz > plant-gt-10k/plant.gfa

awk '/^S/{print ">"$2"\n"$3}' plant-gt-10k/plant.gfa | fold >  plant-gt-10k/plant.gfa.fasta

# Correction 1
minimap2 -t 20 -ax map-pb \
    plant-gt-10k/plant.gfa.fasta \
    RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta > plant-gt-10k/plant.gfa1.sam

./bin/racon -t 20 \
    RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta \
    plant-gt-10k/plant.gfa1.sam \
    plant-gt-10k/plant.gfa.fasta > plant-gt-10k/plant.racon1.fasta

# Correction 2
minimap2 -t 20 -ax map-pb \
    plant-gt-10k/plant.racon1.fasta \
    RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta > plant-gt-10k/plant.gfa2.sam

./bin/racon -t 20 \
    RSnSeQ_110plant-rm-clean-rmsmrtbell-gt-10k-unique.fasta \
    plant-gt-10k/plant.gfa2.sam \
    plant-gt-10k/plant.racon1.fasta > plant-gt-10k/plant.racon2.fasta