maickrau / ribotin

MIT License
30 stars 2 forks source link

ribotin

rDNA consensus sequence builder. Input hifi or duplex, and optionally ultralong ONT. Extracts rDNA-specific reads based on k-mer matches to a reference rDNA sequence or based on a verkko or hifiasm assembly, builds a DBG out of them, extracts the most covered path as a consensus and bubbles as variants. Optionally assembles highly abundant rDNA morphs using the ultralong ONT reads.

Installation

Installation with conda is recommended. Use conda install ribotin.

Compilation

Also needs MBG version 1.0.13 or more recent, GraphAligner and Liftoff.

Usage

Quick start with CHM13 test dataset:

Download the test dataset. Unzip with tar -xzf ribotin_testdata.tar.gz and then run ribotin-ref -x human -i data/hifi_reads.fa --nano data/ont_reads.fa -o output. This will run ribotin-ref on the test dataset and store the results in the folder output. The assembled morphs will be in output/morphs.fa which describes the morph sequences as well as their ONT coverages.

The dataset also has example results with ribotin version 1.2 in result_v1.2 and instructions for replicating them in README.

Reference based:
bin/ribotin-ref -x human -i hifi_reads1.fa -i hifi_reads2.fq.gz --nano ont_reads.fa -o output_folder

This extracts rDNA-specific reads based on k-mer matches to human rDNA, builds a graph and a consensus, and finds variants supported by at least 3 reads. --nano is optional, if it is present then ribotin also builds morph consensuses. Results are written to output_folder.

Verkko based (automatic):

First you must run a whole genome assembly with verkko. Then run:

bin/ribotin-verkko -x human -i /path/to/verkko/assembly -o output_folder_prefix

The folder in parameter -i should be the same folder as verkko's parameter -d. This finds the rDNA tangles based on k-mer matches to human rDNA and assembly graph topology, extracts HiFi reads uniquely assigned to each tangle, and for each tangle builds a graph and a consensus and finds variants supported by at least 3 reads and builds morph consensuses. Results are written per tangle to output_folder_prefix[x] where [x] is the tangle number.

Verkko based (manual):

First you must run a whole genome assembly with verkko. Then manually pick the nodes in each rDNA tangle from assembly.homopolymer-compressed.noseq.gfa. Each tangle should only have the rDNA tangle nodes, and not the surrounding nodes. Save the tangle nodes to files with one tangle per file eg node_tangle1.txt, node_tangle2.txt, node_tangle3.txt. Format of the node tangle files should be eg utig4-1 utig4-2 utig4-3... or utig4-1, utig4-2, utig4-3... or each node in its own line. Then run:

bin/ribotin-verkko -x human -i /path/to/verkko/assembly -o output_folder_prefix -c node_tangle1.txt -c node_tangle2.txt -c node_tangle3.txt

This extracts HiFi reads uniquely assigned to each node tangle, and for each tangle builds a graph and a consensus and finds variants supported by at least 3 reads and build morph consensuses. Results are written per tangle to output_folder_prefix[x] where [x] is the tangle number.

Hifiasm based (automatic):

First you must run a whole genome assembly with hifiasm. Then run:

bin/ribotin-hifiasm -x human -a /path/to/hifiasm/assembly_prefix -o output_folder_prefix -i hifi_reads.fa --nano ont_reads.fa

The prefix in parameter -a should be the same prefix as hifiasm's parameter -o, and the reads in -i and --nano should be the same reads that were given to hifiasm. This finds the rDNA tangles based on k-mer matches to human rDNA and assembly graph topology, extracts HiFi reads uniquely assigned to each tangle, and for each tangle builds a graph and a consensus and finds variants supported by at least 3 reads and builds morph consensuses. Results are written per tangle to output_folder_prefix[x] where [x] is the tangle number.

Hifiasm based (manual):

First you must run a whole genome assembly with hifiasm. Then manually pick the nodes in each rDNA tangle from <assembly>.bp.r_utg.noseq.gfa. Each tangle should only have the rDNA tangle nodes, and not the surrounding nodes. Save the tangle nodes to files with one tangle per file eg node_tangle1.txt, node_tangle2.txt, node_tangle3.txt. Format of the node tangle files should be eg utig4-1 utig4-2 utig4-3... or utig4-1, utig4-2, utig4-3... or each node in its own line. Then run:

bin/ribotin-hifiasm -x human -a /path/to/hifiasm/assembly_prefix -o output_folder_prefix -i hifi_reads.fa --nano ont_reads.fa -c node_tangle1.txt -c node_tangle2.txt -c node_tangle3.txt

This extracts HiFi reads uniquely assigned to each node tangle, and for each tangle builds a graph and a consensus and finds variants supported by at least 3 reads and build morph consensuses. Results are written per tangle to output_folder_prefix[x] where [x] is the tangle number.

Nonhumans

For running ribotin-ref on nonhumans replace -x human with --approx-morphsize <morphsize> -r path_to_example_morphs.fa where <morphsize> is the estimated size of a single morph (45000 for human) and path_to_example_morphs.fa is a fasta/fastq file which contains (partial or complete) example morphs from the same sample or species. It does not matter if the example morphs are complete or fragments as long as most rDNA k-mers are present. You can get partial reference morphs by doing a whole genome assembly with hifi reads using MBG or a similar hifi based assembly tool, and extracting the sequences of the rDNA tangle from the assembly.

For ribotin-verkko and ribotin-hifiasm, replace -x human with either --approx-morphsize <morphsize> --guess-tangles-using-reference path_to_example_morphs.fa if you have an example morphs file, or with --approx-morphsize <morphsize> -c tangle1.txt -c tangle2.txt if you have manually located rDNA tangles from a verkko or hifiasm assembly, where tangle1.txt tangle2.txt etc. are files with manually selected rDNA tangle nodes from the verkko or hifiasm assembly with every tangle in a separate file.

If you additionally have one complete morph from the same or related species, you can also include --orient-by-reference previous_reference_single_morph.fa to have the results in the same orientation (forward / reverse complement) and offset (rotation) as the previous reference. This file should contain exactly one complete morph.

Species with short rDNA morphs

If your species has short rDNA morph size (about 10000bp per morph) you can use the HiFi reads to get very accurate morphs. In this case input the HiFi reads as the nanopore reads as well and adjust the clustering parameters: -i hifi_reads.fa --nano hifi_reads.fa --morph-cluster-maxedit 10 --morph-recluster-minedit 1 --approx-morphsize <morphsize> -r path_to_reference_kmers.fa. This will resolve morphs with very small differences.

Clustering morphs with ultralong ONT reads

If you have ultralong ONT reads, you can include them to produce consensuses of highly abundant rDNA morphs similar to the CHM13 assembly. For ribotin-ref and ribotin-hifiasm, add the parameter --nano /path/to/ont/reads.fa (multiple files may be added with --nano file1.fa --nano file2.fa etc). ribotin-verkko will automatically check if ONT reads were used in the assembly and use them, and can be overrode with --do-ul=no. This will error correct the ultralong ONT reads by aligning them to the allele graph, extract rDNA morphs from the corrected reads, cluster them based on sequence similarity, and compute a consensus for each cluster. This requires GraphAligner to be installed.

Annotations

You can lift over annotations with the optional parameters --annotation-reference-fasta and --annotation-gff3, for example bin/ribotin-verkko ... --annotation-reference-fasta template_seqs/rDNA_one_unit.fasta --annotation-gff3 template_seqs/rDNA_annotation.gff3. This requires liftoff to be installed. The parameter preset -x human includes annotations from the KY962518.1 rDNA reference.

Output

The output folder will contain several files:

The following files are created when ultralong ONT reads are included: