Falcon2Fastg

This software converts the results of a PacBio assembly produced with FALCON to a FASTG graph that can be visualized using Bandage.

Usage

python Falcon2Fastg.py [--only-output=reads|contigs]

Run this in the output directory of the FALCON assembly (2-asm-falcon). Before running, copy the preads4falcon.fasta file from the intermediate directory (1-preads_ovl) to the output directory (2-asm-falcon).
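The copy step can be scripted. The sketch below assumes that 1-preads_ovl and 2-asm-falcon sit side by side under a single run directory; the helper name stage_preads is hypothetical and is not part of Falcon2Fastg.

```python
import shutil
from pathlib import Path


def stage_preads(run_dir):
    """Copy preads4falcon.fasta from 1-preads_ovl into 2-asm-falcon.

    Assumes both directories sit directly under run_dir.
    Returns the path of the copied file.
    """
    run_dir = Path(run_dir)
    source = run_dir / "1-preads_ovl" / "preads4falcon.fasta"
    target_dir = run_dir / "2-asm-falcon"
    if not source.is_file():
        raise FileNotFoundError(f"missing {source}")
    if not target_dir.is_dir():
        raise FileNotFoundError(f"missing {target_dir}")
    target = target_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    return target
```

Calling stage_preads(".") from the FALCON run directory would place the file where Falcon2Fastg expects it, under the layout assumed above.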

Falcon2Fastg needs the following 6 input files:

Dependencies :

Biopython (available at http://biopython.org/wiki/Download)

pyfaidx (available at https://github.com/mdshw5/pyfaidx)

Quick installation of dependencies:

pip install biopython pyfaidx  # add --user if you don't have root
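To confirm that both dependencies are importable before starting a run, a small check like the following may help. It is only a sketch: the function name missing_modules is hypothetical, and the check only tests that the modules can be found, not that their versions are suitable.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Biopython is imported as "Bio"; pyfaidx is imported under its own name.
    missing = missing_modules(["Bio", "pyfaidx"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies found.")
```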

Output :

The output of the tool is two FASTG files (reads.fastg and contigs.fastg) that can be opened with Bandage.

Additionally, the tool produces a CSV file, ReadsInContigs.csv, that can be loaded into Bandage. It labels each read with the contig it is part of, along with its mapping position within the contig.

[Figure: sample Bandage visualization of a reads.fastg file]

Above is a sample Bandage visualization of a reads.fastg file generated by Falcon2Fastg from a FALCON assembly (a plant mitochondrial genome).

Zooming in on a smaller set of nodes shows the edges in black, connecting the colored nodes:

[Figure: zoomed-in view of colored nodes connected by black edges]

For benchmarking, Falcon2Fastg was run on the preads4falcon.fasta and sg_edges_list files produced from the E. coli test dataset provided with the FALCON install. Instructions for obtaining the dataset are here: https://github.com/PacificBiosciences/FALCON/wiki/Setup:-Complete-example

Execution of Falcon2Fastg took 2 minutes on a desktop computer (size of preads4falcon.fasta: 449 MB).

The figure below represents a visualization of this E. coli data.

[Figure: Bandage visualization of the E. coli data]

Contigs visualization

Falcon2Fastg can also be used to visualize the contigs produced by FALCON and the overlaps between them. The contig graph is written to contigs.fastg, which Falcon2Fastg outputs by default. To output only the reads graph, use the --only-output=reads parameter.
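When the tool is driven from another script, the choice between the two graphs can be wrapped in a small helper. The sketch below only uses the --only-output flag shown in the Usage section; the function names build_command and run_falcon2fastg are hypothetical, and the script path is assumed to be resolvable from the assembly directory (pass an absolute path otherwise).

```python
import subprocess
import sys


def build_command(only_output=None, script="Falcon2Fastg.py"):
    """Return the argument list for one Falcon2Fastg run.

    only_output may be None (default behaviour), "reads" or "contigs".
    """
    if only_output not in (None, "reads", "contigs"):
        raise ValueError("only_output must be None, 'reads' or 'contigs'")
    command = [sys.executable, script]
    if only_output is not None:
        command.append(f"--only-output={only_output}")
    return command


def run_falcon2fastg(asm_dir, only_output=None, script="Falcon2Fastg.py"):
    """Run Falcon2Fastg with asm_dir (for example 2-asm-falcon) as working directory."""
    subprocess.run(build_command(only_output, script), cwd=asm_dir, check=True)
```

For example, run_falcon2fastg("2-asm-falcon", only_output="reads") would request only the reads graph.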

To test this visualization mode, we assembled Drosophila melanogaster reads available at:
https://github.com/PacificBiosciences/DevNet/wiki/Drosophila-sequence-and-assembly

The input file (dmel_FALCON_preassembled_reads.fasta) was 2.2 GB in size.

FALCON assembly parameters were not optimized, and were as follows :

length_cutoff = 3000, length_cutoff_pr = 6000, overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20

The final p_ctg.fa file had 642 contigs with a total length of ~27 Mbp.

Execution of Falcon2Fastg took 5 minutes on a desktop computer (size of preads4falcon.fasta: 2.2 GB).

The figure below shows these D. mel. contigs (colors are random).

[Figure: Bandage visualization of the D. mel. contigs]

Read density (approximate read coverage)

Bandage can visualize k-mer coverage as reported by the assembler. Because FALCON is a string graph assembler, it does not report this information. Ideally, to compute the coverage of a contig, one would re-map the reads to the assembled contigs. Here, we report a simpler metric that is easy to compute from the FALCON output.

Read density is calculated as (sum of the lengths of all reads used by FALCON to construct the contig) / (length of the contig). We believe that variation in read density reflects variation in coverage.
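As an illustration of the formula, here is a minimal sketch. The function name read_density and the read lengths are made up for the example; this is not the code Falcon2Fastg itself uses.

```python
def read_density(read_lengths, contig_length):
    """Sum of the lengths of the reads in a contig, divided by the contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(read_lengths) / contig_length


# Made-up lengths in bp for two 10 kbp contigs.
sparse = read_density([8000, 6000, 6000], 10000)  # three reads
dense = read_density([10000] * 5, 10000)          # five reads
print(sparse, dense)  # prints: 2.0 5.0
```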

The figure below is a schematic of read density. The blue arrows represent reads that FALCON used to build each contig. The upper contig (black) has fewer reads within it; its read density is around 2.0. The lower contig (red) has more reads within it; its read density is around 5.0.

[Figure: schematic of read density for two contigs]

The figure below is the visualization of the same D. mel. contigs, colored by read density.

[Figure: D. mel. contigs colored by read density]

Zooming in shows that bright red represents higher read density (6.0x), while contigs colored black have a lower read density (2.0x).

[Figure: zoomed-in view of contigs colored by read density]

Memory Warning

The pyfaidx module is used to read an entire FASTA file into memory. If the size of your preads4falcon.fasta is greater than the amount of available RAM, it is advisable to run this computation on a server with greater available memory.
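A rough pre-flight check can compare the FASTA size with the memory currently available. This is a sketch under several assumptions: the function names are hypothetical, the headroom factor of 1.5 is an arbitrary safety margin rather than a measured value, and available_memory_bytes relies on sysconf names that exist on Linux but not on every platform.

```python
import os


def fits_in_memory(file_size, available_bytes, headroom=1.5):
    """True if file_size * headroom does not exceed available_bytes.

    headroom is an arbitrary safety factor, not a measured value.
    """
    return file_size * headroom <= available_bytes


def available_memory_bytes():
    """Available physical memory in bytes (sysconf names present on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_AVPHYS_PAGES")


if __name__ == "__main__":
    size = os.path.getsize("preads4falcon.fasta")
    if fits_in_memory(size, available_memory_bytes()):
        print("preads4falcon.fasta should fit in available memory")
    else:
        print("preads4falcon.fasta may not fit; consider a larger machine")
```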

Caveats :

Any large differences are mostly restricted to short contigs, when one very long read at either extremity can affect the length of the contig.

Testing :

Please see the test/ directory for a small example dataset and its output.

FALCON can be installed by following the instructions here: https://github.com/PacificBiosciences/FALCON/wiki/Setup:-Complete-example

Other tools

Additional tools for visualizing read overlap can be found in the utils directory. Please consult utils/README.md for details.

License

This content is released under MIT License. Please see LICENSE.md for details.

Authors

Primary author : Samarth Rangavittal, The Pennsylvania State University (szr165@psu.edu)

Rayan Chikhi, University of Lille 1

Jean-Stéphane Varré, University of Lille 1