Portulaca amilis genome annotation: genomics of carbon concentrating mechanisms

Introduction

This is a final project for the Comparative Genomics seminar in the spring of 2019. The main goal of this project is to produce a high-quality annotation of the recently sequenced Portulaca amilis genome. Secondary goals include characterizing the expansion of key photosynthetic gene families (e.g. PEPC) and comparing our genome in general, and these gene families in particular, to those of other C3, C4, and CAM taxa.

Goals

Annotate the P. amilis genome
Characterize PEPC gene family expansion
Compare PEPC gene family expansion with publicly available C3, C4, and CAM genomes

The data

P. amilis genome
- Source: Edwards Lab, sequenced and assembled by Dovetail Genomics
- Format: .fasta
- Status: ~408Mbp assembled into 9 scaffolds representing the 9 P. amilis chromosomes
P. amilis transcriptome
- Source: reference ID ERR2040261
- Format: .fasta
- Status: unassembled raw reads
PEPC alignments
- Source: Edwards Lab and 1KP Initiative
- Format: .fasta
- Status: alignment of hundreds of ppc contigs from many species representing many paralogs
Genome size and quality metrics for other angiosperms
- Source: Zhao and Schranz (2019) supplementary material
- Format: .csv
- Status: NA

Methods

For a step-by-step walkthrough, refer to the wiki. Here is a brief outline of the methodology.

Sequencing and assembly by Dovetail Genomics
Quality control using QUAST, gVolante, and BUSCO
Evidence gathering
- Transcriptome assembled with genome-guided Trinity
- Proteomes from Beta vulgaris and Arabidopsis thaliana
- Coding sequences extracted from assembled transcripts using TransDecoder with Beta vulgaris and Arabidopsis thaliana proteomes
- Repeat libraries extracted using RepeatModeler and masked with RepeatMasker
Initial genome annotation with MAKER
Train ab initio gene predictors SNAP and Augustus
Genome annotation with MAKER and ab initio gene prediction
Iterate training gene predictors and annotation until stabilization
Infer homology of final gene models
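
The outline above does not say how "stabilization" is judged. As a minimal sketch, assuming it means that a summary metric such as the number of predicted gene models stops changing by more than a small relative amount between rounds, the stopping rule could look like the following. The function name, the 1% tolerance, and the gene counts are all hypothetical and are not taken from this project.

```python
def has_stabilized(metric_by_round, rel_tol=0.01):
    """Return True when the last two rounds differ by at most rel_tol (relative).

    metric_by_round: one number per annotation round, e.g. predicted gene count.
    rel_tol: hypothetical tolerance; 0.01 means a change of at most 1%.
    """
    if len(metric_by_round) < 2:
        return False  # need at least two rounds to compare
    previous, current = metric_by_round[-2], metric_by_round[-1]
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) <= rel_tol


# Made-up gene counts for three rounds, for illustration only.
gene_counts = [20000, 23000, 23100]
after_round_two = has_stabilized(gene_counts[:2])  # 15% change
after_round_three = has_stabilized(gene_counts)    # about 0.4% change
print(after_round_two, after_round_three)
```

Other metrics (for example BUSCO completeness of the annotation) could be passed to the same function; which one is appropriate is a judgment call.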

Results

The main result of this project is the annotated genome itself, which is still a work in progress. So far I have finished the initial annotation, trained the gene predictors, and begun reannotating with them. Along the way I have recorded a number of important statistics about the size, completeness, and content of the P. amilis genome.

Size distribution

Total length (nt)	403885173
Longest sequence (nt)	53436919
Shortest sequence (nt)	1000
Mean sequence length (nt)	99651
Median sequence length (nt)	1476
N50 sequence length (nt)	42597560
L50 sequence count	5
Number of sequences > 1K (nt)	4046 (99.8% of total number)
Number of sequences > 10K (nt)	32 (0.8% of total number)
Number of sequences > 100K (nt)	13 (0.3% of total number)
Number of sequences > 1M (nt)	9 (0.2% of total number)
Number of sequences > 10M (nt)	9 (0.2% of total number)
Sum length of sequences > 1M (nt)	395389203 (97.9% of total length)
Sum length of sequences > 10M (nt)	395389203 (97.9% of total length)
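
For readers unfamiliar with the N50 and L50 rows, here is a minimal sketch of how they are conventionally computed from a list of sequence lengths: sort the lengths from longest to shortest and walk down until the running sum reaches half of the total. The toy lengths are illustrative only and are not the P. amilis scaffolds; the statistics in the table come from the quality control tools listed in the Methods.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    N50 is the length of the sequence at which the running sum of
    lengths, taken longest first, reaches half of the total length.
    L50 is the number of sequences needed to get there.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy example: total length is 150, so half is 75.
toy_lengths = [50, 40, 30, 20, 10]
toy_n50, toy_l50 = n50_l50(toy_lengths)
print(toy_n50, toy_l50)
```

An L50 of 5 with nine chromosome-scale scaffolds, as in the table above, means the five longest scaffolds already hold at least half of the assembly.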

Genome completeness

My first annotation with MAKER recovered most of the BUSCOs found in the raw scaffolds (1,031 complete versus 1,291). I expected the number of duplicated genes to be higher in this annotation because we retained isoforms for many genes from the transcriptome analysis during annotation.

BUSCO	Initial `MAKER`	Input scaffolds
Complete BUSCOs	1031 (71.6%)	1291 (89.7%)
Complete and single-copy BUSCOs	814 (56.5%)	1228 (85.3%)
Complete and duplicated BUSCOs	217 (15.1%)	63 (4.4%)
Fragmented BUSCOs	158 (11.0%)	29 (2.0%)
Missing BUSCOs	251 (17.4%)	120 (8.3%)
Total BUSCO groups searched	1440	1440
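
The percentages in the table follow directly from the counts and the 1,440 groups searched. A short sketch that recomputes them for the input scaffolds and checks that the categories add up (the dictionary keys and function name are my own labels, not BUSCO output fields):

```python
def busco_percentages(counts, total):
    """Convert BUSCO category counts to percentages of the groups searched."""
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


TOTAL_GROUPS = 1440

# Counts for the input scaffolds, copied from the table above.
scaffold_counts = {
    "complete": 1291,
    "single_copy": 1228,
    "duplicated": 63,
    "fragmented": 29,
    "missing": 120,
}

# Sanity checks: single-copy + duplicated = complete,
# and complete + fragmented + missing = total groups searched.
parts_match = (
    scaffold_counts["single_copy"] + scaffold_counts["duplicated"]
    == scaffold_counts["complete"]
)
total_match = (
    scaffold_counts["complete"]
    + scaffold_counts["fragmented"]
    + scaffold_counts["missing"]
    == TOTAL_GROUPS
)

scaffold_pct = busco_percentages(scaffold_counts, TOTAL_GROUPS)
print(scaffold_pct, parts_match, total_match)
```

The same two checks hold for the initial MAKER column (814 + 217 = 1031 and 1031 + 158 + 251 = 1440).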

I also compiled data on genome size and completeness from Zhao and Schranz (2019) to see how our assembly compares with other angiosperm assemblies. The P. amilis genome is among the highest quality (at least in terms of N50 and BUSCO completeness) of publicly available angiosperm genomes.

Genome content

After the initial MAKER analysis I recovered 23,893 predicted genes with a mean length of 3,661.38 bp. I estimated that repetitive elements make up ~46% of the genome.
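
Gene count and mean gene length can be read off a GFF3 annotation file. The sketch below is one way to do it, assuming gene models appear as features with type `gene` in column 3 and 1-based inclusive start and end coordinates in columns 4 and 5, as the GFF3 format specifies. It is not necessarily how the numbers above were produced, and the toy records and IDs are invented for illustration.

```python
import io


def gene_length_stats(gff_lines):
    """Return (gene_count, mean_gene_length) from an iterable of GFF3 lines."""
    lengths = []
    for line in gff_lines:
        # Skip blank lines, comments, and directives such as ##gff-version.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Feature lines have 9 tab-separated columns; keep only gene features.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)  # coordinates are inclusive
    if not lengths:
        return 0, 0.0
    return len(lengths), sum(lengths) / len(lengths)


# Toy GFF3 with two invented genes of 1,000 and 3,000 bp.
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "scaffold_1\tmaker\tgene\t101\t1100\t.\t+\t.\tID=geneA\n"
    "scaffold_1\tmaker\tmRNA\t101\t1100\t.\t+\t.\tID=geneA-mRNA-1;Parent=geneA\n"
    "scaffold_2\tmaker\tgene\t501\t3500\t.\t-\t.\tID=geneB\n"
)
gene_count, mean_length = gene_length_stats(toy_gff)
print(gene_count, mean_length)
```

With a real annotation, passing an open file handle instead of the `io.StringIO` object works the same way, since the function only iterates over lines.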

Next steps

The next computational steps in annotating the Portulaca amilis genome include:

Continuing the annotation process with MAKER
Visualizing the genome (e.g. with Circos) and the gene models with a genome browser
Trying another annotation pipeline called funannotate with some additional/updated methodology
- Purging haplotigs
- Updated repeat analysis with Dfam 3.0, which was released after my repeat analysis
- Generating transcriptome evidence with PASA
- Adding GeneMark-ET, another ab initio gene predictor

In addition to expanding my computational methodology, I am also in the process of generating more molecular data to bolster our annotation. In particular, I am generating transcriptomes from multiple tissue types of P. amilis under normal environmental conditions that should increase the completeness of our genome annotation. I am also generating long-read transcriptomes for normal and drought-stressed leaves, to understand the full diversity of photosynthesis-related transcripts.

References

Matasci N., Hung L.-H., Yan Z., Carpenter E.J., Wickett N.J., Mirarab S., Nguyen N., Warnow T., Ayyampalayam S., Barker M., Burleigh J.G., Gitzendanner M.A., Wafula E., Der J.P., dePamphilis C.W., Roure B., Philippe H., Ruhfel B.R., Miles N.W., Graham S.W., Mathews S., Surek B., Melkonian M., Soltis D.E., Soltis P.S., Rothfels C., Pokorny L., Shaw J.A., DeGironimo L., Stevenson D.W., Villarreal J.C., Chen T., Kutchan T.M., Rolf M., Baucom R.S., Deyholos M.K., Samudrala R., Tian Z., Wu X., Sun X., Zhang Y., Wang J., Leebens-Mack J., Wong G.K.-S. 2014. Data access for the 1,000 Plants (1KP) project. Gigascience. 3:17.
Zhao T., Schranz M.E. 2019. Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. Proc. Natl. Acad. Sci. U.S.A. 116:2165–2174.
Moore A.J., de Vos J.M., Hancock L.P., Goolsby E., Edwards E.J. 2018. Targeted Enrichment of Large Gene Families for Phylogenetic Inference: Phylogeny and Molecular Evolution of Photosynthesis Genes in the Portullugo Clade (Caryophyllales). Syst. Biol. 67:367–383.
Wang B., Tseng E., Regulski M., Clark T.A., Hon T., Jiao Y., Lu Z., Olson A., Stein J.C., Ware D. 2016. Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nature Communications. 7:1–13.
