isgilman / Portulaca-amilis-genome

Portulaca amilis genome annotation: genomics of carbon concentrating mechanisms

Introduction

This is a final project for the Comparative Genomics seminar in the spring of 2019. The main goal of this project is to produce a high-quality annotation of the recently sequenced Portulaca amilis genome. Secondary goals include characterizing gene family expansion of key photosynthetic genes (e.g. PEPC) and comparing our genome in general, and these gene families in particular, to those of other C3, C4, and CAM taxa.

Goals

  1. Annotate the P. amilis genome
  2. Characterize PEPC gene family expansion
  3. Compare PEPC gene family expansion to publicly available C3, C4, and CAM genomes

The data

  1. P. amilis genome
    • Source: Edwards Lab, sequenced and assembled by Dovetail Genomics
    • Format: .fasta
    • Status: ~408Mbp assembled into 9 scaffolds representing the 9 P. amilis chromosomes
  2. P. amilis transcriptome
    • Source: reference ID ERR2040261
    • Format: .fasta
    • Status: unassembled raw reads
  3. PEPC alignments
    • Source: Edwards Lab and 1KP Initiative
    • Format: .fasta
    • Status: alignment of hundreds of ppc contigs from many species representing many paralogs
  4. Genome size and quality metrics for other angiosperms

Methods

For a step-by-step walkthrough, refer to the wiki. Here is a brief outline of the methodology.

  1. Sequencing and assembly by Dovetail Genomics
  2. Quality control using QUAST, gVolante, and BUSCO
  3. Evidence gathering
    • Transcriptome assembled with genome-guided Trinity
    • Proteomes from Beta vulgaris and Arabidopsis thaliana
    • Coding sequences extracted from assembled transcripts using TransDecoder with Beta vulgaris and Arabidopsis thaliana proteomes
    • Repeat libraries extracted using RepeatModeler and masked with RepeatMasker
  4. Initial genome annotation with MAKER
  5. Train ab initio gene predictors SNAP and Augustus
  6. Genome annotation with MAKER and ab initio gene prediction
  7. Iterate training gene predictors and annotation until stabilization
  8. Infer homology of final gene models
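The train-and-reannotate loop in steps 5 to 7 can be sketched as a simple convergence check. This is a minimal sketch under assumptions, not the procedure used in this project: the stopping rule (relative change in a summary metric such as the number of gene models), the `tolerance` and `max_rounds` defaults, and the `run_round` callback are all hypothetical. In practice `run_round` would wrap one round of SNAP/Augustus training followed by a MAKER run.

```python
from typing import Callable, List


def iterate_until_stable(
    run_round: Callable[[int], float],
    tolerance: float = 0.01,   # assumed threshold: stop below 1% relative change
    max_rounds: int = 5,       # assumed cap on training/annotation rounds
) -> List[float]:
    """Run annotation rounds until a summary metric stops changing.

    run_round(i) is a hypothetical callback that performs round i
    (train predictors, then reannotate) and returns one number
    summarizing the result, e.g. the count of predicted genes.
    """
    history: List[float] = []
    for round_index in range(max_rounds):
        metric = run_round(round_index)
        history.append(metric)
        if len(history) >= 2:
            previous = history[-2]
            # Skip the check when the previous value is zero to avoid
            # dividing by zero; otherwise compare the relative change.
            if previous != 0 and abs(metric - previous) / abs(previous) < tolerance:
                break
    return history


# Toy usage with made-up gene counts per round (not project data).
toy_counts = [100, 120, 121, 121]
history = iterate_until_stable(lambda i: toy_counts[i])
print(history)
```

Other stopping rules, such as change in BUSCO completeness or in annotation edit distance, would fit the same loop by changing what `run_round` returns.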

Results

The main result of the project is the annotated genome itself, which is currently a work in progress. At this time I have finished the initial annotation, trained gene model predictors, and begun reannotating with these predictors. Along the way I have recorded a number of important statistics about the size, completeness, and content of the P. amilis genome.

Size distribution

| Metric | Value |
| --- | --- |
| Total length (nt) | 403885173 |
| Longest sequence (nt) | 53436919 |
| Shortest sequence (nt) | 1000 |
| Mean sequence length (nt) | 99651 |
| Median sequence length (nt) | 1476 |
| N50 sequence length (nt) | 42597560 |
| L50 sequence count | 5 |
| Number of sequences > 1K (nt) | 4046 (99.8% of total number) |
| Number of sequences > 10K (nt) | 32 (0.8% of total number) |
| Number of sequences > 100K (nt) | 13 (0.3% of total number) |
| Number of sequences > 1M (nt) | 9 (0.2% of total number) |
| Number of sequences > 10M (nt) | 9 (0.2% of total number) |
| Sum length of sequences > 1M (nt) | 395389203 (97.9% of total length) |
| Sum length of sequences > 10M (nt) | 395389203 (97.9% of total length) |
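For readers unfamiliar with N50 and L50: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly, and L50 is the number of sequences in that set. The sketch below shows the standard calculation. The statistics in this section came from the quality-control tools listed in the methods, not from this code, and the function names are illustrative.

```python
from typing import Iterable, List, Tuple


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of each record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: the running sum always reaches half the total")


# Toy example (not project data): total length 30, half is 15,
# and the two longest sequences (10 + 8 = 18) cross that point.
print(n50_l50([10, 8, 6, 4, 2]))  # (8, 2)
```

To apply this to an assembly, the lengths could be read with something like `fasta_lengths(open(path))`, where `path` points to the scaffold FASTA file.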

Genome completeness

My first annotation with MAKER recovered a large fraction of the BUSCOs found in the raw scaffolds: 1031 complete BUSCOs, compared with 1291 in the input scaffolds. I expected the number of duplicated genes to be higher in this annotation because we retained isoforms for many genes from the transcriptome analysis during annotation.

| BUSCO category | Initial MAKER annotation | Input scaffolds |
| --- | --- | --- |
| Complete BUSCOs | 1031 (71.6%) | 1291 (89.7%) |
| Complete and single-copy BUSCOs | 814 (56.5%) | 1228 (85.3%) |
| Complete and duplicated BUSCOs | 217 (15.1%) | 63 (4.4%) |
| Fragmented BUSCOs | 158 (11.0%) | 29 (2.0%) |
| Missing BUSCOs | 251 (17.4%) | 120 (8.3%) |
| Total BUSCO groups searched | 1440 | 1440 |
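The percentages in the BUSCO table follow directly from the counts: complete BUSCOs are the sum of the single-copy and duplicated categories, and complete, fragmented, and missing BUSCOs add up to the total searched. A small sketch of that arithmetic, using the counts reported above (the function name is illustrative and not part of BUSCO):

```python
from typing import Dict, Tuple


def busco_summary(
    single: int, duplicated: int, fragmented: int, missing: int
) -> Tuple[int, Dict[str, float]]:
    """Return the total BUSCO groups and the percentage in each category."""
    complete = single + duplicated
    total = complete + fragmented + missing
    counts = {
        "complete": complete,
        "single_copy": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, percentages


# Counts taken from the table above.
maker_total, maker_pct = busco_summary(single=814, duplicated=217, fragmented=158, missing=251)
scaffold_total, scaffold_pct = busco_summary(single=1228, duplicated=63, fragmented=29, missing=120)

print(maker_total, maker_pct)
print(scaffold_total, scaffold_pct)
```

Both columns sum to the 1440 groups searched, and the share of duplicated BUSCOs rises from 63 in the input scaffolds to 217 in the initial MAKER annotation, which is the pattern expected from retained isoforms.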

I also compiled data on genome size and completeness from Zhao and Schranz (2019) to see how our assembly compares to other angiosperm assemblies. The P. amilis genome is among the highest quality (at least in terms of N50 and BUSCO) of publicly available angiosperm genomes.

Genome content

After the initial MAKER analysis I recovered 23,893 predicted genes with a mean length of 3,661.38 bp. I estimated that repetitive elements make up ~46% of the genome.
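Gene count and mean gene length can be read off a GFF3 annotation, which is the format MAKER writes. The sketch below is one way to do it and is not necessarily how the figures above were computed. It assumes gene length means the genomic span of each `gene` feature (introns included), and the function name is illustrative.

```python
from typing import Iterable, Tuple


def gene_length_stats(gff_lines: Iterable[str]) -> Tuple[int, float]:
    """Return (gene count, mean gene length in bp) from GFF3 lines.

    Only features whose type column is "gene" are counted. GFF3
    coordinates are 1-based and inclusive, so length = end - start + 1.
    """
    lengths = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # skip non-gene features and any embedded FASTA lines
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)
    if not lengths:
        return 0, 0.0
    return len(lengths), sum(lengths) / len(lengths)


# Toy records (not project data): two genes of 100 bp and 300 bp, plus an mRNA.
toy_gff = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t1\t100\t.\t+\t.\tID=gene1",
    "chr1\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\tmaker\tgene\t201\t500\t.\t-\t.\tID=gene2",
]
print(gene_length_stats(toy_gff))  # (2, 200.0)
```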

Next steps

The next computational steps in annotating the Portulaca amilis genome include

In addition to expanding my computational methodology, I am also in the process of generating more molecular data to bolster our annotation. In particular, I am generating transcriptomes from multiple tissue types of P. amilis under normal environmental conditions that should increase the completeness of our genome annotation. I am also generating long-read transcriptomes for normal and drought-stressed leaves, to understand the full diversity of photosynthesis-related transcripts.

References