These are all of my notes for the Bladderwort project from 08-28-2019 forward.
Summary:
In this study, we aim to identify insulator and terminator elements in the Utricularia gibba (bladderwort) genome. This organism is an exceptional model for CRE detection due to its extremely small genome size. The U. gibba genome will be PacBio sequenced and assembled. RNA-seq data will then be used to detect pairs of independently expressed genes in the same genotype. Intergenic regions between independently expressed gene pairs will subsequently be used for CRE detection. After CREs are detected, they will be validated by a collaborator. A phylogenetic analysis will then follow in which putative insulator and terminator elements will be gauged for conservation across angiosperms.
Goals:
Generate an Illumina assembly of our genotype of U. gibba. Confirm gene boundaries.
Use 3' RNA-Seq data to find putative CREs in the U. gibba genome.
Assess evolutionary conservation of insulator and terminator sequences across angiosperms.
Genome assembled with spades using Illumina data (MiSeq PE250).
Lots of junk assembled, only 28% of assembly contigs classified with Kraken. However, most of the unclassified contigs were very small.
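A minimal sketch of how the classified/unclassified split could be tallied, to check that the unclassified contigs are mostly small. It assumes Kraken's standard per-sequence output (tab-separated; column 1 is `C` or `U`, column 4 is the sequence length in bp) and that "classified" here means Kraken's `C` status. The function name and the example rows are hypothetical.

```python
def tally_kraken(lines):
    """Count contigs and summed length (bp) for classified (C) and unclassified (U).

    Assumes Kraken's standard per-sequence output: tab-separated, with the
    status in column 1 and the sequence length in column 4.
    """
    stats = {"C": {"n": 0, "bp": 0}, "U": {"n": 0, "bp": 0}}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] not in stats:
            continue  # skip blank or malformed rows
        stats[fields[0]]["n"] += 1
        stats[fields[0]]["bp"] += int(fields[3])
    return stats


def fraction_classified(stats):
    """Fraction of contigs with status C (0.0 if there are no contigs)."""
    total = stats["C"]["n"] + stats["U"]["n"]
    return stats["C"]["n"] / total if total else 0.0


# Hypothetical rows, for illustration only (not real project data).
example = [
    "C\tcontig_1\t1\t5000\t-",
    "U\tcontig_2\t0\t300\t-",
    "U\tcontig_3\t0\t250\t-",
]
example_stats = tally_kraken(example)
```

Comparing `stats["U"]["bp"]` with `stats["C"]["bp"]` shows how much sequence the unclassified contigs actually account for, which is more informative than the contig count alone.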
When the assembly is filtered for only contigs classified as U. gibba, its total size is approximately that of the previous Illumina assembly when only contigs larger than 2000 bp are used.
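The size comparison above can be sketched as follows: read contig lengths from a FASTA file and sum only the contigs longer than a cutoff. This is an illustrative sketch, not the commands actually used; the function names and the toy sequences are hypothetical.

```python
def contig_lengths(fasta_lines):
    """Return {contig_name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID, drop the description
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def total_size(lengths, longer_than=0):
    """Sum the lengths of contigs strictly longer than `longer_than` bp."""
    return sum(n for n in lengths.values() if n > longer_than)


# Toy input, for illustration only: one 2400 bp contig and one 400 bp contig.
toy_fasta = [">a some description", "ACGT" * 600, ">b", "ACGT" * 100]
toy_lengths = contig_lengths(toy_fasta)
```

With a real assembly, `contig_lengths(open(path))` would work the same way, and `total_size(lengths, 2000)` gives the size restricted to contigs larger than 2000 bp.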
Quast results:
- Since this assembly is not the main point of this project, and keeping small contigs in the assembly will not affect my final results, I will leave them in.
Summary of Quast results (spades assembly filtered for only _U. gibba_ scaffolds):
![Screenshot from 2019-09-06 13-45-56](https://user-images.githubusercontent.com/46690580/64448912-bddf6f80-d0ac-11e9-94c9-5e495dc454b4.png)
Summary of Quast results (published Illumina genome compared to published PacBio genome):
![Screenshot from 2019-09-06 13-47-26](https://user-images.githubusercontent.com/46690580/64448996-f2ebc200-d0ac-11e9-9305-b2eb77cb1f3f.png)
#### Summary:
- There are fewer complete genes in our assembly than in the original assembly. This is to be expected since the original "Illumina" assembly also included 454 reads. The incorporation of 454 reads would allow for a more contiguous assembly.
- There are also a lot of rearrangements (or misassemblies) in our genome relative to the Illumina reference when both are compared to the PacBio assembly. This isn't surprising, given that the PacBio assembly would include sequence missed by assembling with only Illumina data, so most of these rearrangements were probably introduced in silico and take the form of repeat expansions or contractions (the Assemblytics result confirms this):
![strVariant_counts](https://user-images.githubusercontent.com/46690580/64449740-9db0b000-d0ae-11e9-81f5-1502b4e371a2.png)
- There are probably enough genes / intergenic regions to get CRE candidates. Will proceed with annotation.
Using maker for genome annotation. First round is for evidence-based annotation of genes. For evidence, I used:
Proteomes of some asterid species: Actinidia chinensis, Daucus carota, Nicotiana attenuata, Helianthus annuus, and Solanum lycopersicum. All proteomes were downloaded from Ensembl Plants.
Assembled transcripts from whole mRNA-seq data. ~40k transcripts, each classified as U. gibba by Kraken, many are likely partial.
First round of maker annotation:
22,356 genes found. Around 6k more than what was projected by aligning to genes in the pacbio assembly. Promising!
Next step is training Snap and Augustus for gene prediction using this preliminary gene set. The training set is around 21,000 genes rather than the full set, because some genes could not be extracted using bedtools: adding the ±200 flank pushed the interval past the boundaries of the contig. We got most genes, so this should be a good enough training set.
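A minimal sketch of the boundary check that explains the dropped genes: pad each gene interval by the flank and keep it only if the padded interval still fits inside its contig. Coordinates are assumed to be 0-based, half-open (BED-style); the function names and the example genes are hypothetical, and this is not the bedtools command that was actually run.

```python
def padded_interval(start, end, contig_len, flank=200):
    """Pad a 0-based, half-open interval by `flank` on both sides.

    Returns (new_start, new_end), or None if the padded interval would
    extend past either end of the contig.
    """
    new_start = start - flank
    new_end = end + flank
    if new_start < 0 or new_end > contig_len:
        return None
    return new_start, new_end


def split_by_fit(genes, contig_lens, flank=200):
    """Partition (contig, start, end, gene_id) records into kept and dropped IDs."""
    kept, dropped = [], []
    for contig, start, end, gene_id in genes:
        if padded_interval(start, end, contig_lens[contig], flank) is None:
            dropped.append(gene_id)
        else:
            kept.append(gene_id)
    return kept, dropped


# Hypothetical genes, for illustration only.
example_lens = {"contig_1": 5000, "contig_2": 900}
example_genes = [
    ("contig_1", 1000, 2500, "gene_a"),  # fits with 200 bp on each side
    ("contig_1", 100, 800, "gene_b"),    # left flank runs off the contig start
    ("contig_2", 300, 800, "gene_c"),    # right flank runs off the contig end
]
kept_ids, dropped_ids = split_by_fit(example_genes, example_lens)
```

An alternative design would be to clip the flank at the contig edge instead of dropping the gene, which keeps more genes at the cost of uneven flank lengths in the training set.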