BrunoGrandePhD commented 7 years ago

Post your lay and scientific abstracts below. Aim for a maximum of 250 words for both. Thanks!

dbrazel commented 7 years ago

Project 4:

Pseudo-WGS variant calling for common cell types integrating NGS data from multiple assays

Carolyn Ch'ng, David Brazel, Karthigayini Sivaprakasam, Jill Moore, Shobhana Sekar, Stephen Kan, Jing Yun Alice Zhu, Ka-Kyung Kim, Luca Pinello

Recent progress in sequencing technologies has led to a massive production of epigenetic datasets for many cell lines, thanks mainly to large consortia like ENCODE. This resource has helped tremendously in understanding the role of non-coding variants and in prioritizing the search for putative causal SNPs through cell-type-specific functional regions such as active or repressed promoters, enhancers, insulators, and repressed or open chromatin. However, genotype information is still missing for many of the profiled cell types, which limits the power of allele-specific alignment and of downstream analyses that require the correct reference genome. To fill this gap, we propose a novel pipeline called baklavaWGS. This pipeline recovers genotype information such as SNPs for common cell types by aggregating ChIP-seq, RNA-seq, DHS and other sequencing data already produced and available from consortia like ENCODE and Roadmap Epigenomics. Although a single assay for a given cell line does not have enough coverage in many regions of the genome, aggregating sequencing information from all the available assays for each cell type provides enough power to recover variants with high confidence. We evaluated the quality of our approach using a common cell line (NA12878) for which benchmark data was available from the Genome in a Bottle project. This effort will provide VCF files for many cell types, ready for the community to use, by mining data that is already public.
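
The coverage argument above can be illustrated with a small sketch. This is not the baklavaWGS code; the function names, the toy depth values and the `min_depth=10` cutoff are all assumptions chosen for illustration. It only shows how positions that are too shallow in any single assay can pass a depth threshold once assays for the same cell type are pooled.

```python
from collections import Counter


def pooled_depth(per_assay_depth):
    """Sum per-position read depth across all assays of one cell type.

    per_assay_depth maps an assay name to a {position: depth} dict.
    """
    total = Counter()
    for depth in per_assay_depth.values():
        total.update(depth)  # Counter.update adds counts position by position
    return total


def callable_positions(total_depth, min_depth=10):
    """Return sorted positions whose pooled depth reaches min_depth.

    min_depth is a hypothetical cutoff, not a value from the project.
    """
    return sorted(pos for pos, d in total_depth.items() if d >= min_depth)


# Toy data (hypothetical): no single assay reaches depth 10 anywhere.
assays = {
    "ChIP-seq": {100: 4, 101: 6, 250: 2},
    "RNA-seq": {100: 5, 250: 3},
    "DHS": {100: 3, 101: 5, 300: 1},
}
total = pooled_depth(assays)
positions = callable_positions(total, min_depth=10)
print(positions)
```

In a real pipeline the depths would come from the aligned reads of each assay, and a variant caller would run on the merged alignments rather than on a simple depth table.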

dfornika commented 7 years ago

Project 9

Selection of tag SNPs for an African SNP array by LD and haplotype based methods

Ayton Meintjes, Scott Hazelhurst, Vincent Montoya, Marcia MacDonald, Jocelyn Lee, Dan Fornika, Brian Lee, Austin Reynolds, Tommy Carstensen

Developing a cost-efficient and representative genotype array with SNPs that provide good coverage across the African continent is key to conducting large-scale medical genetic studies in Africa. The great diversity across Africa has not yet been captured on any commercial SNP array. Significant new sequencing projects of African populations are generating rich sources of diversity, which can be used for chip design. However, none of the existing tag SNP selection algorithms for designing SNP arrays are geared towards handling WGS data efficiently. Our challenge was to (1) design an algorithm to quickly identify tag SNPs from WGS data and (2) apply this algorithm to a large African WGS dataset to produce a list of candidate SNPs for a commercial array. Our algorithm combines existing imputation methods with pairwise LD SNP tagging to identify candidate SNPs. It accepts standard Variant Call Format (VCF) files as input and produces VCF files as output, which aids integration into extended analysis pipelines. By developing an efficient tag SNP selection algorithm that accepts next-generation sequence data, we hope to facilitate the continued improvement of whole-genome SNP assays for genetically diverse populations as new sequence data becomes available.
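
For readers unfamiliar with pairwise LD tagging, here is a minimal sketch of the general idea. It is not the project's algorithm: the greedy selection strategy, the `threshold=0.8` value, the function names and the toy genotypes are assumptions, and the imputation step mentioned in the abstract is left out entirely. Each SNP is a list of genotype dosages (0, 1 or 2) across samples, and a SNP "tags" another when their squared correlation reaches the threshold.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def greedy_tag_snps(genotypes, threshold=0.8):
    """Greedily pick tag SNPs until every SNP is tagged.

    genotypes maps SNP id -> list of dosages; threshold is hypothetical.
    """
    snps = list(genotypes)
    # For every SNP, the set of SNPs it would tag (always including itself).
    covers = {
        s: {t for t in snps
            if s == t or r_squared(genotypes[s], genotypes[t]) >= threshold}
        for s in snps
    }
    untagged = set(snps)
    tags = []
    while untagged:
        # Pick the SNP that tags the most still-untagged SNPs.
        best = max(snps, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Toy genotypes (hypothetical): rs1, rs2 and rs3 are in perfect LD.
genotypes = {
    "rs1": [0, 1, 2, 0, 1, 2],
    "rs2": [0, 1, 2, 0, 1, 2],
    "rs3": [2, 1, 0, 2, 1, 0],
    "rs4": [0, 0, 1, 2, 1, 0],
}
tags = greedy_tag_snps(genotypes)
print(tags)
```

This all-pairs version is quadratic in the number of SNPs, so it would not scale to WGS data as written; restricting comparisons to a sliding window is one common way around that.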

mathbionerd commented 7 years ago

Project 6 XYalign: Hacking sex chromosome variation

Madeline Couse, Bruno Grande, Eric Karlins, Tanya Phung, Phillip Richmond, Timothy H. Webster, Whitney Whitford, Melissa A. Wilson Sayres

Sex chromosome copy number variations are currently estimated to be as common as 1/400 in the human population. Violations in typical ploidy will affect estimates of genome diversity and variation calling that is required in most clinical genomic studies. Further, mis-alignment of reads between the X and Y chromosomes will affect variant calling. Here we propose a new tool, XYalign, to quickly infer sex chromosome ploidy in NGS data (DNA and RNA), to remap reads based on inferred sex chromosome complement of the individual, and to output quality, depth, and allele-balance across the sex chromosomes.
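
One simple way to picture the ploidy inference step is a depth-ratio calculation, sketched below. This is an assumption about how such an inference could work, not a description of what XYalign does; the function name and the example depths are hypothetical. It assumes autosomes are diploid, so half the mean autosomal depth corresponds to one chromosome copy.

```python
def infer_sex_complement(auto_depth, x_depth, y_depth):
    """Guess the sex chromosome complement from mean read depths.

    Assumes diploid autosomes, so one chromosome copy contributes
    auto_depth / 2. Returns a label such as "XX", "XY", "X" or "XXY".
    """
    if auto_depth <= 0:
        raise ValueError("autosomal depth must be positive")
    x_copies = round(2 * x_depth / auto_depth)
    y_copies = round(2 * y_depth / auto_depth)
    return "X" * x_copies + "Y" * y_copies


# Hypothetical mean depths for four samples sequenced to ~30x.
print(infer_sex_complement(30.0, 15.2, 14.6))
print(infer_sex_complement(30.0, 29.5, 0.3))
print(infer_sex_complement(30.0, 30.5, 15.1))
print(infer_sex_complement(30.0, 14.8, 0.2))
```

Real data would need more care, for example restricting depth to regions where X and Y reads do not cross-map, which is exactly the mis-alignment problem the abstract describes.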

gracezheng commented 7 years ago

https://github.com/hackseq/2016_project_7/blob/master/README.md

amanjeev commented 7 years ago

https://github.com/hackseq/2016_project_8/blob/master/somatic/ABSTRACT.md

MikeSchnall commented 7 years ago

Team 8x: MetaGenius

Lay Summary

What do soil, ocean water, the forest floor and the human body have in common? They are all teeming with fungi, bacteria and viruses. Collectively, these microbes form communities that impact the environment and human health and disease in ways scientists are just beginning to understand. Studying these microbial communities has been challenging with traditional tools because it is frequently not possible to isolate and grow many of them in the lab. An emerging method is ‘shotgun metagenomics’, or the collective DNA sequencing of the microbes en masse, to jointly determine the set of species and genes present in a sample. While this provides useful data, there are substantial challenges with standard DNA sequencing technologies that limit the information obtained. Standard DNA sequencers can only determine short snippets (100s of DNA bases) of the genomes, making it impossible to know which parts of each microbial genome go together. Without this information, any understanding of the underlying sample is highly limited. We demonstrate the use of a new technology from 10x Genomics, Linked-Reads, which provides very long-range information (10,000s-100,000s of DNA bases) from short-read sequencing data. We show that the Linked-Read technology can be used to separate out the DNA sequences from multiple bacterial species, and put them back together in a manner that would not normally be possible. Methods that build on our prototype will allow for a far better understanding of microbial communities than would otherwise be possible.

Technical Summary

The analysis of short-read derived shotgun metagenomic data presents substantial challenges. While the reads can be assembled into short contiguous segments, the presence of homologous sequences within and among the genomes of the different species severely limits the ability to assemble these segments into anything close to full genomes. Here, we build a prototype assembler that uses 10x Genomics’ Linked-Read data to assemble shotgun metagenomic samples, and apply it to a dataset consisting of DNA derived from a mixture of 5 different bacterial species. Our assembler proceeds in multiple steps. First, it builds an initial assembly using the short-read data alone. Next, contigs from this initial assembly are extended in a barcode-aware manner, and the relative localization of the contigs on the genomes is inferred from the fraction of barcodes shared between two contigs. A representative set of contigs that are inferred to be well separated is then used to recruit reads by their barcodes and locally reassemble all reads from that region. This results in a set of contigs that are ~20-fold longer than the initial short-read derived contigs. These final contigs are scaffolded by both read-pairs and barcode information into very large scaffolds. Analysis of these scaffolds suggests that local barcode-based reassembly will be able to fill in contig breaks within each scaffold. Our prototype suggests that with Linked-Reads it should be possible to obtain highly complete genome assemblies from metagenomic samples.
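
The barcode-sharing step can be sketched as follows. This is not the prototype's code: the abstract does not say how the shared fraction is normalised, so the Jaccard index used here is an assumption, as are the function name and the toy barcodes. The intuition is that contigs lying close together on the same genome are covered by the same long molecules and therefore share many barcodes.

```python
from itertools import combinations


def shared_barcode_fraction(barcodes_a, barcodes_b):
    """Jaccard index of two barcode sets (assumed normalisation)."""
    if not barcodes_a or not barcodes_b:
        return 0.0
    return len(barcodes_a & barcodes_b) / len(barcodes_a | barcodes_b)


# Toy data (hypothetical): barcodes of reads mapping to each contig.
contig_barcodes = {
    "contig1": {"AAAC", "AAAG", "AACT", "AAGT"},
    "contig2": {"AAAG", "AACT", "AAGT", "ACCA"},
    "contig3": {"TTTG", "TTGA"},
}

# Score every contig pair; high scores suggest physical proximity.
pair_scores = {
    (a, b): shared_barcode_fraction(contig_barcodes[a], contig_barcodes[b])
    for a, b in combinations(sorted(contig_barcodes), 2)
}
for pair, score in pair_scores.items():
    print(pair, round(score, 2))
```

Pairs with a score near zero would be candidates for the "well separated" representative set, while high-scoring pairs are likely neighbours on the same genome.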

jmicrobe commented 7 years ago

Project 10: Metagenomic indicator contig predictor

General:

Next Generation Sequencing has redefined the practice of genomic science and allowed for the characterization of complex microbial communities. In this project, we combine the strengths of both “long-read” and “short-read” technologies to discriminate between different sets of microbiomes, independent of known genes. As test cases, we compare male versus female infants and vaginal versus caesarean births. We anticipate that this tool will be useful in extending approaches like human microbiome enterotyping, which have primarily relied on marker-gene data.

Scientific:

With the advent of affordable sequencing technology, there has been a breakthrough in microbiome diversity studies in recent years. Many of these rely solely on limited and often biased 16S rRNA amplicons. With the increase in publicly available whole-metagenome datasets, the scope of environmental classification can be broadened to include larger sequence contigs, including microbial “dark matter”. Here we present mICP, a novel strategy to predict indicator contigs for metagenomic datasets. To predict an initial set of indicator contigs, we used long reads (PacBio) from phenotypically similar (infant gut microbiome) datasets, mapped short reads (Illumina) from divisible phenotypic groups (male vs female and vaginal vs caesarean birth) to these ‘contigs’, and set thresholds for coverage continuity and depth to define indicator contigs. W

BrunoGrandePhD commented 7 years ago

@sjackman and @mbelmadani: Could you post your abstracts when you have a chance? Thanks!

ababaian commented 7 years ago

Little update: Hope everyone is doing well =D

We're compiling the abstracts/ information about hackseq16 onto the website and a meeting report. Please read over your project summary/abstract and ensure everything is correct (~5 minutes)

Projects page: http://www.hackseq.com/projects16

hackseq16 Summary page: http://www.hackseq.com/hackseq16/

Also: @mike10x and @sjackman and @jmicrobe / Ben. Can you please provide names for the participants?

sjackman commented 7 years ago

@ababaian Project 2: ParetoParrot. Craig Glastonbury, Daisie Huang, Hamid Younesy, Jasleen Grewal, Laura Gutierrez Funderburk, Lisa Bang, Shaun Jackman, Veera Manikandan Rajagopal, Y. Brian Lee

hackseq / October_2016

Lay and Scientific Abstracts #76
