hackseq / 2016_project_9

Selection of tag SNPs for an African SNP array by LD and haplotype based methods

Greetings! #1

Open tommycarstensen opened 7 years ago

tommycarstensen commented 7 years ago

Hi all!

I just wanted to introduce myself ahead of Vancouver. I work at the Wellcome Trust Sanger Institute. I'm currently involved in projects generating resources for medical genetics in Africa, such as SNP arrays, haplotype reference panels, transcriptomics resources and computational infrastructure.

Feel free to introduce yourselves :) 👍

Best wishes, Tommy

awreynolds commented 7 years ago

Hey everyone! I am a graduate student at the University of Texas at Austin. I am currently working with ancient and modern genomic datasets to address population history questions for Native American communities in the US and Mexico.

I have recently moved into the bioinformatics side of things, so I am looking forward to working with all of you.

-Austin

dfornika commented 7 years ago

Hi, my background is in molecular biology & medical genetics. I studied the population genetics of the aging process at the Genome Sciences Centre here in Vancouver. I'm currently working at Fusion Genomics where we're developing assays for infectious diseases.

I've done various small-scale analyses and programming tasks but have never worked on a group bioinformatics project like this. I'm really looking forward to this chance to collaborate and share our skills & ideas.

ameintjes commented 7 years ago

Hi all

I'm a software developer/bioinformatician at the University of Cape Town, South Africa. We're currently working on a chip design project within the H3Africa Consortium, so looking forward to meeting everyone and sharing ideas.

Ayton

shaze commented 7 years ago

Dear all

I work in the bioinformatics group at the Sydney Brenner Institute for Molecular Bioscience at the University of the Witwatersrand, Johannesburg. We are part of one of the H3Africa projects, AWI-Gen, exploring genetic and environmental factors in cardio-metabolic disorders in African populations. We’re collaborating on a consortium-wide project looking at population structure based on recently sequenced genomes that came out of H3A projects. I also teach a range of computing courses.

Scott

marciam commented 7 years ago

Hi!

I have a background in human genetics, though most of my work involved the relatively ancient techniques of positional cloning and homozygosity mapping in family studies. I currently work at Genome BC in Vancouver, overseeing all of our funded research projects. I've been trying to improve my skills and knowledge in population genomics since receiving my 23andMe results a few years ago and am expecting to receive my personal exome results for analysis any day now (through Genos Research). I've never participated in a hackathon before and my bioinformatics/coding skills are rusty, so if anyone has any pointers on the most useful ways to prepare for October, they'd be very welcome.

Cheers, Marcia

dfornika commented 7 years ago

if anyone has any pointers on the most useful ways to prepare for October, they'd be very welcome.

@marciam: Are you familiar with Codecademy? I'm not sure which programming language(s) or tools we'll be using for this project, but they've got a couple of lessons on the Linux command line and git that might be a good starting point:

I've done their Javascript and Python lessons, and I like the way that they're broken down into small pieces that can be picked up and worked on briefly.

sjackman commented 7 years ago

Also http://rik.smith-unna.com/command_line_bootcamp/ and https://try.github.io

tommycarstensen commented 7 years ago

I'm not sure which programming language(s) or tools we'll be using for this project

Can I suggest that we use Python? I've created a poll to make it democratic: http://doodle.com/poll/up9s7nk4n5v7c43e

tommycarstensen commented 7 years ago

It seems everyone has voted for Python 3. Let's go for that then?

Here is some suggested reading ahead of the Hackseq:

http://www.ncbi.nlm.nih.gov/pubmed/17827206 TAGster: efficient selection of LD tag SNPs in single or multiple populations. Bioinformatics, 2007, Xu et al.

http://www.ncbi.nlm.nih.gov/pubmed/21903159 Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics, 2011, Hoffmann et al.

http://www.ncbi.nlm.nih.gov/pubmed/25470054 The African Genome Variation Project shapes medical genetics in Africa. Nature, 2015, Gurdasani et al. Specifically Supplementary Note 13 on page 124 of the supplementary info: http://www.nature.com/nature/journal/v517/n7534/extref/nature13997-s1.pdf

I suggest we write some code, which does what TAGster does, except we make it applicable to whole genome sequence data from multiple populations and make it accept a white list and black list of SNPs and some list of "SNP scores" for deciding between two otherwise equally good tag SNPs. Does that sound like a good plan?

As input data I will try to get access to the phased sequence data variant calls from the African Genome Variation project, which covers approximately 300 samples from ethnic groups in Ethiopia, Uganda and South Africa. Otherwise we can use the 1000G data. If someone can contribute a white list (pre-selected SNPs) and/or a black list (SNPs that are not suitable to go on an array) and/or SNP scores (e.g. from a commercial vendor such as Illumina or Affymetrix) that would be perfect. Does that all sound good or should we take a different approach?
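As a rough illustration of the plan above, here is a minimal sketch of a greedy, TAGster-style selection that honours a white list, a black list and SNP scores. It is not TAGster's actual implementation. The function name `select_tag_snps`, the `ld_partners` input (a mapping from each SNP ID to the SNPs it is in LD with above some threshold computed upstream) and the tie-breaking order are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Optional, Set


def select_tag_snps(
    ld_partners: Dict[str, Set[str]],
    white_list: Iterable[str] = (),
    black_list: Iterable[str] = (),
    scores: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Greedy tag SNP selection (illustrative sketch, hypothetical API).

    ld_partners maps each SNP ID to the set of SNPs it is in LD with
    above a threshold chosen upstream (an assumption of this sketch).
    """
    scores = scores or {}
    all_snps = set(ld_partners)
    # Each SNP is assumed to tag itself as well as its LD partners.
    covers = {
        snp: (partners | {snp}) & all_snps
        for snp, partners in ld_partners.items()
    }
    tags: List[str] = []
    untagged = set(all_snps)

    # White-listed SNPs are selected up front, whatever their coverage.
    for snp in white_list:
        if snp in covers and snp not in tags:
            tags.append(snp)
            untagged -= covers[snp]

    # Black-listed SNPs can be tagged by others but are never selected.
    candidates = all_snps - set(black_list) - set(tags)
    while untagged and candidates:
        # Prefer the SNP tagging the most untagged SNPs; break ties by
        # score, then by SNP ID so the result is deterministic.
        best = max(
            candidates,
            key=lambda s: (len(covers[s] & untagged), scores.get(s, 0.0), s),
        )
        if not covers[best] & untagged:
            break  # the rest can only be tagged by black-listed SNPs
        tags.append(best)
        untagged -= covers[best]
        candidates.discard(best)
    return tags
```

For example, with three SNPs in mutual LD plus one singleton, `select_tag_snps(ld, scores={"rs2": 0.9})` would pick `rs2` for the LD block and then the singleton. A multi-population version would need a coverage definition across populations, which this sketch leaves out.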

dfornika commented 7 years ago

I suggest we write some code, which does what TAGster does, except we make it applicable to whole genome sequence data from multiple populations and make it accept a white list and black list of SNPs and some list of "SNP scores" for deciding between two otherwise equally good tag SNPs. Does that sound like a good plan?

It sounds good to me. Which file format(s) would be our input? Will we be starting with vcf files?

jocelynjyl commented 7 years ago

Hello all,

Apologies for my belated introduction. My first degree was in microbiology, and I worked as a lab tech for a few years at the Genome Sciences Centre. That got me interested in bioinformatics and I'm now back at school doing a second degree in computer science. I'm really excited to be part of a bioinformatics hackathon and looking forward to meeting everyone.

Other than Python 3, are there any other programming environment related things recommended for us to set up beforehand?

Best, Jocelyn

alyeffy commented 7 years ago

Hi everyone! Didn't notice this until now, so here's my introduction:

I'm currently in my last year at UBC, completing a combined major in Microbiology and Computer Science. I've worked some co-op terms involving bioinformatics, data analysis and statistics and have gained a proficiency in R from them. In terms of genome-specific projects I have worked with Gene Ontologies from the yeast genome. I'm really interested in the overlap of life sciences with computer science and I am hoping to pursue further education and a career in that in future. I also have experience with Agile software development and participated in my first hackathon recently (lumohacks). I'm excited to be a part of this hackathon as I have never heard of any genomics-related ones until now. Other than R, I have experience with Java, C++, Javascript, HTML5, CSS, SQL and Shell scripting. I haven't played around with Python that much but I am definitely able to pick it up. Looking forward to meeting you all :)

Alyssa

vmon588 commented 7 years ago

Hello everyone. I have been checking the slack page thinking we were communicating over there. I am up to speed with Python. TAGster idea sounds good. Any luck on the SNP lists? Does this qualify? http://archive.broadinstitute.org/mpg/snap/doc.php#SnpDataSet http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html

-Vincent

tommycarstensen commented 7 years ago

Hi all. I've been busy preparing results for my poster for ASHG and putting it all together (I still am), probably like the rest of you. I am looking forward to meeting you all at Hackseq despite not being as prepared as I wanted to be. I'll try to put together a presentation with the elements I hope for us to cover. I imagine 3-5 distributed tasks:

1) Calculation of LD values with existing code (e.g. PLINK) or new Python code (medium task).
2) Selection of tag SNPs from phased (and unphased) data with new code (big task).
3) Automated calling of the LD based method and IMPUTE2 in each cycle of a hybrid algorithm, if we choose to go down this path, and making all elements of the code play well together (big task).
4) Evaluation of the hybrid and strictly LD based algorithms (big task).
5) Evaluation of our algorithm (speed and accuracy) against existing algorithms, e.g. TAGster (big task).
6) Writing a paper (medium task).

Whether we will make it through all of these tasks I don't know. I doubt it, but let's be ambitious rather than run out of things to do on the last day.
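For task 1, if new Python code is preferred over PLINK, the pairwise LD statistic r² could be computed from phased haplotypes roughly as follows. This is a minimal sketch: the function name `r_squared`, the 0/1 allele coding and the choice to return 0.0 for monomorphic sites are assumptions of the example, not something agreed in this thread.

```python
from typing import Sequence


def r_squared(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """r^2 between two biallelic SNPs from phased haplotypes.

    Each sequence holds one allele per haplotype, coded 0 (reference)
    or 1 (alternate), in the same haplotype order for both SNPs.
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n  # alternate allele frequency at SNP A
    p_b = sum(hap_b) / n  # alternate allele frequency at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # freq. of 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # r^2 is undefined for a monomorphic site; 0.0 is returned here
        # purely as a convention of this sketch.
        return 0.0
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    return d * d / denom
```

Computing this for every pair of SNPs is quadratic in the number of SNPs, so a real implementation would likely restrict comparisons to a window around each SNP and use vectorised arithmetic.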

It would be great if everyone could read the papers I mentioned on Sep 10, but I totally understand if you are busy with your ASHG talks and posters.

It sounds good to me. Which file format(s) would be our input? Will we be starting with vcf files?

Yes, let's start with VCF files. Let's use the 1000G dataset as input. It can be downloaded here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502
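To make the input format concrete, here is a minimal sketch of reading phased genotypes from a VCF file using only the standard library. The function name `read_phased_haplotypes` is hypothetical, and the decision to keep only biallelic SNPs with fully phased, non-missing diploid calls is an assumption of the example rather than a project decision.

```python
import gzip
from typing import Iterator, List, Tuple


def read_phased_haplotypes(path: str) -> Iterator[Tuple[str, int, str, List[int]]]:
    """Yield (chrom, pos, snp_id, haplotype alleles) per biallelic SNP.

    Alleles are 0/1 integers, two per sample, in sample order. Sites that
    are multiallelic, not SNPs, or contain missing or unphased genotypes
    are skipped (an assumption of this sketch).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos, snp_id, ref, alt = fields[:5]
            if len(ref) != 1 or len(alt) != 1:
                continue  # indels, symbolic or multiallelic ALT alleles
            fmt = fields[8].split(":")
            if "GT" not in fmt:
                continue
            gt_index = fmt.index("GT")
            alleles: List[str] = []
            for sample in fields[9:]:
                # Phased genotypes use "|" as the allele separator.
                alleles.extend(sample.split(":")[gt_index].split("|"))
            if any(a not in ("0", "1") for a in alleles):
                continue  # missing ("."), unphased ("0/1") or other calls
            yield chrom, int(pos), snp_id, [int(a) for a in alleles]
```

Usage would look like `for chrom, pos, snp_id, haps in read_phased_haplotypes("chr22.vcf.gz"): ...`, where the file name is a placeholder. For whole-genome data a dedicated VCF library would probably be faster than line-by-line parsing.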

Other than Python 3, are there any other programming environment related things recommended for us to set up beforehand?

I've asked the meeting organisers to install IMPUTE2, but I'm quite confident that we can get the tools we need quickly. I'm trying to get us all access to the National Center for Supercomputing Applications.

I have been checking the slack page thinking we were communicating over there.

I learned about the Slack page two days ago. I have asked the organisers for an invitation.

TAGster idea sounds good. Any luck on the SNP lists? Does this qualify? http://archive.broadinstitute.org/mpg/snap/doc.php#SnpDataSet

We are going to do our own LD based hybrid tag SNP selection from scratch.

http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html

I suggest we use 1000G phase 3, unless someone has a better suggestion.

shaze commented 7 years ago

Tommy

I’ll check whether we can use the request list for the H3A project.

Scott