Team Lead + Project info for participant promotion

NoushinN commented 6 years ago

Hi Everyone, Baraa and I interviewed Sam Chorlton last week and as Baraa posted, we thought it was a great project, needing more refinement at the team lead level. Sam got back to us filling all these fields so we probably won't need to contact him again for more info. His response is pasted here and please also note his optional abstract graphics! :)

Listed Team Leader. Optional: contact email, link to website

Sam Chorlton
sam.chorlton@pm.me
https://github.com/chorltsd/REUSE

The Hook. In two sentences advertise the goal of your project and why it would be fun to join this particular team

Help change the world by filtering unneeded sequences from a next-generation sequencing dataset, enriching signal from noise and enabling rapid pathogen discovery, isolation of sequence types (e.g. rRNA), contaminant removal and more.

Project Abstract: (~300 words) Should have been completed with the application but may be revised. Detail what the project is about and what you're going to do.

Filtering unwanted sequences from nucleic acid sequencing data is an important step in many analyses. It has been used to remove technical artefacts (e.g. PhiX), discover known and novel pathogens, isolate nucleic acid types (e.g. rRNA), and remove noise in metagenomic studies. This step significantly improves the speed and quality of subsequent analyses.

Here I propose an end-to-end pipeline (REUSE) for Rapidly Eliminating Unwanted SEquences from large sequencing datasets. REUSE will output the sequences that do not match a reference sequence. This pipeline will be based on previously established techniques for isolating known and novel pathogens among sequencing data. It will seek to dramatically speed up the process, address flaws in other pipelines, and automate it from start to finish. It will likely include a k-mer filter, read alignment, read assembly, and contig alignment. Some of these steps will be based on publicly available tools, such as RNA-STAR and Trinity, whereas others will need to be programmed from the ground up.

The work at HackSeq18 will focus on development of the most novel and needed module, the k-mer filter (k-REUSE). Previous evidence indicates that k-mers can be used to rapidly screen and filter sequences, and that a k-mer of 21 base pairs is sufficient to discriminate between unrelated species (1). Currently published applications, such as Kontaminant (1), Cookiecutter (2), BBDuk (3) and others, have several limitations, including lack of parallelization, high memory requirements (>50 GB for the human genome), and inability to save the reference index to disk. Other techniques, such as read alignment, are too slow to use on large datasets.

The goal of HackSeq18 will be the development of k-REUSE and comparison to other filters. Further development will likely be needed after the hackathon for integration of k-REUSE into the complete REUSE pipeline and ultimate application to extremely large datasets.
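The k-mer filtering idea described in the abstract can be sketched roughly as follows. This is only an illustrative toy in Python, not the k-REUSE implementation: the function names (`build_index`, `matches_reference`, `filter_reads`), the `min_hits` threshold, and the use of an in-memory `set` of canonical k-mers are all assumptions made for the example. A real filter would need the compact, parallel, disk-backed index that the abstract calls for.

```python
from typing import Iterable, Iterator, List, Set

K = 21  # k-mer length mentioned in the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int = K) -> Iterator[str]:
    """Yield every overlapping k-mer of seq, upper-cased."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def canonical(kmer: str) -> str:
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def build_index(references: Iterable[str], k: int = K) -> Set[str]:
    """Collect the canonical k-mers of all reference sequences (hypothetical helper)."""
    index: Set[str] = set()
    for ref in references:
        index.update(canonical(km) for km in kmers(ref, k))
    return index


def matches_reference(read: str, index: Set[str], k: int = K, min_hits: int = 1) -> bool:
    """True if at least min_hits k-mers of the read occur in the reference index."""
    hits = 0
    for km in kmers(read, k):
        if canonical(km) in index:
            hits += 1
            if hits >= min_hits:
                return True
    return False


def filter_reads(reads: Iterable[str], index: Set[str], k: int = K, min_hits: int = 1) -> List[str]:
    """Keep only the reads that do NOT match the reference."""
    return [r for r in reads if not matches_reference(r, index, k, min_hits)]
```

Storing canonical k-mers means a read is recognised regardless of which strand it came from. Reads shorter than `k` produce no k-mers and are therefore kept.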

1. Daly GM, Leggett RM, Rowe W, Stubbs S, Wilkinson M, Ramirez-Gonzalez RH, et al. Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data. PloS One. 2015;10(6):e0129059.
2. Starostina E, Tamazian G, Dobrynin P, O’Brien S, Komissarov A. Cookiecutter: a tool for kmer-based read filtering and extraction. bioRxiv. 2015 Aug 16;024679.
3. Bushnell B. BBTools [Internet]. DOE Joint Genome Institute. [cited 2018 Jul 25]. Available from: https://jgi.doe.gov/data-and-tools/bbtools/

Required Skills: What skills or knowledge is required a priori from the participants to be able to take part on this team

- Strong skills in a fast programming language such as C or C++, or the ability to call fast libraries from within Python or another language. Team members will be needed to program the k-mer based filter.
- If you have the above skills, you likely have GitHub skills.
- Understanding the basics of next-generation sequencing data and formats (e.g. FASTA, FASTQ).
- Ability to have fun.

Optional Skills: What skills or knowledge would be beneficial for the participants to have but is not necessary to take part.

- Overview of existing pathogen discovery pipelines (RINS, SURPI, Kontaminant, Cookiecutter, DeconSeq).
- Ability to write (particularly science).
- Ability to science (particularly to evaluate our pipeline in comparison with existing pipelines, generate comparative graphs and tables, and perform basic statistics).

Image: abstract graphic

NoushinN commented 6 years ago

Also, please find the github page for my proposal here: https://github.com/NoushinN/anatomy-of-morbidity

alexsweeten commented 6 years ago

Listed Team Leader. Optional: contact email, link to website

Rob Gilmore
robgilmore127@gmail.com
https://github.com/datasnakes/beRi

The Hook. In two sentences advertise the goal of your project and why it would be fun to join this particular team

Are you a bioinformatician frustrated by R's inability to resolve common dependency issues? Help develop beRi, a package management, reproducible workflow, and installation toolkit for the R programming language.

Project Abstract: (~300 words) Should have been completed with the application but may be revised. Detail what the project is about and what you're going to do.

The R programming language currently lacks a standard way to resolve common dependency issues, which makes it difficult to reproduce data analyses. The R community therefore needs a tool or toolkit like Python's pip or JavaScript's npm. beRi ("beri environments for R installations") is a package management, reproducible workflow, and installation toolkit for the R programming language. The project will be developed under 4 separate repositories: renv (virtual environments), rinse (source installation), rut (dependency utils), and beRi (core). This project will primarily be built using Python 3, but will also utilize R and bash. R's documentation will be relied on for guidance.

Required Skills: What skills or knowledge is required a priori from the participants to be able to take part on this team

Willingness to communicate openly.
- Can communicate and conceptualize advanced programming paradigms.
- A team player.
- Willing to collaborate with other teams (against the rules?)
- Willing to utilize third party collaboration (against the rules?)
Python 3 (advanced level)
- Using virtual environments
- Utilizing sdispater/poetry for project management
R (intermediate level)
- documentation with rmarkdown
Bash (intermediate level)
- Configuring linux software from source
Git/GitHub
- Ability to plan and document code base
Leadership skills.
- 2-3 captains will be needed to help coordinate, and break up responsibilities.

Optional Skills: What skills or knowledge would be beneficial for the participants to have but is not necessary to take part.

Using Slack for Team Management
Technical writing skills
PyCharm/Rstudio/Atom IDE experience
In depth knowledge of how R functions
- devtools
- packrat
- miniCRAN
- tidyverse
Knowledge of specific Python packages
- click
- yaml
- cookiecutter
Photoshop or graphic design experience
Suggests good beers to purchase in Vancouver

Image: beri-hex

alexsweeten commented 6 years ago

Listed Team Leader. Optional: contact email, link to website

Alex Sweeten
alex.sweeten@gmail.com

The Hook. In two sentences advertise the goal of your project and why it would be fun to join this particular team

Interested in developing a novel bioinformatics pipeline? Given the unprecedented size of genomic information, new methods are required to organize and manage this data. The goal of this project will be to develop and test an alignment-free method, and apply it towards datasets of pathogenic organisms.

Project Abstract: (~300 words) Should have been completed with the application but may be revised. Detail what the project is about and what you're going to do.

Determining similarity between bacterial isolates is an important requirement for epidemiological analysis. Alignment-based genomic methods are commonly used to tackle this problem. However, in cases of low sequence homology, horizontal gene transfer, or lack of a priori information, as is common when dealing with pathogenic bacteria, alignment-based methods pose significant problems. The normalized compression distance (NCD) is a parameter-free and alignment-free distance metric, which has shown recent success in genomics, specifically in classifying viral sequences. Here, I propose to develop a pipeline that allows users to apply NCD to their genomic data and to visualize their results in a presentable and easy-to-read format. This application will be built using a Python framework and developed into a Conda package and/or Galaxy workflow (depending on time required & participant experience). Furthermore, I propose to test/benchmark our pipeline using datasets of pathogenic bacteria and viral sequences. This project will be a novel community-based application, potentially facilitating further research into alignment-free similarity methods.
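For readers unfamiliar with NCD, it is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s. The sketch below is only an illustration of that formula, not the proposed pipeline: the choice of `zlib` as the compressor and the helper names (`compressed_size`, `ncd`, `random_sequence`) are assumptions made for the example, and the toy sequences are randomly generated rather than real genomes.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_sequence(length: int, seed: int) -> bytes:
    """Toy DNA sequence for demonstration only; not real data."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


if __name__ == "__main__":
    a = random_sequence(2000, seed=1)
    c = random_sequence(2000, seed=2)
    # A sequence compared with itself should score much lower than
    # two unrelated sequences.
    print("NCD(a, a) =", round(ncd(a, a), 3))
    print("NCD(a, c) =", round(ncd(a, c), 3))
```

Values near 0 indicate that one sequence adds little new information given the other, while values near 1 indicate unrelated sequences. With real compressors the result is only approximately symmetric and can slightly exceed 1.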

Required Skills: What skills or knowledge is required a priori from the participants to be able to take part on this team

Strong skills in Python
GitHub
Bash

Optional Skills: What skills or knowledge would be beneficial for the participants to have but is not necessary to take part.

Data visualization
Parallel programming
Knowledge of developing Conda packages
Knowledge of Galaxy & Galaxy development
Scientific writing
Clustering algorithms

Image:

I'll make one soon ;)

klgray25 commented 6 years ago

Thanks Alex!

NoushinN commented 6 years ago

Hi @klgray25, Cara and Morgan sent this to us (Sasha and me) today. Cara is still looking for the initial application to make adjustments, but below is what we could use for now:

Project Title Simulating Transcriptome Structural Variants to Produce Benchmarking Datasets

Team Leader(s) Cara Reisle (caralynreisle@gmail.com) & Morgan Bye (morgan@morganbye.com)

The Hook While there are tools to simulate structural variants in genomic data, currently no such applications exist for transcriptomes. Simulated data is required to properly evaluate structural variant callers.

Project Abstract The goal of this project is to produce an application that is able to simulate structural variants in transcriptome data. The application will be written in Python. It will be a pipeline consisting of several steps:
- Specification of the structural variants to be simulated
- Generation of the resulting transcriptome sequence
- Simulating paired-end reads covering the structural variant breakpoints
- Alignment of the simulated reads to the reference genome to produce a BAM file
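The first three steps above could look roughly like the following Python sketch. It is a simplified illustration under stated assumptions, not the team's design: it models only one variant type (a fusion of two transcripts), and the `Fusion` class, `apply_fusion`, and `simulate_spanning_pairs` are hypothetical names invented for this example. Reads are error-free, and the alignment step is left out because it depends on an external aligner.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Fusion:
    """Hypothetical variant spec: five_prime[:break5] joined to three_prime[break3:]."""
    five_prime: str
    break5: int
    three_prime: str
    break3: int


def apply_fusion(transcripts: Dict[str, str], sv: Fusion) -> Tuple[str, int]:
    """Return the fused transcript and the index of the first base after the breakpoint."""
    left = transcripts[sv.five_prime][:sv.break5]
    right = transcripts[sv.three_prime][sv.break3:]
    return left + right, len(left)


def simulate_spanning_pairs(seq: str, breakpoint: int, n_pairs: int,
                            read_len: int = 50, fragment_len: int = 200,
                            seed: int = 0) -> List[Tuple[str, str]]:
    """Sample fragments that contain the breakpoint and return (read1, read2) pairs.

    read1 is the start of the fragment; read2 is the reverse complement
    of its end, mimicking a paired-end layout.
    """
    if read_len > fragment_len:
        raise ValueError("read_len must not exceed fragment_len")
    # Valid fragment starts: the fragment must include bases on both sides
    # of the breakpoint and must fit inside the sequence.
    lo = max(0, breakpoint - fragment_len + 1)
    hi = min(breakpoint - 1, len(seq) - fragment_len)
    if hi < lo:
        raise ValueError("sequence too short for a fragment spanning the breakpoint")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        start = rng.randint(lo, hi)
        fragment = seq[start:start + fragment_len]
        pairs.append((fragment[:read_len], reverse_complement(fragment[-read_len:])))
    return pairs
```

A fixed `seed` keeps the simulated dataset reproducible, which matters when the output is meant to serve as a benchmark.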

Required Skills
- Python (1-star)
- General programming skills (2-stars)
- Basic background knowledge of genetics and sequencing (1-star)

Optional Skills (Helpful but not required)
- Knowledge of structural variants
- Familiarity with BAM/SAM file format
- Familiarity with pysam Python module

Recommended Reading If you’re new to structural variants or working with sequencing data, it would be a good idea to do a bit of background reading. I’ve listed some potentially relevant literature below.
- deFuse (PMID 21625565)
- Chimerascan (PMID 21840877)
- STAR (PMID 26334920)
- BAMSurgeon (PMID 25984700)
- Flux Simulator (PMID 22962361)
- MAVIS (PMID 30016509)
- SQUID (PMID 29650026)

jenjaelin commented 6 years ago

@klgray25 I got this back from Emma (Veena) today:

Project Title: Blockchain and Infectious Diseases
Listed Team Leader: vghorakavi@gmail.com
The Hook: Infectious diseases are widespread and are difficult to track. Blockchain has the capabilities to track infectious diseases without revealing information about the people impacted and with location data.
Project Abstract: Please use what I submitted before
Required Skills: Python: 1, Blockchain: 0, Epidemiology: 1
Optional Skills: Not provided

klgray25 commented 6 years ago

Okay, I'll update the hook for her project! It's a little too late to add another skill requirement though, as that would mess up the response form. I won't be able to add Epidemiology.
