hackseq / Indicator_contig_predictor

A two-way classifier to characterize metagenomes based on short and long read technologies

Welcome to the project! #1

Closed DCGenomics closed 7 years ago

DCGenomics commented 8 years ago

For those of you who are new to this space, perhaps start here:

https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads

For those who are experienced, perhaps mention other relevant tools in this string?

I know this is a bit simplistic, but I would like to get the discussion going!

Also, I'll likely set up a gitter or google group for us to pass info in the next few days. Comments?

Cheers!

xfhxfyyaxw commented 8 years ago

Many thanks:)

JustinChu commented 8 years ago

What is our input? A set of contigs (also, what type of assembly tool & sequencing data is used?) and PB reads? Are the reads CCS?

Overall, the project seems like it will probably be straightforward if we use existing methods:

  1. Using a scaffolding tool like LINKS or SSPACE-LongRead and
  2. gap filling with PBJelly

Do we have test datasets with a ground truth (known reference genome)? Or should we generate simulated datasets first?

If we are generating test datasets, we could use PBsim to simulate the long reads. It is less clear what we can use for simulating a set of contigs (biases in gaps depend on the type of technology used).
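To make the idea of a simulated dataset with a ground truth concrete, here is a minimal Python sketch. It is a toy stand-in and not PBsim: it samples reads from a known reference and adds substitution errors only, whereas real PacBio errors are dominated by insertions and deletions. The function name `simulate_reads` and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate_reads(reference, n_reads, mean_len, error_rate, seed=0):
    """Sample reads from `reference` and add random substitution errors.

    Returns a list of (start, read) tuples so that the true origin of each
    read is known (the "ground truth").
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Draw a read length around mean_len, clamped to [1, len(reference)].
        length = int(rng.gauss(mean_len, mean_len * 0.2))
        length = max(1, min(len(reference), length))
        start = rng.randint(0, len(reference) - length)
        read = list(reference[start:start + length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Substitute with a different base (toy error model).
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(read)))
    return reads


# Toy reference: a random 5 kb sequence (hypothetical data).
ref_rng = random.Random(42)
reference = "".join(ref_rng.choice("ACGT") for _ in range(5000))
reads = simulate_reads(reference, n_reads=20, mean_len=1000, error_rate=0.1)
print(len(reads), "reads simulated")
```

Because each read carries its true start position, any scaffolding or gap-filling result can later be checked against the reference.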

syhackseq2016 commented 8 years ago

Thanks @DCGenomics for initiating the discussion! And yes, I think gitter or slack would be great for team communication.

I am relatively new to this field. May I ask whether the goal of the project is to develop a pipeline that glues together existing de novo assembly software for bacterial genomes? Would the pipeline be assessed by its accuracy or speed (or both)?

syhackseq2016 commented 8 years ago

@JustinChu Great questions! That's also what I wanted to ask :-)

JustinChu commented 8 years ago

@syhackseq2016 Unless we have performance issues where it takes more than an hour to run our code, I'd say we shouldn't worry too much about performance. We can optimize after we get a pipeline working and generating decent results.

For evaluating assemblies, Quast is commonly used. We might also want to generate a dotplot or something similar to show that we at least scaffold correctly.
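One of the contiguity metrics that assembly evaluation tools typically report is N50. As a rough illustration (a sketch of the standard definition, not Quast's code), it can be computed from a list of contig lengths like this; the example lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L of the contig at which, going from the longest
    contig downwards, the running total first reaches at least half of
    the total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths: total is 1500, so half is 750.
contig_lengths = [100, 200, 300, 400, 500]
print(n50(contig_lengths))  # 400 (500 + 400 = 900 >= 750)
```

Comparing N50 before and after scaffolding gives a quick sanity check, although a higher N50 alone does not show that the joins are correct, which is where the dotplot comes in.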

syhackseq2016 commented 8 years ago

@JustinChu Thanks!

Just found this post at https://github.com/hackseq/hackseq_projects_2016/issues/1 with some extra info about our project.

jmicrobe commented 8 years ago

Greetings,

Great questions so far. Most of my experience is with 16S rRNA amplicons, and lately I've been focusing on reproducible pipeline practices. I have experience with GNU Make, Snakemake, and I'm eager to learn/work with docker. I'm familiar with slack and google groups for communication, but I'm open to any tool 😀.

I'll arrive in town late on the 14th and would be free to meet after that.

Jess

JustinChu commented 8 years ago

@hochoy Though I'm in Vancouver, many members of the team are not going to be in Vancouver until right before ASHG. Alternatively, we could have those members meet up with us electronically.

@syhackseq2016 Thanks for that link. It is a good resource, though it raises more questions than it answers. They seem to be bringing up some assembly algorithms (that use either pure PacBio or a hybrid of PacBio and Illumina) in the discussion, but the project description says genome closing. I think this could be resolved once the project lead (@DCGenomics) lets us know what our expected input/output is.

I've noticed people are introducing themselves a bit here. My name is (unsurprisingly) Justin and I am a PhD student in the Bioinformatics Technology Lab at the Genome Sciences Centre (GSC). I work on algorithm development for sequence classification, de novo assembly and other sequence analysis tasks. Though the GSC mostly deals in Illumina sequencing, I've worked a bit with long read technology (mostly Oxford Nanopore but some PacBio as well).

I work mostly in C++, R, Perl and Python. I also have experience with make and would recommend make for our initial pipeline. I shamefully hadn't really heard of snakemake until now, but having looked at it, I think learning it could be very useful.

Justin

hamzakhanvit commented 8 years ago

With PacBio SMRT reads as input, this is what I have in mind -

PacBio SMRT reads --> Hierarchical Genome Assembly Process (HGAP) with end trimming and bestn < a coverage threshold (~20X) --> minimus2 to connect contigs --> Quiver for polishing (the SMRT sequencing reads and the initial de novo assembly are the inputs to Quiver) --> FGAP to close gaps --> trim one of the self-similar ends of each contig, owing to the circular nature of bacterial plasmids and genomes --> Quast to evaluate the assembly. The 'Gepard' dotplotting tool can be used for dotplots.
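The circular trimming step can be sketched in a few lines of Python. This is a simplified illustration that assumes the two ends of the contig match exactly; real assemblies contain errors, so in practice an alignment-based comparison would be needed. The function name `trim_circular_overlap` and the `min_overlap` default are assumptions made for this example.

```python
import random


def trim_circular_overlap(contig, min_overlap=20):
    """Trim the redundant end of a contig assembled from a circular molecule.

    Looks for the longest k >= min_overlap such that the first k bases
    equal the last k bases, and removes the trailing copy.
    Returns (trimmed_contig, overlap_length).
    """
    max_k = len(contig) // 2
    for k in range(max_k, min_overlap - 1, -1):
        if contig[:k] == contig[-k:]:
            return contig[:-k], k
    return contig, 0  # no self-similar ends found


# Toy example: a 500 bp "circular genome" whose assembly repeats its
# first 50 bases at the end (hypothetical data).
rng = random.Random(7)
genome = "".join(rng.choice("ACGT") for _ in range(500))
contig = genome + genome[:50]

trimmed, overlap = trim_circular_overlap(contig)
print(len(contig), "->", len(trimmed), "bp after trimming", overlap, "bp")
```

The exact comparison is quadratic in the worst case, which is fine for a sketch but another reason to use an aligner on real contigs.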

Going with @JustinChu's suggestion, we could use 'make' for our pipeline. Waiting for @DCGenomics for a clearer outline.

Related papers and materials to look into:

  1. Liao et al. Completing bacterial genome assemblies: strategy and performance comparisons
  2. Chin et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data
  3. PacBio Training: Finishing Bacterial Genomes

BTW,

My name is Hamza Khan and I am an MSc student in the CIHR Bioinformatics training program at UBC. I work at the Bioinformatics Technology Lab (BTL) at the Genome Sciences Centre in Vancouver on designing and implementing algorithms for sequence analysis. I choose Python over other languages for my day to day work, though most of my projects are in C++. I use R for plotting/data visualisation.

Cheers!

DCGenomics commented 8 years ago

Good evening team!  Your enthusiasm is awesome, and it makes me think there is a high likelihood of not only finishing a useful software product, but also sending out a cool manuscript. 

That said, in my opinion, what makes a hackathon like this strong is a bunch of diverse opinions pushing and pulling to create a software product that turns clever algorithms into an easy-to-use pipeline covering as many use cases as possible.  I suspect that about half of us could take three days and hack something together that would work well for a few use cases, but I would be willing to bet a Dom Pérignon to a Miller Lite that what we can come up with collectively could be much better.

The original use case was "I am a biologist and I have tried to close a bacterial genome with some short reads. It didn't work (repeats, high gc, etc), and I'm thinking of sending it out for long read sequencing, but I want to know that when I get the reads back I can reassemble with high accuracy".

That said, there are a ton of caveats to which tools to use for this. Here's a specific example: say those long reads came from a PacBio-having core. Depending on how much PacBio data one got back, one might consider different assemblers. Obviously, there are folks on our team with a ton of experience in this space.

Also, the use case above isn't the only one in this space. There are lots of other use cases centered around bacterial [meta]genome assembly with short and long reads.

Taken collectively, my opinion is that this should be fairly technically straightforward. We can likely make an outline the first morning and be hammering the crap out of Amazon's servers mid-day 2. What would make this software great is to make it as useful as possible.

So, what I propose is that our homework for the next two weeks is to go and talk to our friends, see what they want to do in this space, and use that to build a master spec sheet, such that more people will use and hopefully contribute to this repo.

Cheers!

PS -- I'm getting some real sample data from a colleague at a place that does some incredible sequencing work. Perhaps other folks have people who would like to give them data for this effort?

sjackman commented 8 years ago

I've created a Slack at hackseq.slack.com, created a channel #project10, and invited everyone.

sjackman commented 8 years ago

Hi, David. I've sent you another invitation. The first invitation was sent to your e-mail address at prostatecentre.com

DCGenomics commented 8 years ago

Thanks, Shaun!

DCGenomics commented 8 years ago

(also, could you tell me which email address you used for me?)

Cheers!

Ben

sjackman commented 8 years ago

I sent the invitation to you at nih.gov

DCGenomics commented 8 years ago

Hi, if anyone is not on the slack channel yet, please jump on!

I've set up a couple of planning docs, and it would be awesome to get a list of things we are going to want on the AWS nodes in advance (there's already a bunch of stuff on this string)!

You guys are awesome!

Ben

sjackman commented 8 years ago

Pilon is popular. It's intended to correct variants. http://software.broadinstitute.org/software/pilon/

KAT compares the k-mer histogram of the reads to that of the assembly. https://github.com/TGAC/KAT
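To illustrate the general idea of comparing k-mer content between reads and an assembly, here is a minimal Python sketch. It is loosely inspired by this kind of comparison and is not KAT's implementation or API: it ignores reverse complements and holds everything in memory. The function names and the toy sequences are assumptions for illustration only.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every k-mer (forward strand only) across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def compare_spectra(reads, assembly, k):
    """Tabulate distinct read k-mers by (copies in assembly, copies in reads).

    Many k-mers with 0 copies in the assembly but a high read count suggest
    missing sequence; k-mers seen only once in the reads are often errors.
    """
    read_counts = kmer_counts(reads, k)
    asm_counts = kmer_counts(assembly, k)
    table = Counter()
    for kmer, n_reads in read_counts.items():
        table[(asm_counts.get(kmer, 0), n_reads)] += 1
    return table


# Toy data (hypothetical): two overlapping reads and a one-contig assembly.
reads = ["ACGTAC", "CGTACG"]
assembly = ["ACGTACG"]
for (asm_copies, read_copies), n in sorted(compare_spectra(reads, assembly, 3).items()):
    print(f"assembly copies={asm_copies} read count={read_copies}: {n} k-mers")
```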

"REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison." http://www.sanger.ac.uk/science/tools/reapr

RAMPART is an assembly pipeline that includes an evaluation step. https://github.com/TGAC/RAMPART