sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net
Other
65 stars 12 forks source link

What reads compose contigs/scaffolds #238

Closed necrolyte2 closed 9 years ago

necrolyte2 commented 9 years ago

We are looking to extract the reads that compose the end result contigs and I'm not sure where or how to get these right now since there is no BAM created nor can I find the file that has this info.

Any help is appreciated.

sebhtml commented 9 years ago

There is an option to write an amos file. Otherwise, you can align the reads on the scaffolds and/or contigs.

necrolyte2 commented 9 years ago

Thanks @sebhtml I'll look into the amos option and how to interpret that format to extract what we are looking for

sebhtml commented 9 years ago

An Amos file tells you where each read is located in a contig.

There is a program named sam2amos, see http://bioinformatics.ninja/blog/2013/12/11/genome-scaffolders-suck/ and https://github.com/ctSkennerton/sam2amos

I think that the sam format is more mature and more popular than the amos format.

I suppose that you will want to do operations on ranges and other similar operations.

necrolyte2 commented 9 years ago

Sorry to poke on this some more, but we are kind of stuck a bit on this issue again

We are essentially just trying to extract all reads that assembled to each Contig that is in the produced Contig.fasta file by Ray.

We are now using the -amos flag, which at first seemed like it would get us what we wanted but it seems there are 2 issues with the AMOS.afg file produced by Ray.

  1. Some CTG blocks do not have any {TLE blocks and thus no reads associated with them
  2. All quality data for all contigs and reads are D

For reference here is an AMOS.afg file created by running Ray on some of our data

Example Contig block

{CTG
iid:60
eid:contig-11000003
com:
Software: Ray, MPI rank: 3
.
seq:
ACGTCCTGTTCCAAGGAACTTAGGTCGGTGGTTACTCAGAAGCATCCTCTACAAATTACAACTCGGACGTCGAAGACACCAGATTTCAAATTTGAGCTATTGCTGCTTCACTCGCCGTTACTAGAGCAATCCCTGTTGGTTTCTTTTC
.
qlt:
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
.
}

Not having the quality is not a huge deal, but we are unsure as how to interpret the {CTG blocks that do not have any associated reads.

sebhtml commented 9 years ago

Quality strings only contain D characters because the Ray code does not store read quality into memory. So that is that.

I have not looked at the code that generates the TLE objects in a while. I remember that the code basically visits each vertex of the contig path and thread any read with annotations in the distributed k-mer graph. This code should take care of both strands. Maybe this is another bug lurking in the code.