josiahseaman / FluentDNA

FluentDNA allows you to browse sequence data of any size using a zooming visualization similar to Google Maps. You can use FluentDNA as a standalone program or as a python module for your own bioinformatics projects.

Use all Alignment information from Chain Files #14

Closed josiahseaman closed 8 years ago

josiahseaman commented 8 years ago

Follow-up to #12. Mostly depends on #15.

As of 8bcb6f87740f2b00c5a0d7c81e6006d3a56b4127, DDV only reads alignments from the "main" chain entry. This is the biggest entry: it covers most of the chromosome, in order, and assumes the same chromosome strand. However, there are many other chain entries with smaller alignments that pull in sequence from disparate locations. There should be a switch that lets the user choose whether to pull in these secondary chain entries as well.

josiahseaman commented 8 years ago

#23 will be possible once this is done. We could have a command-line argument for the types of information used, and mark the chosen setting in the visualization HTML for reproducibility.
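
A minimal sketch of what that switch could look like, assuming an `argparse` based command line. The flag name `--chain-entries`, its choices, and the `ddv-settings` meta tag are all hypothetical and are not existing DDV options:

```python
import argparse
import html


def build_parser():
    """Build a parser with a hypothetical switch for secondary chain entries."""
    parser = argparse.ArgumentParser(description="Chain file alignment options (sketch)")
    parser.add_argument(
        "--chain-entries",            # hypothetical flag name
        choices=["main", "all"],
        default="main",
        help="use only the main chain entry, or also the secondary entries",
    )
    return parser


def settings_meta_tag(args):
    """Serialize the parsed settings into a meta tag for the output HTML."""
    settings = " ".join(f"{key}={value}" for key, value in sorted(vars(args).items()))
    # html.escape also escapes quotes, so the value is safe inside an attribute.
    return f'<meta name="ddv-settings" content="{html.escape(settings)}">'


if __name__ == "__main__":
    parsed = build_parser().parse_args(["--chain-entries", "all"])
    print(settings_meta_tag(parsed))
```

Writing the settings into the page itself means a visualization can be traced back to the options that produced it without a separate log file.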

josiahseaman commented 8 years ago

There are many different cases to handle here. I just discovered one I hadn't thought of: same chr, same strand, but not in synteny.

Same strand, but no synteny

chain 334143 chr20 61729293 + 8816039 8821476 chr20 64444167 + 8921767 8927528 42449
42  5   0
434 0   1
172 1   0
23  0   1
129 0   1
1129    2   0
184 1   0
38  0   4
90  2   0
382 1   0
117 0   18
31  0   3
3   0   10
105 1398    1695
1129    0   1
19

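
For reference, here is a rough sketch of how an entry like the one above can be walked into ungapped blocks. It follows the chain format layout (header line, then `size dt dq` lines, then a final `size` line); the names `parse_chain_entry` and `ChainHeader` are made up for this example and are not DDV code:

```python
from typing import NamedTuple

EXAMPLE_CHAIN = """\
chain 334143 chr20 61729293 + 8816039 8821476 chr20 64444167 + 8921767 8927528 42449
42 5 0
434 0 1
172 1 0
23 0 1
129 0 1
1129 2 0
184 1 0
38 0 4
90 2 0
382 1 0
117 0 18
31 0 3
3 0 10
105 1398 1695
1129 0 1
19
"""


class ChainHeader(NamedTuple):
    score: int
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def parse_chain_entry(text):
    """Return (header, blocks, end_positions) for a single chain entry.

    Each block is (t_start, q_start, size). On a '-' strand the coordinates
    are relative to the reverse-complemented sequence, as in the chain format.
    """
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    head = rows[0]
    if head[0] != "chain":
        raise ValueError("entry must start with a 'chain' header line")
    header = ChainHeader(
        int(head[1]), head[2], int(head[3]), head[4], int(head[5]), int(head[6]),
        head[7], int(head[8]), head[9], int(head[10]), int(head[11]), head[12],
    )
    blocks = []
    t_pos, q_pos = header.t_start, header.q_start
    for fields in rows[1:]:
        size = int(fields[0])
        blocks.append((t_pos, q_pos, size))
        t_pos += size
        q_pos += size
        if len(fields) == 3:          # the last line carries only a size
            t_pos += int(fields[1])   # dt: gap in the target before the next block
            q_pos += int(fields[2])   # dq: gap in the query before the next block
    return header, blocks, (t_pos, q_pos)


if __name__ == "__main__":
    header, blocks, ends = parse_chain_entry(EXAMPLE_CHAIN)
    print(len(blocks), sum(size for _, _, size in blocks), ends)
```

A useful sanity check is that the walk ends exactly at the `tEnd` and `qEnd` values given in the header.
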
josiahseaman commented 8 years ago

I guess I should have expected this, but a lot of these alignments are really small. image

josiahseaman commented 8 years ago

There's going to be an issue with the inserts being larger than the gap allowed for them. image

First visualization of in-chromosome translocations. I'm using green as a delimiter right now. It looks like there are about 400,000 bp worth of translocations (probably double that) on chr20. That's 0.6% of the chromosome, maybe as high as 1%. The good news is that we haven't been missing out on much from that source.
image
Also notice the long tail of green with absurdly short sequences inside. It's aligning 27 bp sequences because that's statistically significant, though I doubt these are biologically plausible translocations.
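
The percentage estimate and a possible length cutoff can be sketched as follows. The `min_length` default of 100 bp is an arbitrary placeholder rather than a value anyone has settled on, and both function names are invented for illustration:

```python
def percent_of_chromosome(bases, chromosome_size):
    """Express a base count as a percentage of the chromosome length."""
    if chromosome_size <= 0:
        raise ValueError("chromosome_size must be positive")
    return 100.0 * bases / chromosome_size


def drop_short_alignments(aligned_lengths, min_length=100):
    """Keep only alignments of at least min_length bases (placeholder cutoff)."""
    return [length for length in aligned_lengths if length >= min_length]


if __name__ == "__main__":
    # 400,000 bp against the 64,444,167 bp chr20 size from the chain header.
    print(round(percent_of_chromosome(400_000, 64_444_167), 1))
    print(drop_short_alignments([27, 4027, 31, 250]))
```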

josiahseaman commented 8 years ago

First working reverse complement alignment! image
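
The core of the strand handling can be sketched like this. In the chain format, coordinates on a `-` strand are given relative to the reverse-complemented sequence, so they have to be flipped before slicing the forward sequence. This is an illustrative sketch with made-up function names, not the code in the commit:

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_forward_coords(start, end, size):
    """Convert a reverse-strand interval into forward-strand coordinates."""
    return size - end, size - start


def fetch_block(seq, strand, start, end):
    """Fetch the bases for an interval given in the chain's strand coordinates."""
    if strand == "+":
        return seq[start:end]
    f_start, f_end = to_forward_coords(start, end, len(seq))
    return reverse_complement(seq[f_start:f_end])


if __name__ == "__main__":
    print(fetch_block("AAACCCGT", "-", 0, 3))
```

Fetching on the `-` strand this way gives the same bases as reverse-complementing the whole sequence first and then slicing, without materializing the full reversed chromosome.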

Turns out there are incredibly sparse, low-quality chains in the file. It looks like 9 individual alignments that just happened (by chance) to have the same synteny across vast distances. So 99% of the displayed sequence is just intervening sequence. I'll need to special-case that sort of thing: image
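
One way to detect these would be to compare the aligned bases to the total span of the chain. The sketch below assumes that approach; the `min_fraction` threshold of 0.05 is a guess that would need tuning against real chain files, and the function names are hypothetical:

```python
def aligned_fraction(block_sizes, span):
    """Fraction of a chain's span that is covered by aligned blocks."""
    if span <= 0:
        raise ValueError("span must be positive")
    return sum(block_sizes) / span


def is_sparse(block_sizes, span, min_fraction=0.05):
    """Flag chains that are mostly intervening sequence (threshold is a guess)."""
    return aligned_fraction(block_sizes, span) < min_fraction


if __name__ == "__main__":
    # The chr20 entry quoted earlier: 4027 aligned bases over a 5437 bp target span.
    print(is_sparse([4027], 5437))
    # A made-up chain of 9 tiny alignments scattered across 1 Mbp.
    print(is_sparse([30] * 9, 1_000_000))
```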

josiahseaman commented 8 years ago

image
So many coordinate frames! I'm trying to put labels on the non-syntenic fragments. This is a bug that I thought was cool looking.

josiahseaman commented 8 years ago

After looking at this some more, I realize the stuttering is caused by newlines in unexpected places. A newline will start a new baseline x value without resetting. So cleaning up the contig-style code I abandoned will fix the stuttering too.