Closed josiahseaman closed 8 years ago
There are many different cases to handle here. I just discovered one I hadn't thought of: same chr, same strand, but not in synteny.
chain 334143 chr20 61729293 + 8816039 8821476 chr20 64444167 + 8921767 8927528 42449
42 5 0
434 0 1
172 1 0
23 0 1
129 0 1
1129 2 0
184 1 0
38 0 4
90 2 0
382 1 0
117 0 18
31 0 3
3 0 10
105 1398 1695
1129 0 1
19
I guess I should have expected this, but a lot of these alignments are really small.
There's going to be an issue with the inserts being larger than the gap allowed for them.
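To make the gap problem concrete, here is a minimal sketch that parses the chain entry quoted above and lists the blocks where the query-side gap (`dq`) is larger than the target-side gap (`dt`). It assumes the standard UCSC chain layout (header, then `size dt dq` lines, with a final line holding only `size`). `parse_chain` and `Block` are hypothetical names for illustration, not DDV code, and reading "inserts larger than the gap" as `dq > dt` is my interpretation.

```python
from typing import NamedTuple

CHAIN_TEXT = """\
chain 334143 chr20 61729293 + 8816039 8821476 chr20 64444167 + 8921767 8927528 42449
42 5 0
434 0 1
172 1 0
23 0 1
129 0 1
1129 2 0
184 1 0
38 0 4
90 2 0
382 1 0
117 0 18
31 0 3
3 0 10
105 1398 1695
1129 0 1
19
"""


class Block(NamedTuple):
    size: int  # ungapped aligned bases
    dt: int    # unaligned target bases before the next block
    dq: int    # unaligned query bases before the next block


def parse_chain(text):
    """Parse one chain entry into (header dict, list of Block)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    head = rows[0]
    if head[0] != "chain" or len(head) != 13:
        raise ValueError("not a chain header line")
    header = {
        "score": int(head[1]),
        "tName": head[2], "tSize": int(head[3]), "tStrand": head[4],
        "tStart": int(head[5]), "tEnd": int(head[6]),
        "qName": head[7], "qSize": int(head[8]), "qStrand": head[9],
        "qStart": int(head[10]), "qEnd": int(head[11]),
        "id": head[12],
    }
    blocks = []
    for fields in rows[1:]:
        nums = [int(f) for f in fields]
        nums += [0] * (3 - len(nums))  # the last line has only a size
        blocks.append(Block(*nums))
    return header, blocks


header, blocks = parse_chain(CHAIN_TEXT)

# Sanity check: blocks plus gaps should add up to the header spans.
t_span = sum(b.size + b.dt for b in blocks)
q_span = sum(b.size + b.dq for b in blocks)
assert t_span == header["tEnd"] - header["tStart"]
assert q_span == header["qEnd"] - header["qStart"]

# Blocks followed by a query gap that is wider than the target gap.
wider_in_query = [b for b in blocks if b.dq > b.dt]
for b in wider_in_query:
    print(f"after {b.size}bp block: dt={b.dt}, dq={b.dq}, excess={b.dq - b.dt}")
```

For this entry the big one is the `105 1398 1695` block: the query gap is 297bp wider than the target gap, so a layout that only reserves `dt` columns would have nowhere to put those bases.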
First visualization of in-chromosome translocations. I'm using green as a delimiter right now. Looks like there's about 400,000bp worth of translocations (probably double that) on Chr20. That's 0.6% of the chromosome, maybe as high as 1%. The good news is that this means we haven't been missing out on much from that source.
Also notice the long tail of green with absurdly short sequences inside. It's aligning 27bp sequences since that's statistically significant, though I doubt each one is a biologically plausible translocation.
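One way to trim that long tail would be a minimum-size filter on chain entries. This is only a sketch: `filter_short_chains` and the 100bp default are hypothetical (not something DDV has or a threshold I've validated), and the `toy_short` entry is made up to stand in for one of the 27bp alignments.

```python
def filter_short_chains(chains, min_aligned=100):
    """Keep chains whose ungapped blocks sum to at least min_aligned bases.

    chains maps a chain id to the list of its block sizes.
    """
    return {
        chain_id: sizes
        for chain_id, sizes in chains.items()
        if sum(sizes) >= min_aligned
    }


chains = {
    # Block sizes from the chain entry quoted earlier in this issue.
    "42449": [42, 434, 172, 23, 129, 1129, 184, 38,
              90, 382, 117, 31, 3, 105, 1129, 19],
    # Made-up entry standing in for a single 27bp alignment.
    "toy_short": [27],
}

kept = filter_short_chains(chains)
print(sorted(kept))
```

The threshold would need tuning against real data, since a hard cutoff also throws away short alignments that are genuine.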
First working reverse complement alignment!
Turns out there are incredibly sparse, low-quality chains in the file. It looks like 9 individual alignments that just happened (by chance) to have the same synteny across vast distances. So 99% of the displayed sequence is just intervening sequence. I'll need to special case that sort of thing.
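A possible way to detect these for special casing is to compare aligned bases against the span the chain covers. This is a sketch under assumptions: `chain_density`, `is_sparse`, and the 5% cutoff are hypothetical names and values, and the sparse example uses made-up numbers shaped like the case described above (9 small alignments over a huge span).

```python
def chain_density(t_start, t_end, block_sizes):
    """Fraction of the target span that is actually aligned."""
    span = t_end - t_start
    if span <= 0:
        raise ValueError("chain span must be positive")
    return sum(block_sizes) / span


def is_sparse(t_start, t_end, block_sizes, min_density=0.05):
    """True when most of the span is intervening, unaligned sequence."""
    return chain_density(t_start, t_end, block_sizes) < min_density


# The chain entry quoted earlier in this issue: mostly aligned.
dense_sizes = [42, 434, 172, 23, 129, 1129, 184, 38,
               90, 382, 117, 31, 3, 105, 1129, 19]
dense = chain_density(8816039, 8821476, dense_sizes)

# Made-up sparse chain: 9 blocks of 30bp spread over 27,000bp.
sparse = chain_density(0, 27000, [30] * 9)

print(round(dense, 2), round(sparse, 2))
```

Sparse chains could then be drawn as separate fragments instead of one continuous stretch padded with intervening sequence.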
So many coordinate frames! I'm trying to put labels on the non-syntenic fragments. This is a bug that I thought was cool looking.
After looking at this some more, I realize this stuttering is caused by newlines in unexpected places. A newline will start a new baseline x value without resetting. So cleaning up the contig style code I abandoned will fix the stuttering too.
Followup #12. Mostly depends on #15.
Currently (8bcb6f87740f2b00c5a0d7c81e6006d3a56b4127), DDV only reads in alignments from the "main" chain entry. This is the biggest entry, the one that covers most of the chromosome, in order, and assuming the same chromosome strand. However, there are a bunch of other chain entries with smaller alignments that pull in sequence from disparate locations. There should be a switch that lets the user choose whether or not to pull in these secondary chain entries as well.
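Roughly what I have in mind for the switch, as a sketch only. `select_chains` and `include_secondary` are hypothetical names, picking the "main" entry by largest target span is an assumption about how to identify it, and the `toy_main` header is made up (the second header is the real entry quoted earlier).

```python
def select_chains(chains, include_secondary=False):
    """Return the main chain, optionally followed by the secondary ones.

    chains is a list of header dicts with at least tStart and tEnd.
    The main chain is assumed to be the one with the largest target span.
    """
    if not chains:
        return []
    main = max(chains, key=lambda c: c["tEnd"] - c["tStart"])
    if not include_secondary:
        return [main]
    return [main] + [c for c in chains if c is not main]


chains = [
    # Real secondary entry from earlier in this issue.
    {"id": "42449", "tName": "chr20", "tStart": 8816039, "tEnd": 8821476},
    # Made-up stand-in for the chromosome-spanning main entry.
    {"id": "toy_main", "tName": "chr20", "tStart": 0, "tEnd": 60000000},
]

default_ids = [c["id"] for c in select_chains(chains)]
all_ids = [c["id"] for c in select_chains(chains, include_secondary=True)]
print(default_ids, all_ids)
```

Defaulting to off keeps the current behavior, so existing output wouldn't change unless the user asks for the secondary entries.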