Generate FASTA from pairwise alignment "chain" files

josiahseaman / FluentDNA

FluentDNA allows you to browse sequence data of any size using a zooming visualization similar to Google Maps. You can use FluentDNA as a standalone program or as a python module for your own bioinformatics projects.

Generate FASTA from pairwise alignment "chain" files #12

Closed: josiahseaman closed this issue 8 years ago

josiahseaman commented 8 years ago

Follow-up to #11. DDV needs FASTA files in order to show aligned chromosomes.
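
For context, here is a minimal sketch of reading the UCSC "chain" format that this issue refers to. It is not the FluentDNA code; `Chain` and `parse_chains` are hypothetical names used only for illustration. It assumes the standard layout: a header line starting with `chain`, followed by alignment lines of `size dt dq` (ungapped block length, gap in the reference, gap in the query), with a final line holding only `size`. Strand handling is left out.

```python
from typing import List, NamedTuple, Tuple


class Chain(NamedTuple):
    """Hypothetical container for one chain; not a FluentDNA class."""
    ref_name: str
    ref_start: int
    query_name: str
    query_start: int
    blocks: List[Tuple[int, int, int]]  # (size, ref_gap, query_gap)


def parse_chains(text: str) -> List[Chain]:
    """Parse chain-format text into a list of Chain records."""
    chains: List[Chain] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank separator lines and comments
        if fields[0] == "chain":
            # chain score tName tSize tStrand tStart tEnd
            #       qName qSize qStrand qStart qEnd id
            chains.append(Chain(fields[2], int(fields[5]),
                                fields[7], int(fields[10]), []))
        elif chains:
            size = int(fields[0])
            # The last line of a chain has only the block size, no gaps.
            ref_gap = int(fields[1]) if len(fields) == 3 else 0
            query_gap = int(fields[2]) if len(fields) == 3 else 0
            chains[-1].blocks.append((size, ref_gap, query_gap))
    return chains


example = (
    "chain 4900 chrY 58368225 + 25985403 25985638 "
    "chr5 151006098 - 43257292 43257528 1\n"
    "9 1 0\n"
    "10 0 5\n"
    "16\n"
)
for chain in parse_chains(example):
    print(chain.ref_name, chain.query_name, chain.blocks)
```

Note that when the query strand is `-`, the query coordinates refer to the reverse complement, so a real converter would need to handle that before slicing the FASTA sequence.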

josiahseaman commented 8 years ago

A preview of what gaps look like and how the background color helps you identify what is being gapped.

josiahseaman commented 8 years ago

That's something... but the second file is still not gapping correctly.

josiahseaman commented 8 years ago

I had to take the Edison approach to solving this: I found 100 wrong ways to align FASTA files and one correct way:

josiahseaman commented 8 years ago

There is an important but subtle difference depending on which file you use as the reference frame: you can end up dropping sequence from your query that is not in the reference. In this example, the sequence bracketing the v38 N block is lost on the left, where v19 is used as the reference.
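
To make the reference-frame effect concrete, here is a small sketch under simplifying assumptions: one chain, forward strand, both sequences starting at offset 0. The function name `gap_sequences` and the flag `keep_query_only` are made up for this example and are not part of FluentDNA. With `keep_query_only=False` the output stays in the reference's coordinate frame, so bases that exist only in the query are dropped; with `True` a gap is opened in the reference so they are kept.

```python
from typing import List, Tuple


def gap_sequences(ref: str, query: str,
                  blocks: List[Tuple[int, int, int]],
                  keep_query_only: bool = True) -> Tuple[str, str]:
    """Return gapped (ref, query) strings for (size, ref_gap, query_gap) blocks."""
    ref_out, query_out = [], []
    r = q = 0  # current positions in ref and query
    for size, ref_gap, query_gap in blocks:
        # Aligned, ungapped block: copy both sequences.
        ref_out.append(ref[r:r + size])
        query_out.append(query[q:q + size])
        r += size
        q += size
        # Bases present only in the reference: pad the query with gaps.
        ref_out.append(ref[r:r + ref_gap])
        query_out.append("-" * ref_gap)
        r += ref_gap
        # Bases present only in the query: keep them only if requested.
        if keep_query_only:
            ref_out.append("-" * query_gap)
            query_out.append(query[q:q + query_gap])
        q += query_gap
    return "".join(ref_out), "".join(query_out)


ref = "AAAACCGGGGTT"
query = "AAAAGGGGNNNTT"
blocks = [(4, 2, 0), (4, 0, 3), (2, 0, 0)]

both = gap_sequences(ref, query, blocks, keep_query_only=True)
ref_frame = gap_sequences(ref, query, blocks, keep_query_only=False)
print(both)       # ('AAAACCGGGG---TT', 'AAAA--GGGGNNNTT')
print(ref_frame)  # ('AAAACCGGGGTT', 'AAAA--GGGGTT')
```

In the reference-frame output the query's `NNN` run disappears entirely, which is the kind of loss described above.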

Now how do I get both in one image?

josiahseaman commented 8 years ago

This is mostly done, but there are two features I'd consider:

Find places where the gaps on both sides are next to each other and close them down to the minimal gap size. This isn't as simple as it sounds because the gaps aren't necessarily on the same line of the chain file. It'll take some additional tracking variables.
When we're opening a gap, check whether the other sequence is just N's. If it's just N's, it's okay to delete the N's and close the gap while maintaining synchrony. Also, synchronizing centromere N's together would be nice.
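
The second idea could look roughly like the sketch below. This is one possible reading of it, not the implemented behavior: `resolve_unmatched` is a hypothetical helper that takes the bases present in only one sequence and decides whether to open a gap opposite them or, if they are all N's, drop them so that no gap is needed.

```python
from typing import Tuple


def resolve_unmatched(unmatched: str) -> Tuple[str, str]:
    """Return (bases_to_keep, gap_for_other_sequence) for an unmatched run.

    If the run is entirely N's, drop it and open no gap, so the two
    output sequences stay the same length. Otherwise keep the bases
    and pad the other sequence with '-' characters.
    """
    if unmatched and set(unmatched.upper()) == {"N"}:
        return "", ""
    return unmatched, "-" * len(unmatched)


print(resolve_unmatched("NNNN"))  # ('', '')
print(resolve_unmatched("ACNT"))  # ('ACNT', '----')
```

Dropping the N's changes the coordinates of everything downstream in that sequence, so a real implementation would need to decide whether that is acceptable for the visualization.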

All of this gets a lot more complicated when we start doing multiple chain files...

josiahseaman commented 8 years ago