[IO] Parsing for genome alignment formats (MAF and HAL)

bkille commented 2 years ago

Big fan of the Seqan3 library, thanks for all of your hard work! I have been using the MSA parser recently and it has worked quite well, however, the majority of my work centers around genome alignments. It would be extremely useful if there were parsers for MAF and HAL alignment formats.

marehr commented 2 years ago

Hi bkille, thank you for your kind words! :)

Can you give us some links to the file specification and some example files to see what the key features are? Maybe also some context in which field they are generally used?

Do you just need a parser or also a "writer"?

Can you give us the link to the MSA Parser?

Is there something that gets in your way of using the MSA Parser with seqan3? Do you have a code snippet / link to your repo that shows the usage of both together and your use case?

bkille commented 2 years ago

It's the least I could do!

Of course, sorry if I missed documentation on how to open a feature request/proposal.

Can you give us some links to the file specification and some example files to see what the key features are?

Here is the official documentation for MAF, with additional docs at BioPython and more basic docs from PSU. The tldr is that these file formats represent a set of local multiple sequence alignments from the same set of sequences.

Here is the github page for the tool that implements the hierarchical alignment format (HAL). The publication describing the format is here.

FWIW, MAF is still more commonly used than HAL I believe.

Maybe also some context in which field they are generally used?

They are used to represent multiple genome alignments. In global MSA, the two sequences are expected to be "co-linear", meaning that if there is an alignment between base pair i from sequence 1 and j from sequence 2, any alignment between n and m is not allowed if (n < i && m > j) || (n > i && m < j). Essentially all alignments must be non-crossing. Multiple genome alignment (MGA) do not have that restriction. This is because genomes are often rearranged, so in order to capture the full homology it is necessary to let any part of one genome be related to any part of another.

Genome alignments are often used in comparative genomics studies. Here is a recent and very brief publication that describes the most recent method for genome alignment. Comparative genomics studies can identify patterns of evolution/selection and provide a better basis for constructing phylogenies, among many other uses. They are becoming more and more popular due to the increase and quality and quantity of genome assemblies.

Do you just need a parser or also a "writer"?

Frankly both would be ideal.

Can you give us the link to the MSA Parser?

So I haven't been using an MSA parser from Seqan3, so much as I have been using the fasta reader to construct MSAs. A multiple sequence alignment is essentially a set of equivalence groups, so I read in each fasta file and assign the nucleotides to homology groups based on the column. Here is my file which parses two input MSA files and computes the accuracy. It's not necessarily the most efficient or clean, though :stuck_out_tongue:

is there something that gets in your way of using the MSA Parser with seqan3?

Frankly, I wasn't aware of an MSA parser in Seqan3 :grimacing:

smehringer commented 2 years ago

Hi @bkille,

thanks a lot for all your input!

Frankly, I wasn't aware of an MSA parser in Seqan3 :grimacing:

To be clear: There is no MSA parser in SeqAn3 :cry: I guess @marehr thought you are already using another MSA parser from another library.

Here is my file which parses two input MSA files and computes the accuracy.

From a quick sweep over the code (which I think do looks quite clean) it looks like reading in gapped fasta serves your purpose well.

In your application, what would be the ideal output of an MSA Parser? The MSA itself as a container of gapped sequnces that you can iterate over column wise?

Unfortunately though, I don't know how soon we can work on this because our group got diminished quite a bit by people finishing their PhD and we need to see how we can distribute our resources.

bkille commented 2 years ago

In your application, what would be the ideal output of an MSA Parser? The MSA itself as a container of gapped sequnces that you can iterate over column wise?

Personally, I don't think Seqan3 really needs a MSA parser. The fasta reader w/ gapped alphabet should satisfy most needs. If anything, a tutorial entry about using type traits to allow reading gapped sequences would be most helpful.

However, the representation of a genome alignment (MAF file) is a more complicated question. I think the Biopython representation is a reasonable start, and don't see any reason why it should be changed. tldr; two levels of iteration: (1) iterate over alignments in the MAF file, (2) iterate over gapped sequences within an alignment.

Unfortunately though, I don't know how soon we can work on this because our group got diminished quite a bit by people finishing their PhD and we need to see how we can distribute our resources.

Ahh, I'm sorry to hear that :slightly_frowning_face:. Considering that I will without a doubt be in need of a MAF parser for my PhD, I would be happy to help contribute! After all, if I don't get it from Seqan3, I will have to write it myself.

smehringer commented 2 years ago

Ahh, I'm sorry to hear that :slightly_frowning_face:. Considering that I will without a doubt be in need of a MAF parser for my PhD, I would be happy to help contribute! After all, if I don't get it from Seqan3, I will have to write it myself.

That's a most generous offer we'd love to take! I'll put it up for discussing a plan in our next team meeting.

What would be your time frame with this project? Would you want to tackle this immediately or would starting out in the new year be fine?

bkille commented 2 years ago

New year is definitely fine, not in a terrible rush.

For some context wrt C++, I worked on implementing this C++ proposal w/ Vincent Reverdy (a member of France's C++ ISO committee).

The requirements for the project required that each group of algorithms in <algorithms> be in a separate file. Some notable files are: https://github.com/vreverdy/bit-algorithms/blob/master/include/bit_algorithm_details.hpp https://github.com/bkille/bit-algorithms/blob/rai-overloads/include/shift.hpp https://github.com/vreverdy/bit-algorithms/blob/master/include/reverse.hpp

seqan / seqan3

[IO] Parsing for genome alignment formats (MAF and HAL) #2903