Alignment of sequences consisting of arbitrary 64 bit integers

paoloshasta commented 4 months ago

I am trying to use Seqan3 to align two sequences consisting of arbitrary 64-bit integers. That is, the alphabet consists of all possible 2^64 64-bit integers. I was able to do this without problems in Seqan2, but after digging in the Seqan3 documentation I was not able to find out how to do this. Can you point me to an example or give me some ideas on how to do this? Your seqan3::custom::alphabet does not seem to do the job, and it does not support 64-bit integers anyway. I know I somehow need to define a new alphabet, but the documentation to do this is a bit opaque and seems to be geared towards small alphabets.

This is needed for the Shasta assembler, in which each read is represented as a sequence of markers. See here for more information.

paoloshasta commented 4 months ago

I got it to the point of having a class Alphabet64 that passes the following static assertions:

static_assert(seqan3::semialphabet<Alphabet64>);
static_assert(seqan3::writable_semialphabet<Alphabet64>);
static_assert(seqan3::alphabet<Alphabet64>);

But I have so far been unable to get it to pass this one:

static_assert(seqan3::writable_alphabet<Alphabet64>);

I am getting the following error messages (simplified for clarity) :

error: static assertion failed
static_assert(seqan3::writable_alphabet<Alphabet64>);
note: constraints not satisfied
required by the constraints of ‘template<class t> concept seqan3::writable_alphabet’
in requirements with ‘t v’, ‘seqan3::alphabet_char_t<alphabet_t> c’ [with t = Alphabet64] the required expression ‘seqan3::assign_char_to(c, v)’ is invalid
seqan3::assign_char_to(c, v)

My Alphabet64 class looks like this so far (I omitted the comparison operators for brevity):

class Alphabet64 {
public:
    uint64_t i;
    using alphabet_char_t = uint64_t;
    static constexpr size_t alphabet_size = std::numeric_limits<uint64_t>::max();

    // Comparison operators here, omitted for brevity.    

    uint64_t to_rank() const
    {
        return i;
    }

    uint64_t to_char() const
    {
        return i;
    }

    Alphabet64& assign_rank(uint64_t const j) noexcept
    {
        i = j;
        return *this;
    }    
};

Please help! Perhaps add a complete example to the documentation showing an alignment of sequences consisting of arbitrary integers?

paoloshasta commented 4 months ago

After adding member function assign_char I was able to get that last assertion to work. But now when I try to do an alignment of two sequences of this alphabet I get the following:

error: template constraint failure for ‘template<class ... alternative_types>  requires ((writable_constexpr_alphabet<alternative_types> && ...)) && ((regular<alternative_types> && ...)) && sizeof ... (alternative_types ...) >= 2 class seqan3::alphabet_variant’
   38 | using gapped = alphabet_variant<alphabet_t, gap>;

eseiler commented 4 months ago

Hey there,

Did you find https://docs.seqan.de/seqan3/main_user/howto_write_an_alphabet.html ?

From a quick glance, semialphabet is not enough because the alignment uses gapped sequences, which require an alphabet.

Unlike other unsigned integer types, uint64_t is not added as alphabet: https://github.com/seqan/seqan3/blob/main/include/seqan3/alphabet/adaptation/uint.hpp

The header names the first reason: There is no char64_t.

However, uint64_t also cannot be a semialphabet because the alphabet size cannot be expressed with fundamental types. It would be std::numeric_limits<uint64_t>::max + 1, which is one more than 64 bit can represent :)

The (semi)alphabet problem with uint64_t doesn't seem trivial to solve, and we'll have to look into it.

Regarding alignment, maybe @rrahn has some idea.

paoloshasta commented 4 months ago

For my purposes it would be sufficient if we can get this to work for 63-bit integers represented as 64-bit integers with the most significant bit set to zero. The size of this alphabet is 2^63 and so is representable as a 64-bit integer.

I did peruse the portion of the documentation you linked to as well as the API reference documentation.

paoloshasta commented 4 months ago

Ok, so now I have an alphabet class that can represent all 63-bit integers and passes all these four static assertions:

static_assert(seqan3::semialphabet<Alphabet63>);
static_assert(seqan3::writable_semialphabet<Alphabet63>);
static_assert(seqan3::alphabet<Alphabet63>);
static_assert(seqan3::writable_alphabet<Alphabet63>);

However, when I try to compute an alignment with this, the compiler complains about the scoring scheme:

seqan3/alignment/pairwise/alignment_configurator.hpp:306:73: error: static assertion failed: Alignment configuration error: Either the scoring scheme was not configured or the given scoring scheme cannot be invoked with the value types of the passed sequences.
  306 |         static_assert(alignment_contract_t::expects_valid_scoring_scheme(),

I am using nucleotide_scoring_scheme, but I probably need t define a custom scoring scheme for this alphabet. Can you point me to documentation sections that could help me with that?

rrahn commented 4 months ago

Hi @paoloshasta,

any class modelling the seqan3::scoring_scheme_for concept should do it. Basically, what it expects is a class with a public type definition with the name score_type that specifies the type of the score returned and a score function that can be invoked in your case with two instances of the Alphabet63 type.

paoloshasta commented 4 months ago

Thank you, that got me one step ahead. I have a scoring scheme that passes this assertion, and with that that problem is gone.

static_assert(seqan3::scoring_scheme_for<Alphabet63ScoringScheme, Alphabet63>);

But now I bump into a new problem:

gapped.hpp:38:7: error: template constraint failure for ‘template<class ... alternative_types>  requires (writable_constexpr_alphabet<alternative_types> && ...)) && ((regular<alternative_types> && ...)) && sizeof ... (alternative_types ...) >= 2 class seqan3::alphabet_variant’
   38 | using gapped = alphabet_variant<alphabet_t, gap>;
      |       ^~~~~~
gapped.hpp:38:7: note: constraints not satisfied

Do I also need to define a new gap class? I looked at the documentation under Gap and alphabet_variant but it is not clear to me what I need to define versus what is provided by Seqan3. And the documentation on alignments does not say anything about this, as far as I can tell.

paoloshasta commented 4 months ago

I could not find anything in the documentation about writable_constexpr_alphabet.

eseiler commented 4 months ago

I don't think it's currently possible to define an alphabet that would work.

Putting the problems with defining uint64_t as an alphabet, the alignment also has some constraints:

It will use seqan3::gapped for the aligned sequences
For some type type, seqan3::gapped<type> is an alias for seqan3::alphabet_variant<type, seqan3::gap>.
seqan3::alphabet_variant requires a character type of char - this is not really a problem for getting it to compile though
seqan3::alphabet_variant will also generate lookup tables for each rank
For uint64_t, there are 2^64 ranks. So no table. This does barely work for 16-bit integers.

So it seems like the alignment only handles small alphabets (up to char). (ping @rrahn)

Regarding your use case, from what I understand, you take a random set of k-mers, each of which gets a rank. For 8000 k-mers that results in an alphabet of size 8000. You then convert your input sequences to a sequence of k-mer ranks (markers). And then you want to align those sequences, which would, for example, be represented by std::vector<uint16_t>. And since your parameters are configurable and your algorithm allows it, it does also work with ranks that are encoded by 64 bit integers.

Does that sound about right?

Edit:

I have one question though. When does the 64 bit alphabet come into play? When converting the original sequence to a marker representation, or when later aligning sequences?

paoloshasta commented 4 months ago

That description of a typical 8000 symbol alphabet size is obsolete and refers to sequencing technology from 5 years ago. Today, the reads are much more accurate and therefore the markers are much longer. They are typically 30 bases in length and without RLE (Run-Length-Encoding), which means that the total number of possible marker k-mers is 4^30 = 2^60. And they will probably need to be longer in the future, which means that I will probably have to switch to 128-bit integers soon. But even if I consider only the distinct marker k-mers that appear in a given assembly, as you suggest, their total number is still very high - high enough that it is not feasible to store a lookup table. In addition, using a lookup table, even if it were possible, would incur a cache miss penalty every time a marker k-mer has to be converted to sequence. Instead, with the current approach that operation is trivial and requires just some bit operations which are very fast.

Regarding your question: when a read is converted to a marker representation, it becomes a sequence of integers (60-bit integers if using marker length 30). Later, when I want to align two reads, instead of aligning their base sequences I align their marker k-mers. Those sequences are typically 20 times shorter and so the alignment is much faster, and in addition the alphabet is much longer than the original 4-symbol alphabet, which means that identical symbols rarely appear in the two reads outside of their actual alignment. This reduces alignment uncertainty, especially in repeat-rich regions.

It is unfortunate that Seqan3 is not able to handle such a large alphabet. I had no problem doing this with SeqAn2: I just add 100 to the integers being aligned, and so there is no possibility of collision with the value (45) used by Seqan2 to represent gaps. And there is no possibility of overflow if all of my integers are actually 60 bits long. However I recently discovered bugs in Seqan2 banded alignments, which cannot be fixed because that code is no longer being maintained. Hence my decision to explore Seqan3.

I may have to look for another solution or write some custom code.

eseiler commented 4 months ago

Thanks for clarifying!

But even if I consider only the distinct marker k-mers that appear in a given assembly, as you suggest, their total number is still very high - high enough that it is not feasible to store a lookup table.

Well, it wouldn't really work (well) anyway for any alphabet with more than 256 letters.

In addition, using a lookup table, even if it were possible, would incur a cache miss penalty every time a marker k-mer has to be converted to sequence. Instead, with the current approach that operation is trivial and requires just some bit operations which are very fast.

Yes, that's another reason. It's only really designed for small alphabets.

It is unfortunate that Seqan3 is not able to handle such a large alphabet.

There will be some modularization in the future, including, as far as I know, alphabet-free alignment. Not sure if it would work for your use case, but it should be way more flexible than seqan3's alginment code.

However I recently discovered bugs in Seqan2 banded alignments, which cannot be fixed because that code is no longer being maintained. Hence my decision to explore Seqan3.

We do maintain Seqan2. We do bug fixes and newer compiler support. However, we don't develop new things for Seqan2 (feature freeze). So feel free to file an issue in Seqan2!

paoloshasta commented 4 months ago

Great, I will do that. Please confirm that this is the correct repository:

https://github.com/seqan/seqan

eseiler commented 4 months ago

Great, I will do that. Please confirm that this is the correct repository:

https://github.com/seqan/seqan

It is

paoloshasta commented 4 months ago

Great. I will soon file an issue on that repository, attaching a simple test program that reproduces the problem.

Thank you for your prompt and competent help on this.

seqan / seqan3

Alignment of sequences consisting of arbitrary 64 bit integers #3271