seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
396 stars 81 forks source link

sam tags functions #3236

Closed notestaff closed 2 months ago

notestaff commented 4 months ago

1) How do I test whether a BAM record has a given tag? 2) What happens if a tag value type in a BAM record does not match the expected tag value type configured by specializing sam_tag_type<>? (Both if this happens when reading a tag value, and if this happens when writing a tag value.) 3) Is there a way to iterate over all tags in a BAM record? Thanks for help! @smehringer

eseiler commented 4 months ago
  1. How do I test whether a BAM record has a given tag?

The tag_dictionary is a std::map, so you can use the things like contains().

#include <seqan3/io/sam_file/sam_tag_dictionary.hpp>

using namespace seqan3::literals;

template <>
struct seqan3::sam_tag_type<"XX"_tag>
{
    using type = int32_t;
};

int main()
{
    seqan3::sam_tag_dictionary dict{};
    dict.get<"XX"_tag>() = 3;

    std::cout << std::boolalpha;
    // User-defined tag
    std::cout << "XX tag present: " << dict.contains("XX"_tag) << '\n';
    // Standard tag
    std::cout << "NM tag present: " << dict.contains("NM"_tag) << '\n';
    // Unkown tag
    std::cout << "ZZ tag present: " << dict.contains("ZZ"_tag) << '\n';
}

// Output:
// XX tag present: true
// NM tag present: false
// ZZ tag present: false
  1. What happens if a tag value type in a BAM record does not match the expected tag value type configured by specializing sam_tag_type<>? (Both if this happens when reading a tag value, and if this happens when writing a tag value.)

When writing it, it should be hard compiler error (unless implicit conversion happens?)

#include <seqan3/io/sam_file/sam_tag_dictionary.hpp>

using namespace seqan3::literals;

template <>
struct seqan3::sam_tag_type<"XX"_tag>
{
    using type = int32_t;
};

int main()
{
    seqan3::sam_tag_dictionary dict{};

    dict.get<"XX"_tag>() = "hello"; // error: invalid conversion from ‘const char*’ to ‘int’
}

I haven't tested it, but when reading, the value is in binary, and the given tag-type describes how to parse it. So in general, there is not much to check.

We do check invalid values, e.g., invalid casts, or overflow. For the type H, we also check that there is an even number of bytes because that's what the spec requires.

We also check the format of the tags, i.e., [TAG][TYPE_ID][VALUE], where TAG are two chars, TYPE_ID is one char.

  1. Is there a way to iterate over all tags in a BAM record? Thanks for help! @smehringer

You can iterate over the dictionary:

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/sam_tag_dictionary.hpp>

std::string string_from_tag(uint16_t const tag)
{
    // Tag is encoded as char1 * 128 + char2
    const char char_1 = static_cast<char>((tag >> 8) & 0x00FF);
    const char char_2 = static_cast<char>(tag & 0x00FF);
    return {char_1, char_2};
}

int main()
{
    using namespace seqan3::literals;

    seqan3::sam_tag_dictionary dict{};

    dict.get<"NM"_tag>() = 3;
    dict.get<"CO"_tag>() = "helloworld";

    // sam_tag_dictionary is a std::map<uint16_t, std::variant<...>>
    for (auto & [tag, value] : dict)
    {
        seqan3::debug_stream << string_from_tag(tag) << ' ' << value << '\n';
    }
}

// Output:
// CO helloworld
// NM 3

And to do something with value, you would need to use std::visit. The debug_stream provides an overload for printing variants. For a custom std::visit it depends on what you want to do with the tag value.

notestaff commented 4 months ago

Thanks a lot for the fast and detailed answer!

when reading, the value is in binary, and the given tag-type describes how to parse it

By "given tag-type" you mean the type read from the BAM, right? What happens if that type differs from the type configured for that tag by specializing seqan3::sam_tag_type?

eseiler commented 4 months ago

So, for example, when you have a tag XX as integer, write a BAM file; then the type of XX, and read the BAM file? Or there just happens to be a tag with sam_tag_type specialization, but it originates from another application?

When reading, XX will still be parsed as integer and stored as such.

If you use sam_tag_dictionary::get for access to XX, you'll get a std::bad_variant_access, because you are requesting the new XX type, but an integer was parsed.

If you use the dictionary like a map, and access XX (dict["XX"_tag]/dict.at("XX"_tag)), you will get a variant that currently holds an int.

Outside sam_tag_dictionary::get, sam_tag_type is not used.

So, in summary:

After all, multiple applications could use the same user-defined tag but store different things in it.

rrahn commented 2 months ago

Hi @notestaff,

I hope all your questions have been addressed. I will be closing this issue for now to keep our issue tracker organized.

However, if you feel that your questions have not been fully answered or if you have further queries related to this topic, please feel free to reopen this issue. We are more than happy to continue the discussion.