seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
412 stars 82 forks source link

Problems with sam_file_output #3299

Closed eseiler closed 2 weeks ago

eseiler commented 3 weeks ago

Error 1 was found and reported by @tloka (via mail).

#include <seqan3/io/sam_file/all.hpp>

template <int number>
void error()
{
    using sam_format_type_list_t = seqan3::type_list<seqan3::format_sam, seqan3::format_bam>;
    using sam_file_output_t = seqan3::sam_file_output<typename seqan3::sam_file_output<>::selected_field_ids,
                                                      sam_format_type_list_t,
                                                      std::vector<std::string>>;

    if constexpr (number == 1)
    {
        std::vector<sam_file_output_t> alignment_streams;
        std::vector<std::string> seq_names{"hello", "world"};
        std::vector<int> seq_lengths{1000, 2000};
        alignment_streams.emplace_back(std::ostringstream{}, seq_names, seq_lengths, seqan3::format_sam{});
    }

    if constexpr (number == 2 || number == 3)
    {
        std::vector<size_t> seq_lengths{1000, 2000};
        std::vector<std::string> seq_names{"hello", "world"};
        std::vector<sam_file_output_t> alignment_streams;
        if constexpr (number == 2)
        {
            alignment_streams.reserve(3);
        }

        for (size_t i = 0; i < 3; ++i)
        {
            alignment_streams.emplace_back(std::cout, seq_names, seq_lengths, seqan3::format_sam{});

            std::ranges::for_each(seq_names,
                                  [](std::string & str)
                                  {
                                      str += "foo";
                                  });
            std::ranges::for_each(seq_lengths,
                                  [](size_t & len)
                                  {
                                      ++len;
                                  });
        }
    }

    if constexpr (number == 4)
    {
        std::vector<std::string> seq_names{"hello", "world"};
        std::vector<int> seq_lengths{1000, 2000};
        sam_file_output_t fout1{std::ostringstream{}, seq_names, seq_lengths, seqan3::format_sam{}};
        sam_file_output_t fout2{std::move(fout1)};
    }
}

int main()
{
    std::cerr << "=== Error 1 ===\n";
    error<1>();
    std::cerr << "=== Error 2 ===\n";
    error<2>();
    std::cerr << "=== Error 3 ===\n";
    error<3>();
    std::cerr << "=== Error 4 ===\n";
    error<4>();
}

Background

1)

If no header has been written, the header is written on destruction

https://github.com/seqan/seqan3/blob/4d03890530089b040221876c9e368fc250c4583f/include/seqan3/io/sam_file/output.hpp#L156-L172

2)

The header might not own the reference IDs (here: seq_names):

https://github.com/seqan/seqan3/blob/4d03890530089b040221876c9e368fc250c4583f/include/seqan3/io/sam_file/header.hpp#L59-L70

Errors

Error 1 (Segfault)

Destructors are called in reverse order:

seq_lengths's contents are copied, so this is OK. When destroying alignment_streams, the destructor for each emplaced sam_file_output_t is called. The destructor wants to access the reference IDs. Because the constructor was called with an lvalue reference seq_names, it will try to access seq_names. seq_names has already been destroyed.

Error 2 (Unexpected output)

Expected:

@HD     VN:1.6
@SQ     SN:hellofoofoo  LN:1002
@SQ     SN:worldfoofoo  LN:2002
@HD     VN:1.6
@SQ     SN:hellofoo     LN:1001
@SQ     SN:worldfoo     LN:2001
@HD     VN:1.6
@SQ     SN:hello        LN:1000
@SQ     SN:world        LN:2000

Actual:

@HD     VN:1.6
@SQ     SN:hellofoofoofoo       LN:1002
@SQ     SN:worldfoofoofoo       LN:2002
@HD     VN:1.6
@SQ     SN:hellofoofoofoo       LN:1001
@SQ     SN:worldfoofoofoo       LN:2001
@HD     VN:1.6
@SQ     SN:hellofoofoofoo       LN:1000
@SQ     SN:worldfoofoofoo       LN:2000

Error 3 (Segfault)

Segfault happens on Clang. GCC gives assertion:

/usr/include/c++/14/bits/unique_ptr.h:448: typename std::add_lvalue_reference<_Tp>::type std::unique_ptr<_Tp, _Dp>::operator*() const 
[with _Tp = seqan3::sam_file_header<std::vector<std::__cxx11::basic_string<char> > >; 
      _Dp = std::default_delete<seqan3::sam_file_header<std::vector<std::__cxx11::basic_string<char> > > >; 
      typename std::add_lvalue_reference<_Tp>::type = seqan3::sam_file_header<std::vector<std::__cxx11::basic_string<char> > >&]: 
Assertion 'get() != pointer()' failed.

Without the alignment_streams.reserve, a reallocation happens. This will then use the move constructor.

Error 4 (Segfault)

Simplification of Error 3

After a move, the internal unique pointers primary_stream and secondary_stream are nullptr for the moved-from object. On destruction, this moved-from object then tries to write the header.

Solutions

For 1 and 2

Always taking ownership of the reference IDs:

    sam_file_header(ref_ids_type ref_ids) :
        ref_ids_ptr{new ref_ids_type{std::move(ref_ids)}, ref_ids_deleter_default}
    {}

From the user's side, error 1 can be worked around by transferring ownership:

alignment_streams.emplace_back(std::ostringstream{}, std::move(seq_names), seq_lengths, seqan3::format_sam{});
//                                                   ^^^^^^^^^

For 3 and 4

Since moved-from unique_ptrs are guaranteed to be set to nullptr, also check for non-null primary_stream before writing the header.