seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
406 stars 81 forks source link

[IO] Make `seqan3::sam_file_header::program_info_t` easier to copy #3137

Closed tsnorri closed 1 year ago

tsnorri commented 1 year ago

When processing alignments in such a way that both the input and the output are SAM/BAM files, one would typically like to append a @PG record to the SAM file headers and, to that end, copy the existing @PG records. Currently this is not very easy under some circumstances as sam_file_header::program_info_t depends on sam_file_header’s template parameter ref_ids_type (i.e. when the input and output files have a different type for ref_ids_type).

Suggested solutions include:

(I’m not sure what would be the best approach but once one is chosen, I could volunteer to contribute a patch.)

tsnorri commented 1 year ago

Here’s a suggestion:

smehringer commented 1 year ago

Hi @tsnorri,

thanks for reaching out and sorry for the delay.

You are absolutely right. This copy should be way easier. Let me investigate a little and then get back to you. Thanks in advance for volunteering to help!

This copy (for anyone following the thread):

#include <seqan3/io/sam_file/all.hpp>

int main()
{
    seqan3::sam_file_input in{"in.sam"};
    seqan3::sam_file_output out{"out.sam"};

    auto it = in.begin(); // cache first record which also reads the header

    out.header().program_infos = in.header().program_infos; // error :(
}

https://godbolt.org/z/3WrKsMMsb

smehringer commented 1 year ago

Hi @tsnorri,

after a quick dive into the code, we agree with your suggestion.

The tasks would be:

It would be great if you can tackle this. Otherwise just drop a note that you have no time and we can work on this.

Best, Svenja

tsnorri commented 1 year ago

Hi @smehringer and sorry about the delay! I finally had some time to fix the issue; the pull request above should have everything needed.

Best, Tuukka

tsnorri commented 1 year ago

Actually scratch that. I just realised that the test I wrote did not actually test the intended functionality. I’ll re-refine it.

tsnorri commented 1 year ago

I had noticed earlier but forgotten that the solution to this may not be the one that I first thought was the obvious one. If ref_ids_type in the destination is ref_info_not_given, then calling header() will not work, which does make sense. Consequently the easiest way to append a program_info_t is to copy the source header, modify it, and finally reset the header pointer in the records. I think that worked even without decoupling program_info_t, though? Modifying the header object of the output file seemed to be the intended approach at first but I guess either is fine.

In any case, now program_info_t does not depend on sam_file_header’s template parameter and there is a test that checks that copying is possible.