seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
398 stars 81 forks source link

[IO] Is there a neat way to pass only selected fields to SAM output? #2436

Open tloka opened 3 years ago

tloka commented 3 years ago

Platform

Question

Hey all,

I am struggling a bit with properly writing unmapped reads to my output alignment files (which is a common feature I would like to have integrated into my software).

I understand that I can write records using emplace_back(...) and specifying all fields. I also know that it is in principle possible to provide less arguments to this function as long as they are not required. So, if seqan3::field::id was the first field specified for the sam output, I could call emplace_back("read1") which would set the ID only and leave all other fields unset. However, there is no way to do the same if the id was the last element in the field list, so if you for example have (minimal example):

seqan3::sam_file_output<
   seqan3::fields<
      seqan3::field::seq,
      seqan3::field::qual,
      seqan3::field::cigar,
      seqan3::field::mapq,
      seqan3::field::ref_id,
      seqan3::field::ref_offset,
      seqan3::field::id>,
   seqan3::type_list<seqan3::format_sam, seqan3::format_bam>,
   std::vector<std::string>> sam_out(/*Init parameters; not important here*/);

it is no longer possible to create an entry having seqan3::field::id set only. Even when setting all other fields manually to default values, SeqAn would throw an error when trying to set seqan3::field::ref_id to "" or "*" as it is not contained in the list of references.

This behavior currently makes it complicated for me to have mixed files of mapped and unmapped reads, where in the latter case I can't provide a valid ref_id. Only workaround for me is currently to change the order of fields, but I am afraid this could in future lead to incompatibility if I need to mix even more combinations of (un-)reported fields.

Thus, my question is if I have overseen a (neat) way to fill the same seqan3::sam_file_output object with entries using different selections of fields?

If no, it would be great if there was a possibility to have some function that allows to speficy the fields to set, e.g. something like this:

sam_out.emplace_back_selected<seqan3::fields<seqan3::field::id>>("Read 1");

which would be valid as long as seqan3::field::id is part of the sam file output object.

Second option, there would be something like an unset value (or similar) to each of the fields that can be used to set fields that are not available to default value and at the same time are not caught by validation steps as, for example, with the reference ID:

sam_out.emplace_back(seqan3::field::ref_id::unset, "Read 1");
marehr commented 3 years ago

Hi tobias,

so if I understand you correctly, you want to switch between record-types (one mapped, one unmapped) in the output?

I don't know if I caught your use case, but do you want to do:

seqan3::sam_file_output<...> out{"./some_path"};

// write out a full record as you already do (field order shouldn't matter)
out.emplace_back(...);

// we just provide the id, everything else is default
using unmapped_read_record = seqan3::sam_record<seqan3::type_list<std::string>, seqan3::fields<seqan3::field::id>>;
unmapped_read_record unmapped{"READ 1"};

out.push_back(unmapped);

(UNTESTED, I still need to compile it :))

We are currently working on it to make the customisation via those fields obsolete. I'm currently pitching the following idea:

seqan3::sam_file_output out{"./some_path"}; // no template arguments anymore, as they are only needed for emplacing the data

// note: you don't specify ANY fields anymore.
seqan3::sam_record record{.id = "read1"}; // other fields have sensible default types and values
record.id = "read1"; // or later as member assignment
recod.id(); // to access the data. 

out.push_back(std::move(record)); // only set field read1
tloka commented 3 years ago

Hey Marcel,

thanks for your response! I think the first solution is actually catching my case, I wasn't aware that I can push records with a different set of specified fields. This should solve it for now, I will try it out and let you know. However, the template types of sam_record could get pretty long in some cases (as I need to specify a type_list and field_list), which is not a big issue, but a bit unpleasant.

Thus, the new planned solution sounds way more comfortable and would definitively be easier to use. I actually thought about asking whether a removal of the template arguments would be possible, but I wouldn't dare as I expect it to be a bunch of (structural) changes. I am really looking forward for this to be finished!

tloka commented 3 years ago

Hey,

thanks again for your answer, it compiled and worked as you proposed. However, especially when working with views / ranges to sequences, it is not trivial to declare the correct types in the type list. I actually only managed to make it work using decltype. Thus, i am happy for now that there is a working solution for arbitrary combinations of fields, but still I am looking forward to the new version with no need to declare any types!

Closed this thread, as the provided answer works.

marehr commented 3 years ago

(Reopened, as I think we should add documentation for this use case.)