ngless-toolkit / ngless

NGLess: NGS with less work
https://ngless.embl.de
Other
142 stars 24 forks source link

Filtering a SAM/BAM file in a select() block inverts samline order #105

Closed unode closed 5 years ago

unode commented 5 years ago

Given a sam file:

@SQ SN:1_2_gene2051 LN:1547
@SQ SN:626939.HMPREF9443_01578  LN:1547
@SQ SN:4-12192_1_gene91203  LN:1547
ER9 73  4-12192_1_gene91203 1504    0   44M23S  =   1504    0   AGTTAAAATCCGTTTAAAACGTATGGGCGCTAAAAAAGCTCCTTTCTATCGTATTGTTGTTGCAGAC HHGIIJIIGGGHF@HIJJJIGDHGG>;@DFGHGGIJIIGCEGHHFECHF@DEDC@C>CB=AAC>;:A NM:i:0  MD:Z:44 AS:i:44 XS:i:44 SA:Z:g944564.1178,1508,+,28S39M,0,2;
ER9 377 1_2_gene2051    1   0   23H44M  =   1   0   *   *   NM:i:0  MD:Z:44 AS:i:44
ER9 377 626939.HMPREF9443_01578 1   0   23H44M  =   1   0   *   *   NM:i:0  MD:Z:44 AS:i:44

processed with:

ngless "0.10"

mapped = samfile('reinject.sam')
mappedBlock = select(mapped) using |mr|:
    mr = mr.filter(min_match_size=1, min_identity_pc=1, action={drop})
    if not mr.flag({mapped}):
        discard

write(mappedBlock, ofile="output-mapblock.sam")

mappedCall = select(mapped, keep_if=[{mapped}])

write(mappedCall, ofile="output-mapcall.sam")

the content of output-mapcall.sam is identical to the input:

% cat output-mapcall.sam 
@SQ SN:1_2_gene2051 LN:1547
@SQ SN:626939.HMPREF9443_01578  LN:1547
@SQ SN:4-12192_1_gene91203  LN:1547
ER9 73  4-12192_1_gene91203 1504    0   44M23S  =   1504    0   AGTTAAAATCCGTTTAAAACGTATGGGCGCTAAAAAAGCTCCTTTCTATCGTATTGTTGTTGCAGAC HHGIIJIIGGGHF@HIJJJIGDHGG>;@DFGHGGIJIIGCEGHHFECHF@DEDC@C>CB=AAC>;:A NM:i:0  MD:Z:44 AS:i:44 XS:i:44 SA:Z:g944564.1178,1508,+,28S39M,0,2;
ER9 377 1_2_gene2051    1   0   23H44M  =   1   0   *   *   NM:i:0  MD:Z:44 AS:i:44
ER9 377 626939.HMPREF9443_01578 1   0   23H44M  =   1   0   *   *   NM:i:0  MD:Z:44 AS:i:44

but the blocked output has its order inverted and the sequence ends up reapplied on the now first reported alignment.

% cat output-mapblock.sam 
@SQ SN:1_2_gene2051 LN:1547
@SQ SN:626939.HMPREF9443_01578  LN:1547
@SQ SN:4-12192_1_gene91203  LN:1547
ER9 377 626939.HMPREF9443_01578 1   0   23H44M  =   1   0   *   *   NM:i:0  MD:Z:44 AS:i:44
ER9 377 1_2_gene2051    1   0   23H44M  =   1   0   *   *   NM:i:0  MD:Z:44 AS:i:44
ER9 73  4-12192_1_gene91203 1504    0   44M23S  =   1504    0   AGTTAAAATCCGTTTAAAACGTATGGGCGCTAAAAAAGCTCCTTTCTATCGTATTGTTGTTGCAGAC HHGIIJIIGGGHF@HIJJJIGDHGG>;@DFGHGGIJIIGCEGHHFECHF@DEDC@C>CB=AAC>;:A NM:i:0  MD:Z:44 AS:i:44 XS:i:44 SA:Z:g944564.1178,1508,+,28S39M,0,2;