Fixing this to handle position-sorted BAM/SAM files will require an approach similar to #72 and keeping an in-memory index of read headers.
An easier solution is to detect whether the file is positionally sorted and, if so, raise an error informing the user that a name-sorted (or unsorted) file is required.
According to this biostars question, the @HD header line (via its SO tag) can provide this info.
From a local test:
% head -n 1 texpected-unsorted.sam
@SQ SN:122_DCM_0d2-0d45_scaffold97490_1_gene122702 LN:30803
% head -n 2 texpected-pos_sorted.sam
@HD VN:1.3 SO:coordinate
@SQ SN:122_DCM_0d2-0d45_scaffold97490_1_gene122702 LN:30803
% head -n 2 texpected-name_sorted.sam
@HD VN:1.3 SO:queryname
@SQ SN:122_DCM_0d2-0d45_scaffold97490_1_gene122702 LN:30803
The question above also mentions:
but it seems this field isn't always present in unsorted files.
Regardless, failing in the case of SO:coordinate safeguards the user from erroneous results.
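
A rough sketch of the check described above, assuming the header lines of a plain-text SAM file are already available as strings (a BAM file would first need decoding with something like samtools or pysam, which is not shown here). The function names `sam_sort_order` and `reject_coordinate_sorted` are hypothetical and not part of the existing code base. Fields are split on any whitespace so the check works whether the header is tab-delimited, as in a real SAM file, or space-delimited, as in the pasted output above.

```python
def sam_sort_order(header_lines):
    """Return the SO value from the @HD line, or None if it is absent.

    Hypothetical helper: scans only the leading '@' header lines.
    """
    for line in header_lines:
        if not line.startswith("@"):
            break  # header section is over, alignments begin here
        if line.startswith("@HD"):
            # Skip the record type and look for the SO:<value> tag.
            for field in line.split()[1:]:
                if field.startswith("SO:"):
                    return field[len("SO:"):]
            return None  # @HD present but carries no SO tag
    return None  # no @HD line at all (seen in the unsorted test file)


def reject_coordinate_sorted(header_lines):
    """Raise ValueError if the header declares coordinate sorting."""
    if sam_sort_order(header_lines) == "coordinate":
        raise ValueError(
            "input is position sorted (SO:coordinate); "
            "a name sorted or unsorted file is required"
        )
```

Because a missing @HD line or SO tag yields None, such files pass through unchanged; only an explicit SO:coordinate triggers the error, which matches the conservative behaviour proposed above.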