ngless-toolkit / ngless

NGLess: NGS with less work
https://ngless.embl.de

mapstats() should error with position sorted sam/bam files #80

Open unode opened 6 years ago

unode commented 6 years ago

Handling position-sorted BAM/SAM files properly would require an approach similar to #72, keeping an in-memory index of read headers.

An easier solution is to detect whether the file is position-sorted and fail with an error telling the user that a name-sorted (or unsorted) file is required. According to this Biostars question, the SO tag of the @HD header line provides this information.

From a local test:

% head -n 1 texpected-unsorted.sam
@SQ SN:122_DCM_0d2-0d45_scaffold97490_1_gene122702  LN:30803

% head -n 2 texpected-pos_sorted.sam 
@HD VN:1.3  SO:coordinate
@SQ SN:122_DCM_0d2-0d45_scaffold97490_1_gene122702  LN:30803

% head -n 2 texpected-name_sorted.sam 
@HD VN:1.3  SO:queryname
@SQ SN:122_DCM_0d2-0d45_scaffold97490_1_gene122702  LN:30803

The question above also mentions:

% samtools view -H 5_110118_FC62VT6AAXX-hg18-unsort.bam
@HD    VN:1.0    SO:unsorted

but, as the local test above shows, this field isn't always present in unsorted files.

Regardless, failing in the case of SO:coordinate safeguards the user from erroneous results.
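
To make the proposed check concrete, here is a minimal sketch in Python (not NGLess code, and not necessarily how the fix would be implemented). The function names `sort_order` and `require_not_coordinate_sorted` are hypothetical. It assumes the header lines have already been read as text and that fields are tab-separated, as in real SAM files.

```python
def sort_order(header_lines):
    """Return the SO value of the @HD line, or None if there is none.

    `header_lines` is any iterable of text lines from the start of a SAM
    file (hypothetical input format chosen for this sketch).
    """
    for line in header_lines:
        if not line.startswith("@"):
            break  # first alignment record reached: the header is over
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] != "@HD":
            continue
        for field in fields[1:]:
            if field.startswith("SO:"):
                return field[len("SO:"):]
        return None  # @HD present but without an SO tag
    return None  # no @HD line at all (seen in some unsorted files)


def require_not_coordinate_sorted(header_lines):
    """Raise if the header declares coordinate (position) sorting."""
    if sort_order(header_lines) == "coordinate":
        raise ValueError(
            "input is position-sorted (SO:coordinate); "
            "a name-sorted or unsorted file is required"
        )


# Usage with a header like the position-sorted test file above:
header = ["@HD\tVN:1.3\tSO:coordinate\n", "@SQ\tSN:example\tLN:30803\n"]
try:
    require_not_coordinate_sorted(header)
except ValueError as err:
    print(f"error: {err}")
```

A missing @HD line or a missing SO tag is deliberately treated as acceptable here, since only SO:coordinate is known to produce wrong results.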