Open ekg opened 5 years ago
Having a hook for a predefined SQ order would save on having to run ReorderSam
which is expensive for whole genome BAM files.
The hook would need to interact with and determine the order of the loop here: https://github.com/vgteam/vg/blob/master/src/alignment.cpp#L101-L103
Does ReorderSam just read the FASTSA (or its index?) to determine the correct order of contigs in the header? Is it convenient to have the original FASTA around for this? Or do we want to take a list on the command line instead?
ReorderSam uses a .dict
file which is just an order of SQs for each contig as they would appear in the SAM file. (https://gatk.broadinstitute.org/hc/en-us/articles/360035531652-FASTA-Reference-genome-format)
example:
[markellocj@biowulf construct_new_snp1kg_ref]$ head /data/markellocj/fasta_references/hs37d5_reference/hs37d5.dict
@HD VN:1.5
@SQ SN:1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:3 LN:198022430 M5:fdfd811849cc2fadebc929bb925902e5 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:5 LN:180915260 M5:0740173db9ffd264d728f32784845cd7 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:6 LN:171115067 M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:7 LN:159138663 M5:618366e953d6aaad97dbe4777c29375e UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:8 LN:146364022 M5:96f514a9929e410c6651697bded59aec UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:9 LN:141213431 M5:3e273117f15e0a400f01055d9f393768 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
I just noticed that vg is producing headers that mix up the order of the sequences in the FASTA file.
Some tools seem to want a fixed order in this header and in the backing FASTA reference. When we go through vg for mapping, we don't retain that order anywhere, so the output header comes out in the textual sorted order of the sequence names.
The fix is to apply Picard ReorderSam.