vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 195 forks source link

vg surjection produces BAM headers with out of order sequence records #2069

Open ekg opened 5 years ago

ekg commented 5 years ago

I just noticed that vg is producing headers that mix up the order of the sequences in the FASTA file.

@HD     VN:1.3  SO:coordinate
@SQ     SN:1    LN:249250621
@SQ     SN:10   LN:135534747
@SQ     SN:11   LN:135006516
@SQ     SN:12   LN:133851895
@SQ     SN:13   LN:115169878
@SQ     SN:14   LN:107349540
@SQ     SN:15   LN:102531392
@SQ     SN:16   LN:90354753                                                                                                                                                                                          @SQ     SN:17   LN:81195210                                                                                                                                                                                          @SQ     SN:18   LN:78077248                                                                                                                                                                                          @SQ     SN:19   LN:59128983                                                                                                                                                                                          @SQ     SN:2    LN:243199373                                                                                                                                                                                         @SQ     SN:20   LN:63025520
@SQ     SN:21   LN:48129895
@SQ     SN:22   LN:51304566
@SQ     SN:3    LN:198022430
@SQ     SN:4    LN:191154276
@SQ     SN:5    LN:180915260                                                                                                                                                                                         @SQ     SN:6    LN:171115067
@SQ     SN:7    LN:159138663                                                                                                                                                                                         @SQ     SN:8    LN:146364022
@SQ     SN:9    LN:141213431

Some tools seem to want a fixed order in this header and in the backing FASTA reference. When we go through vg for mapping, we don't retain that order anywhere, so the output header comes out in the textual sorted order of the sequence names.

The fix is to apply Picard ReorderSam.

java -jar ~/graphs/bin/picard.jar ReorderSam I=x.baddest.calmd.bam R=/nfs/users/nfs_e/eg10/graphs/hs37d5.fa O=x.baddest.calmd.reorder.bam
cmarkello commented 4 years ago

Having a hook for a predefined SQ order would save on having to run ReorderSam which is expensive for whole genome BAM files. The hook would need to interact with and determine the order of the loop here: https://github.com/vgteam/vg/blob/master/src/alignment.cpp#L101-L103

adamnovak commented 4 years ago

Does ReorderSam just read the FASTSA (or its index?) to determine the correct order of contigs in the header? Is it convenient to have the original FASTA around for this? Or do we want to take a list on the command line instead?

cmarkello commented 4 years ago

ReorderSam uses a .dict file which is just an order of SQs for each contig as they would appear in the SAM file. (https://gatk.broadinstitute.org/hc/en-us/articles/360035531652-FASTA-Reference-genome-format)

example:

[markellocj@biowulf construct_new_snp1kg_ref]$ head /data/markellocj/fasta_references/hs37d5_reference/hs37d5.dict
@HD VN:1.5
@SQ SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:2    LN:243199373    M5:a0d9851da00400dec1098a9255ac712e UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:3    LN:198022430    M5:fdfd811849cc2fadebc929bb925902e5 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:4    LN:191154276    M5:23dccd106897542ad87d2765d28a19a1 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:5    LN:180915260    M5:0740173db9ffd264d728f32784845cd7 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:6    LN:171115067    M5:1d3a93a248d92a729ee764823acbbc6b UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:7    LN:159138663    M5:618366e953d6aaad97dbe4777c29375e UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:8    LN:146364022    M5:96f514a9929e410c6651697bded59aec UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa
@SQ SN:9    LN:141213431    M5:3e273117f15e0a400f01055d9f393768 UR:file:/gpfs/gsfs2/users/Udpbinfo/usr/markellocj/hs37d5.fa