Open jmarshall opened 2 years ago
What if it just uses a .dict file by default when present? while .hdr is more flexible, .dict files are already in use.
The contents (a bunch of SAM headers) would be much the same, just the filename would be different. I considered using the existing.dict
convention for this, but there are several reasons not to:
GRCh38.fa
, the usual name for a corresponding dict file would be GRCh38.dict
but all the rest of bwa's index files are of the form GRCh38.fa.xyz
. (This is the most compelling reason.)@SQ
and maybe @HD
headers. The intention here is that it may contain any header — there would be use cases for adding constant @PG
or @CO
headers to your mapping output..hdr
. If the filename were .dict
, users might infer that it was used by bwa as a dictionary to do lookups of some kind.(You can always use bwa mem -H foo.dict …
to use a dict file with whatever filename it might have.)
Are there types of lines that should be disallowed? In (1), it looks like the intention is that the file should live statically alongside the other bwa files. So RG, and perhaps even PG, should not be static. RG is obvious as to why, but PG being in the file would be misleading if we use a newer version of bwa with older and compatible index folds.
Given .dict files are found a lot in the wild, and can be generated by multiple tools, how about they get used if the .hdr file does not exist? I find having yet another auxiliary file I have to maintain when .dict already exists confusing and cumbersome. If folks want a custom header, they can just specify it similar to your .dict example with the -H option. And is this file generated by bwa, then hand modified to add additional lines, or even add AH fields to SQ lines? If so, then should it really be codified as a bwa generated static file? I don’t think so.
.alt
file, it is prepared separately rather than by bwa index
. (It might be worth adding code so that bwa index
would generate a starting point for this file, but as tools like CreateSequenceDictionary
and samtools dict
already do this there's not much need.)@PG ID:bwa …
line containing its version number. If you added some static @PG
lines to this file, it would be to represent other early-stage processing such as bcl2fastq. (This is indeed a bit esoteric; @SQ
will be what is common in this file.)bwa mem -H /path/to/ref.fa.hdr …
every time they run BWA to do mapping (mem
users anyway; samse
/sampe
users do not have such an option). But the point of this PR is to make it easy: to make decorated @SQ
lines something that can be set up once when you set up the index.SP
, AS
, M5
, TP
, and AN
fields to the @SQ
lines. Especially the latter are not fully automatable, so will require hand-editing from some starting point.AH
fields corresponding to the .alt
file would be painful. This is why I have a followup samtools PR coming, which will add an option to samtools dict
to read the .alt
file and emit AH
appropriately. (And this is the real reason why this starting point is best prepared by an external dict tool: the .alt
file may not exist yet when bwa index
is run, so that's really too soon to create this starting point.)This file is a dict file, just with a slightly different purpose, so the different name adds some flexibility. But if @lh3 prefers, I would add the code to strip a suffix and add .dict
instead of just adding .hdr
to …ref.fa
in the same way that other index file names are constructed.
For the time being, maybe let's stop quibbling about the filename so that we can hear the BWA maintainer's thoughts about the proposed feature :smile:
@jmarshall open to whatever works here
@nh13: I've updated it to accept either:— hence it reads headers from <prefix>.hdr
if present, otherwise reads headers from <baseprefix>.dict
if that is present; (otherwise does the default thing).
The string munching required to get from foo.fa[sta][.gz]
to foo.dict
was not much fun to write :smile: but you are right that accepting that filename convention as well is the best thing to do.
Introduce a
<prefix>.hdr
file that the sysadmin or user should place alongside the other index files, useful for setting up detailed@SQ
headers etc to be used by default when mapping.The idea is to make it easy to have rich
@SQ
headers (and other headers if desired) in bwa mapped output. Rather than needing to use-H
every time, you can set up a<prefix>.hdr
file at the same time as you generate the bwa index. In particular, this can be used to add descriptive fields such as MD5, species, assembly, and alternative names (which enables querying for e.g.chr1
or1
interchangeably; see samtools/hts-specs#100 and samtools/htslib#931).This PR adds to the
mem
/samse
/sampe
commands so that they output SAM headers from:-H
options;.hdr
file stored with the index files;@HD
/@SQ
default headers.An
@HD
header is always printed, with an@HD
line from-H
overriding one from<prefix>.hdr
(if any), which overrides the default one added since #336. A set of@SQ
headers is always printed, with-H
options overriding<prefix>.hdr
overriding the default, similarly.Other headers specified in either
-H
or<prefix>.hdr
are all printed.If
-H
includes@SQ
headers, we consider that the user has carefully set up all the headers to be output and ignore<prefix>.hdr
entirely.Also adds a brief description of
<prefix>.alt
and a full description of<prefix>.hdr
to the manual page.