Open tfenne opened 11 months ago
I really dislike this “I only believe in fields I know about” aspect of this part of pysam and would rather remove it. Those tables of known attributes are useful to return LN
as int
and so on, but IMHO unknown-at-the-time attributes should be preserved and default to str
instead of being dropped.
Perhaps that's a quick fix too, but in the meantime PR #1238 would be a good stopgap.
I totally agree @jmarshall.
Quoting the spec on the header (emphasis in the second paragraph mine):
1.3 The header section Each header line begins with the character ‘@’ followed by one of the two-letter header record type codes defined in this section. In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-character string that defines the format and content of VALUE. Thus header lines match
/^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/
or/^@CO\t.*/
. Within each (non-@CO) header line, no field tag may appear more than once and the order in which the fields appear is not significant.The following table describes the header record types that may be used and their predefined tags. Tags listed with ‘*’ are required; e.g., every @SQ header line must have SN and LN fields. As with alignment optional fields (see Section 1.5), you can freely add new tags for further data fields. Tags containing lowercase letters are reserved for local use and will not be formally defined in any future version of this specification.
Yes, I remember writing that text :smile: — though it was longer ago than I expected, in 2015.
It turns out that KNOWN_HEADER_FIELDS
as used by to_dict()
does default to str
. However from_dict()
uses VALID_HEADER_ORDER
to constrain which non-lowercase fields it will process. So I've merged PR #1238 to improve this in the short term.
However in the medium term, I think the right thing to do for these conversions will be to rewrite this code using the post-1.10 HTSlib header API instead. At that point, I think most of this code that enumerates known tags will naturally disappear.
If you don't mind, I'll leave this issue open to represent that work.
The
TP
attribute in sequence lines in the SAM header is used to encode whether a reference sequence is linear or circular. UnfortunatelyAlignmentHeader.from_dict()
silently removes this tag from the data: