pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
774 stars 274 forks source link

AlignmentHeader.from_dict() silently discards the TP attribute from SQ lines #1237

Open tfenne opened 11 months ago

tfenne commented 11 months ago

The TP attribute in sequence lines in the SAM header is used to encode whether a reference sequence is linear or circular. Unfortunately AlignmentHeader.from_dict() silently removes this tag from the data:

> header = { "SQ": [{"SN": "some-plasmid", "LN": 5000, "TP": "circular"}] }
> AlignmentHeader.from_dict(header).to_dict()
OrderedDict([('SQ', [{'SN': 'some-plasmid', 'LN': 5000}])])
jmarshall commented 11 months ago

I really dislike this “I only believe in fields I know about” aspect of this part of pysam and would rather remove it. Those tables of known attributes are useful to return LN as int and so on, but IMHO unknown-at-the-time attributes should be preserved and default to str instead of being dropped.

Perhaps that's a quick fix too, but in the meantime PR #1238 would be a good stopgap.

tfenne commented 11 months ago

I totally agree @jmarshall.

Quoting the spec on the header (emphasis in the second paragraph mine):

1.3 The header section Each header line begins with the character ‘@’ followed by one of the two-letter header record type codes defined in this section. In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-character string that defines the format and content of VALUE. Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. Within each (non-@CO) header line, no field tag may appear more than once and the order in which the fields appear is not significant.

The following table describes the header record types that may be used and their predefined tags. Tags listed with ‘*’ are required; e.g., every @SQ header line must have SN and LN fields. As with alignment optional fields (see Section 1.5), you can freely add new tags for further data fields. Tags containing lowercase letters are reserved for local use and will not be formally defined in any future version of this specification.

jmarshall commented 10 months ago

Yes, I remember writing that text :smile: — though it was longer ago than I expected, in 2015.

It turns out that KNOWN_HEADER_FIELDS as used by to_dict() does default to str. However from_dict() uses VALID_HEADER_ORDER to constrain which non-lowercase fields it will process. So I've merged PR #1238 to improve this in the short term.

However in the medium term, I think the right thing to do for these conversions will be to rewrite this code using the post-1.10 HTSlib header API instead. At that point, I think most of this code that enumerates known tags will naturally disappear.

If you don't mind, I'll leave this issue open to represent that work.