pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License

Feature proposal: SAM tag enum #1272

Open msto opened 5 months ago

msto commented 5 months ago

Hi,

I think it would be valuable to add two features to improve the use of SAM tags.

  1. An enum describing the standard SAM tags.
  2. A class decorator to enforce tag conventions when declaring locally-defined tags.

Would these features be welcomed into pysam?

I am happy to implement these but would appreciate feedback on whether this is a contribution that would be accepted into pysam, and if so, on some design considerations before starting.

Thank you!

SAM tag enum

My main question about a SAM tag enum is whether the member names should be the SAM tags themselves or more semantically meaningful names.

e.g.

from enum import Enum

class SamTag(str, Enum):
    """Standard SAM tags."""

    RG = "RG"
    """Read group."""

    RX = "RX"
    """Sequence bases of the (possibly corrected) unique molecular identifier."""

or

from enum import Enum

class SamTag(str, Enum):
    """Standard SAM tags."""

    READ_GROUP = "RG"
    """Read group."""

    UMI = "RX"
    """Sequence bases of the (possibly corrected) unique molecular identifier."""

(Note that I suggest mixing in str, or subclassing StrEnum on Python 3.11+, so that enum members can be passed directly to pysam's tagging functions, e.g. read.has_tag(SamTag.UMI).)
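
As a minimal sketch of why the str mixin matters: members of a `(str, Enum)` class are real `str` instances, so they compare equal to, and hash like, the plain two-character tag. The `has_tag` function and `read_tags` dict below are hypothetical stand-ins for a tag lookup that expects a plain string; they are not pysam's API.

```python
from enum import Enum


class SamTag(str, Enum):
    """Illustrative subset of the standard SAM tags."""

    READ_GROUP = "RG"
    UMI = "RX"


def has_tag(tags: dict, tag: str) -> bool:
    """Hypothetical stand-in for a lookup keyed by plain tag strings."""
    return tag in tags


# Hypothetical tag data for one read, keyed by plain strings.
read_tags = {"RX": "ACGT-TTGA"}

# The member is a str instance and compares equal to the raw tag.
assert isinstance(SamTag.UMI, str)
assert SamTag.UMI == "RX"

# Because str.__hash__ is inherited, the member also works as a dict key.
assert has_tag(read_tags, SamTag.UMI)
assert not has_tag(read_tags, SamTag.READ_GROUP)

# Raw tags read from a file can be mapped back to members by value.
assert SamTag("RG") is SamTag.READ_GROUP
```

When the plain string is needed explicitly, for example in formatted output, `SamTag.UMI.value` returns it.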

SAM tag decorator

To support locally-defined tags, I would propose providing an enumeration class decorator that implements the following validations:

  1. Enforce uniqueness (using enum.unique)
  2. Enforce that tags are two-character strings
  3. Optionally enforce that locally-defined tags adhere to SAM convention, namely that tags start with "X", "Y", or "Z", or are lowercase

e.g.

from enum import Enum

# sam_tag is the proposed decorator; it is not part of pysam.
@sam_tag(strict=True)
class CustomTag(str, Enum):
    """Custom SAM tags used for $project."""

    FOO = "XF"
    """Foo."""

    BAR = "XB"
    """Bar."""

msto commented 4 months ago

I have a proof-of-concept for this feature and would happily open a PR for it here, if you think it would be a sensible addition to pysam.

https://github.com/msto/sam_tags/