The existing `Sequence.gc` method purposefully ignores characters other than G/C and uses the sequence length as a denominator to produce "fraction g/c". This has a few benefits:

- `len(sequence)` is fast to compute vs. counting more occurrences of characters

The downside is that any non-GCAT characters may be included in the denominator:

https://github.com/mdshw5/pyfaidx/blob/7b4d8d7aceadaa1fde05846e854e6eccdba38b77/pyfaidx/__init__.py#L254-L266
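
To make the denominator issue concrete, here is a minimal standalone sketch of the "G/C over sequence length" calculation. It is not the pyfaidx source: the function name `gc_over_length` is hypothetical, and the upper-casing and the empty-sequence handling are my own assumptions.

```python
def gc_over_length(seq: str) -> float:
    """Fraction of G/C characters, with the full sequence length as denominator.

    Hypothetical helper for illustration only; not part of pyfaidx.
    """
    seq = seq.upper()  # assumption: treat soft-masked (lowercase) bases like uppercase
    if not seq:
        return 0.0  # assumption: define the fraction of an empty sequence as 0
    return (seq.count("G") + seq.count("C")) / len(seq)


# The same four G/C bases give different fractions once N is present,
# because every N still counts toward the denominator.
print(gc_over_length("GGCC"))      # 1.0
print(gc_over_length("GGCCNNNN"))  # 0.5
```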

I'd welcome any pull request to implement something like:

- `Sequence.gc_iupac` method that counts e.g. S=GC and W=AT, and also considers K=GT. This is considerably more difficult than the current method and requires some validation of the sequence to confirm that it only contains valid IUPAC letters
- `Sequence.gc_strict` method that counts G/C and A/T, implicitly ignoring all other characters. This is probably closest to what people expect as GC content
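
As a starting point for such a pull request, the two ideas could be sketched as plain functions like the ones below. Nothing here exists in pyfaidx today: the names `gc_strict`, `gc_iupac` and `IUPAC_GC_WEIGHT` are hypothetical. Weighting each ambiguity code by the share of G/C among the bases it represents (e.g. K=G/T counts as 0.5) is only one possible design, and it assumes those bases are equally likely. Rejecting sequences with no A/C/G/T in `gc_strict`, and rejecting gap characters in `gc_iupac`, are also assumptions.

```python
# Hypothetical sketch; none of these names are part of pyfaidx.

# Share of G/C among the bases each IUPAC nucleotide code can stand for,
# assuming every represented base is equally likely.
IUPAC_GC_WEIGHT = {
    "G": 1.0, "C": 1.0, "S": 1.0,            # S = G or C
    "A": 0.0, "T": 0.0, "U": 0.0, "W": 0.0,  # W = A or T
    "R": 0.5, "Y": 0.5, "K": 0.5, "M": 0.5,  # R=A/G, Y=C/T, K=G/T, M=A/C
    "N": 0.5,                                # N = any base
    "B": 2 / 3, "V": 2 / 3,                  # B=C/G/T, V=A/C/G
    "D": 1 / 3, "H": 1 / 3,                  # D=A/G/T, H=A/C/T
}


def gc_strict(seq: str) -> float:
    """G/C fraction over A/C/G/T only; all other characters are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    return gc / (gc + at)


def gc_iupac(seq: str) -> float:
    """Expected G/C fraction, counting IUPAC ambiguity codes by weight."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    # Validate first so unexpected letters fail loudly instead of being skipped.
    invalid = set(seq) - set(IUPAC_GC_WEIGHT)
    if invalid:
        raise ValueError(f"non-IUPAC characters: {sorted(invalid)}")
    return sum(IUPAC_GC_WEIGHT[base] for base in seq) / len(seq)


print(gc_strict("GGCCNNNN"))  # 1.0, the N characters are ignored
print(gc_iupac("SSWW"))       # 0.5
print(gc_iupac("GK"))         # 0.75
```

Both functions scan the sequence more than the current length-based calculation does, which is the speed trade-off mentioned above.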