oxli-bio / oxli

k-mers and the like
BSD 3-Clause "New" or "Revised" License

Add Count Thresholding #18

Closed Adamtaranto closed 2 months ago

Adamtaranto commented 2 months ago

It is often useful to exclude low-abundance (likely erroneous) or high-abundance (repeat-associated) k-mers from a count table.

As a user I'd expect a method called .min() to return all the kmers with the minimum observed count and .max() to be all kmers with the max observed count.

For thresholding at a cutoff value, maybe something like .mincut() and .maxcut()?

Suggested use:

import oxli

table = oxli.KmerCountTable(3)
kmers = ["AAA", "GGG", "GGG"]

for kmer in kmers:
    table.count(kmer)

# Proposed method: drop every k-mer counted fewer than 2 times
table.mincut(2)
>> "Dropped 1 hash with fewer than 2 counts."

table.get("AAA")
>> 0

table.get("GGG")
>> 2
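
To make the proposed semantics concrete, here is a rough pure-Python sketch of how .min(), .max(), .mincut() and .maxcut() could behave. `ToyKmerTable` is a hypothetical stand-in built on `collections.Counter`, not the oxli implementation: it keys on k-mer strings rather than hashes, and it assumes .maxcut() drops counts strictly above the cutoff, mirroring the "fewer than" wording for .mincut().

```python
from collections import Counter


class ToyKmerTable:
    """Hypothetical stand-in for a k-mer count table; not the oxli API."""

    def __init__(self, ksize):
        self.ksize = ksize
        self.counts = Counter()

    def count(self, kmer):
        """Increment the count for one k-mer and return the new count."""
        if len(kmer) != self.ksize:
            raise ValueError(f"expected a {self.ksize}-mer, got {kmer!r}")
        self.counts[kmer] += 1
        return self.counts[kmer]

    def get(self, kmer):
        """Return the count for a k-mer, or 0 if absent or dropped."""
        return self.counts.get(kmer, 0)

    def min(self):
        """Return all k-mers that share the lowest observed count."""
        if not self.counts:
            return []
        lowest = min(self.counts.values())
        return sorted(k for k, c in self.counts.items() if c == lowest)

    def max(self):
        """Return all k-mers that share the highest observed count."""
        if not self.counts:
            return []
        highest = max(self.counts.values())
        return sorted(k for k, c in self.counts.items() if c == highest)

    def mincut(self, threshold):
        """Drop k-mers with count < threshold; return how many were dropped."""
        dropped = [k for k, c in self.counts.items() if c < threshold]
        for k in dropped:
            del self.counts[k]
        return len(dropped)

    def maxcut(self, threshold):
        """Drop k-mers with count > threshold (assumed strict); return how many were dropped."""
        dropped = [k for k, c in self.counts.items() if c > threshold]
        for k in dropped:
            del self.counts[k]
        return len(dropped)


table = ToyKmerTable(3)
for kmer in ["AAA", "GGG", "GGG"]:
    table.count(kmer)

print(table.min())        # ['AAA']
print(table.max())        # ['GGG']
print(table.mincut(2))    # 1
print(table.get("AAA"))   # 0
print(table.get("GGG"))   # 2
```

Returning the number of dropped entries, rather than printing a message, would leave it to the caller to decide whether and how to report it.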

@ctb?