Affinity table from the SELEX data

Thanks for your query. You can get the count of every k-mer from the file and estimate enrichment by applying a threshold to those counts. The main function should be helpful; I am pasting its docstring here:

def kmer_fraction_from_file(sequence_file, k=5, top=50,
    txt=False):
    """Iterate over the sequencing reads file and calculate
    kmer occurrences (at most one per read) and subsequent
    PWM models from the top kmers

    Args:
        sequence_file (str): File containing sequencing
            reads, can be a txt, fasta or fastq file

    Kwargs:
        k (int): length of kmer (default=5)
        top (int): No. of top PWM models to calculate based
            on the seed kmer occurrence (default=50)
        txt (bool): Whether the sequence_file is a txt file
            (default=False)

    Returns:
        counts (collections.Counter): kmer occurrence (at most
            one per read) for all kmers
        fraction (dict): fraction of reads
            containing the kmer for all kmers
        collections.defaultdict(collections.defaultdict(
            collections.Counter)): A nested data structure
            with keys as kmer and values as a defaultdict
            of Counter containing nucleotide frequencies at
            every position along the length of the kmer
    """
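
To make "at most one per read" concrete, here is a minimal sketch of that counting rule on a plain list of reads. It is not the eme_selex implementation; the function name `count_kmers_once_per_read` and the toy reads are made up for illustration.

```python
from collections import Counter


def count_kmers_once_per_read(reads, k=5):
    """Count each kmer at most once per read (hypothetical helper)."""
    counts = Counter()
    n_reads = 0
    for read in reads:
        n_reads += 1
        # A set collapses repeated kmers within the same read,
        # so each read contributes at most 1 to any kmer's count.
        kmers_in_read = {read[i:i + k] for i in range(len(read) - k + 1)}
        counts.update(kmers_in_read)
    # Fraction of reads that contain each kmer.
    fractions = {}
    if n_reads:
        fractions = {kmer: c / n_reads for kmer, c in counts.items()}
    return counts, fractions


# Toy reads, for illustration only.
reads = ["AAAAAA", "AAAAAC", "GGGGGG"]
counts, fractions = count_kmers_once_per_read(reads, k=5)
```

Note that "AAAAA" occurs twice in the first read but is counted only once for it.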

You can use this function as follows:

from eme_selex.eme import kmer_fraction_from_file as kf

k = 5

counts, fractions, pfm_models = kf(f"{your_fasta_file}.fasta.gz", k=k)
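
One possible way to turn the returned fractions into a table, sketched under my own assumptions: keep kmers whose fraction is at or above a threshold and scale each one by the top kmer's fraction. The helper `affinity_table`, the `min_fraction` value and the example numbers are all hypothetical, and this normalisation is only one of several reasonable choices.

```python
def affinity_table(fractions, min_fraction=0.01):
    """Rank kmers above a threshold, relative to the top kmer (hypothetical helper)."""
    # Keep only kmers found in at least `min_fraction` of the reads.
    kept = {kmer: f for kmer, f in fractions.items() if f >= min_fraction}
    if not kept:
        return []
    top = max(kept.values())
    # Each row: (kmer, fraction of reads, fraction relative to the top kmer).
    rows = [(kmer, f, f / top) for kmer, f in kept.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up fractions standing in for the `fractions` dict returned above.
example_fractions = {"ACGTA": 0.20, "CGTAC": 0.10, "TTTTT": 0.001}
table = affinity_table(example_fractions, min_fraction=0.01)
for kmer, fraction, relative in table:
    print(f"{kmer}\t{fraction:.3f}\t{relative:.2f}")
```

With your own data you would pass the `fractions` dict from the call above instead of `example_fractions`, and pick a threshold that suits your read depth.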

kashyapchhatbar / eme_selex

Affinity table from the SELEX data #3