kashyapchhatbar / eme_selex

BSD 3-Clause "New" or "Revised" License
1 stars 0 forks source link

Affinity table from the SELEX data #3

Open UpalabdhaD opened 1 month ago

UpalabdhaD commented 1 month ago

Hi, thanks for this useful software, I was using this tool, and currently wondering if it is possible to generate a affinity table for each enriched k-mer in a library. I found one tool in bioconductor, 'SELEX' package, but a pythonic solution would be much versatile IMHO.

Thanks in advance!

kashyapchhatbar commented 4 weeks ago

Thanks for your query. You can get the counts of every k-mer from the file and estimate the enrichment based on a threshold. The main function should be helpful (pasting the docstring)

def kmer_fraction_from_file(sequence_file, k=5, top=50,
    txt=False):
    """Iterate over the sequencing reads file and calculate
    kmer occurences (at most one per read) and subsequent
    PWM models from the top kmers

    Args:
        sequence_file (str): File containing sequencing
            reads, can be a txt, fasta or fastq file

    Kwargs:
        k (int): length of kmer (default=5)
        top (int): No. of top PWM models to calculate based
        on the seed kmer occurence (default=50)
        txt (bool): Whether the sequene_file is a txt file
            (default=False)

    Returns:
        counts (collections.Counter): kmer occurence (at most
            one per read) for all kmers
        fraction (dict): fraction of reads
            containing the kmer for all kmers
        collections.defaultdict(collections.defaultdict(
            collections.Counter)): A nested data structure
            with keys as kmer and values as a defaultdict
            of Counter containing nucleotide frequencies at
            every position along the length of the kmer
    """   

You can use this function as follows

from eme_selex.eme import kmer_fraction_from_file as kf

k = 5

counts, fractions, pfm_models = kf(f"{your_fasta_file}.fasta.gz", k=k)