Open UpalabdhaD opened 1 month ago
Thanks for your query. You can get the counts of every k-mer from the file and estimate the enrichment based on a threshold. The main function should be helpful (pasting the docstring)
def kmer_fraction_from_file(sequence_file, k=5, top=50,
txt=False):
"""Iterate over the sequencing reads file and calculate
kmer occurences (at most one per read) and subsequent
PWM models from the top kmers
Args:
sequence_file (str): File containing sequencing
reads, can be a txt, fasta or fastq file
Kwargs:
k (int): length of kmer (default=5)
top (int): No. of top PWM models to calculate based
on the seed kmer occurence (default=50)
txt (bool): Whether the sequene_file is a txt file
(default=False)
Returns:
counts (collections.Counter): kmer occurence (at most
one per read) for all kmers
fraction (dict): fraction of reads
containing the kmer for all kmers
collections.defaultdict(collections.defaultdict(
collections.Counter)): A nested data structure
with keys as kmer and values as a defaultdict
of Counter containing nucleotide frequencies at
every position along the length of the kmer
"""
You can use this function as follows
from eme_selex.eme import kmer_fraction_from_file as kf
k = 5
counts, fractions, pfm_models = kf(f"{your_fasta_file}.fasta.gz", k=k)
Hi, thanks for this useful software, I was using this tool, and currently wondering if it is possible to generate a affinity table for each enriched k-mer in a library. I found one tool in bioconductor, 'SELEX' package, but a pythonic solution would be much versatile IMHO.
Thanks in advance!