Closed jvanheld closed 3 weeks ago
I asked ChatGPT to reformulate this in a properly formulated pseudo-code, here is the answer:
# Genetic Algorithm for PSSM Optimization
# INPUTS:
# - initial_matrices: a list of PSSMs (Generation 0)
# - n_generations: the total number of generations to evolve
# - k: the number of top-scoring matrices to select for reproduction
# - mutation_rate: the probability of mutation in matrices
# - mutation_function: a function to mutate a matrix
# - score_function: a function to score a matrix based on its discrimination ability
# INITIALIZATION:
current_generation = initial_matrices
# EVOLUTION PROCESS:
FOR each generation FROM 0 TO n_generations - 1 DO
# Step 1: Score the matrices of the current generation
scores = EMPTY LIST
FOR each matrix IN current_generation DO
score = score_function(matrix)
APPEND (matrix, score) TO scores
END FOR
# Step 2: Sort matrices by decreasing score
SORT scores BY score IN DESCENDING ORDER
# Step 3: Select the top-k scoring matrices
top_matrices = SELECT the first k matrices FROM scores
# Step 4: Create the next generation
next_generation = EMPTY LIST
FOR each matrix IN top_matrices DO
cloned_matrix = CLONE matrix
mutated_matrix = MUTATE cloned_matrix USING mutation_rate AND mutation_function
ADD mutated_matrix TO next_generation
END FOR
# Update the current generation for the next iteration
current_generation = next_generation
END FOR
# OUTPUT:
# - current_generation: the final set of optimized matrices
I asked ChatGPT to generate a code in Python, and provded the signatures of the functions implemented (or to be implemented for sorting).
Here is the answer
def genetic_algorithm(file_path, rsat_cmd, seq_file_pos, seq_file_neg, bg_file, n_generations, k, mutation_rate, n_children, file_prefix):
"""
Perform a genetic algorithm for optimizing Position-Specific Scoring Matrices (PSSMs).
This algorithm evolves an initial set of PSSMs over a specified number of generations,
selecting the top-performing matrices in each generation and generating new matrices
through mutation. The matrices are scored based on their ability to discriminate
between positive and negative sequences.
Parameters:
-----------
file_path : str
The path to the input file containing the initial PSSMs in a transfac format.
rsat_cmd : str
The command or path to the RSAT utility used for scoring the matrices.
seq_file_pos : str
The file path to the positive sequences in FASTA format.
seq_file_neg : str
The file path to the negative sequences in FASTA format.
bg_file : str
The file path to the background model used for scoring.
n_generations : int
The number of generations to evolve the matrices.
k : int
The number of top-scoring matrices to select for reproduction in each generation.
mutation_rate : float
The probability of applying a mutation to a matrix.
n_children : int
The number of offspring (mutated matrices) to generate from each selected parent matrix.
file_prefix : str
The prefix for the output files. Matrices for each generation will be saved with this
prefix, followed by `_gen#`, where `#` is the generation number, and the `.tf` suffix.
Returns:
--------
list
A list of the final set of optimized matrices after the last generation.
Example:
--------
>>> final_matrices = genetic_algorithm(file_path="input.tf",
rsat_cmd="/path/to/rsat",
seq_file_pos="positives.fasta",
seq_file_neg="negatives.fasta",
bg_file="background.bg",
n_generations=10,
k=5,
mutation_rate=0.1,
n_children=2,
file_prefix="output")
"""
# Step 1: Read the initial matrices (Generation 0)
current_generation = parse_transfac(file_path)
# Evolution Process
for generation in range(n_generations):
# Step 2: Score the matrices of the current generation
scores = []
for matrix in current_generation:
matrix_file = save_matrix_to_file(matrix) # Assuming a helper function to save matrix to file
score = score_matrix(rsat_cmd, seq_file_pos, seq_file_neg, matrix_file, bg_file)
scores.append((matrix, score))
# Step 3: Sort matrices by decreasing score
sorted_scores = sort_matrices(current_generation, scores)
# Step 4: Select the top-k scoring matrices
top_matrices = [matrix for matrix, score in sorted_scores[:k]]
# Step 5: Create the next generation
next_generation = []
for i, matrix in enumerate(top_matrices):
for _ in range(n_children):
cloned_matrix = copy.deepcopy(matrix)
mutated_matrix = mutate_pssm(cloned_matrix, gen_nb=generation+1, matrix_nb=i+1, n_children=n_children)
next_generation.append(mutated_matrix)
# Step 6: Export the matrices of the current generation
output_file = f"{file_prefix}_gen{generation}.tf"
export_pssms(next_generation, output_file, out_format='transfac')
# Prepare for the next iteration
current_generation = next_generation
# Return the final set of optimized matrices
return current_generation
And a usage example:
genetic_algorithm(file_path="input.tf",
rsat_cmd="/path/to/rsat",
seq_file_pos="positives.fasta",
seq_file_neg="negatives.fasta",
bg_file="background.bg",
n_generations=10,
k=5,
mutation_rate=0.1,
n_children=2,
file_prefix="output")
I will use this as template and adapt it to our needs
We now dispose of all the elements to implement the GA optimisation :
The GA would implement the following rationale:
read input matrices (generation 0) for g from 0 to n_generations: score matrices of generation g sort matrices by decreasing scores select the k top-scoring matrices create matrices of generation g+1 (clone and mutate)