pvhelden / optimize-matrix-GA

Optimise one or several position-specific scoring matrices based on their capability to discriminate positive from negative sequence sets.
GNU Affero General Public License v3.0
0 stars 0 forks source link

Implement function genetic_algorithm #18

Closed jvanheld closed 3 weeks ago

jvanheld commented 3 weeks ago

We now dispose of all the elements to implement the GA optimisation :

The GA would implement the following rationale:

read input matrices (generation 0) for g from 0 to n_generations: score matrices of generation g sort matrices by decreasing scores select the k top-scoring matrices create matrices of generation g+1 (clone and mutate)

jvanheld commented 3 weeks ago

I asked ChatGPT to reformulate this in a properly formulated pseudo-code, here is the answer:

# Genetic Algorithm for PSSM Optimization

# INPUTS:
# - initial_matrices: a list of PSSMs (Generation 0)
# - n_generations: the total number of generations to evolve
# - k: the number of top-scoring matrices to select for reproduction
# - mutation_rate: the probability of mutation in matrices
# - mutation_function: a function to mutate a matrix
# - score_function: a function to score a matrix based on its discrimination ability

# INITIALIZATION:
current_generation = initial_matrices

# EVOLUTION PROCESS:
FOR each generation FROM 0 TO n_generations - 1 DO

    # Step 1: Score the matrices of the current generation
    scores = EMPTY LIST
    FOR each matrix IN current_generation DO
        score = score_function(matrix)
        APPEND (matrix, score) TO scores
    END FOR

    # Step 2: Sort matrices by decreasing score
    SORT scores BY score IN DESCENDING ORDER

    # Step 3: Select the top-k scoring matrices
    top_matrices = SELECT the first k matrices FROM scores

    # Step 4: Create the next generation
    next_generation = EMPTY LIST
    FOR each matrix IN top_matrices DO
        cloned_matrix = CLONE matrix
        mutated_matrix = MUTATE cloned_matrix USING mutation_rate AND mutation_function
        ADD mutated_matrix TO next_generation
    END FOR

    # Update the current generation for the next iteration
    current_generation = next_generation

END FOR

# OUTPUT:
# - current_generation: the final set of optimized matrices
jvanheld commented 3 weeks ago

I asked ChatGPT to generate a code in Python, and provded the signatures of the functions implemented (or to be implemented for sorting).

Here is the answer

def genetic_algorithm(file_path, rsat_cmd, seq_file_pos, seq_file_neg, bg_file, n_generations, k, mutation_rate, n_children, file_prefix):
    """
    Perform a genetic algorithm for optimizing Position-Specific Scoring Matrices (PSSMs).

    This algorithm evolves an initial set of PSSMs over a specified number of generations, 
    selecting the top-performing matrices in each generation and generating new matrices 
    through mutation. The matrices are scored based on their ability to discriminate 
    between positive and negative sequences.

    Parameters:
    -----------
    file_path : str
        The path to the input file containing the initial PSSMs in a transfac format.
    rsat_cmd : str
        The command or path to the RSAT utility used for scoring the matrices.
    seq_file_pos : str
        The file path to the positive sequences in FASTA format.
    seq_file_neg : str
        The file path to the negative sequences in FASTA format.
    bg_file : str
        The file path to the background model used for scoring.
    n_generations : int
        The number of generations to evolve the matrices.
    k : int
        The number of top-scoring matrices to select for reproduction in each generation.
    mutation_rate : float
        The probability of applying a mutation to a matrix.
    n_children : int
        The number of offspring (mutated matrices) to generate from each selected parent matrix.
    file_prefix : str
        The prefix for the output files. Matrices for each generation will be saved with this 
        prefix, followed by `_gen#`, where `#` is the generation number, and the `.tf` suffix.

    Returns:
    --------
    list
        A list of the final set of optimized matrices after the last generation.

    Example:
    --------
    >>> final_matrices = genetic_algorithm(file_path="input.tf", 
                                           rsat_cmd="/path/to/rsat", 
                                           seq_file_pos="positives.fasta", 
                                           seq_file_neg="negatives.fasta", 
                                           bg_file="background.bg", 
                                           n_generations=10, 
                                           k=5, 
                                           mutation_rate=0.1, 
                                           n_children=2, 
                                           file_prefix="output")
    """
    # Step 1: Read the initial matrices (Generation 0)
    current_generation = parse_transfac(file_path)

    # Evolution Process
    for generation in range(n_generations):
        # Step 2: Score the matrices of the current generation
        scores = []
        for matrix in current_generation:
            matrix_file = save_matrix_to_file(matrix)  # Assuming a helper function to save matrix to file
            score = score_matrix(rsat_cmd, seq_file_pos, seq_file_neg, matrix_file, bg_file)
            scores.append((matrix, score))

        # Step 3: Sort matrices by decreasing score
        sorted_scores = sort_matrices(current_generation, scores)

        # Step 4: Select the top-k scoring matrices
        top_matrices = [matrix for matrix, score in sorted_scores[:k]]

        # Step 5: Create the next generation
        next_generation = []
        for i, matrix in enumerate(top_matrices):
            for _ in range(n_children):
                cloned_matrix = copy.deepcopy(matrix)
                mutated_matrix = mutate_pssm(cloned_matrix, gen_nb=generation+1, matrix_nb=i+1, n_children=n_children)
                next_generation.append(mutated_matrix)

        # Step 6: Export the matrices of the current generation
        output_file = f"{file_prefix}_gen{generation}.tf"
        export_pssms(next_generation, output_file, out_format='transfac')

        # Prepare for the next iteration
        current_generation = next_generation

    # Return the final set of optimized matrices
    return current_generation

And a usage example:

genetic_algorithm(file_path="input.tf", 
                  rsat_cmd="/path/to/rsat", 
                  seq_file_pos="positives.fasta", 
                  seq_file_neg="negatives.fasta", 
                  bg_file="background.bg", 
                  n_generations=10, 
                  k=5, 
                  mutation_rate=0.1, 
                  n_children=2, 
                  file_prefix="output")
jvanheld commented 3 weeks ago

I will use this as template and adapt it to our needs