microsoft / evodiff

Generation of protein sequences and evolutionary alignments via discrete diffusion models
MIT License

Question not Issue #46

Open marjanUofT opened 2 months ago

marjanUofT commented 2 months ago

The task is to generate new sequences by introducing a random number of mutations at random positions within a specific range of positions, such as [29, 110).

I have the wild-type (WT) sequences, and I need to:

1. Apply a random number of mutations.
2. Mutate random positions within the defined range.

I am unsure whether any of the available models can do this directly. If no suitable model exists, I can implement a method that generates mutations at random positions and then pass each mutated sequence to a model.

The goal is to generate at least 1,000 new sequences.
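For reference, the model-free fallback described above (pick random positions in the window and substitute residues) could look roughly like the sketch below. It is an illustration under stated assumptions, not EvoDiff functionality: the function name `random_mutants`, the `max_mutations` cap, the uniform choice of mutation count, and 0-based half-open indexing are all assumptions made here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_mutants(wt, lo, hi, n_seqs=1000, max_mutations=10, seed=0):
    """Return n_seqs variants of wt with substitutions at 0-based positions in [lo, hi).

    Hypothetical helper: the mutation count is drawn uniformly from
    1..max_mutations, and only substitutions (no indels) are applied.
    """
    rng = random.Random(seed)
    window = range(lo, min(hi, len(wt)))
    if len(window) == 0:
        raise ValueError("the mutation window is empty for this sequence")

    variants = []
    for _ in range(n_seqs):
        k = rng.randint(1, min(max_mutations, len(window)))
        positions = rng.sample(window, k)  # distinct positions, no repeats
        seq = list(wt)
        for p in positions:
            # Pick a residue that differs from the current one, so every
            # chosen position is a real substitution.
            seq[p] = rng.choice([a for a in AMINO_ACIDS if a != seq[p]])
        variants.append("".join(seq))
    return variants


if __name__ == "__main__":
    wild_type = "ACDEFGHIKLMNPQRSTVWY" * 6  # placeholder sequence, length 120
    mutants = random_mutants(wild_type, 29, 110)
    print(len(mutants), mutants[0])
```

Sampling positions without replacement keeps the number of mutated sites equal to the number drawn. Different draws can still produce identical variants, so deduplicate if 1,000 unique sequences are required.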

yangkky commented 2 months ago

We don't have this functionality implemented but it seems pretty straightforward.

(Reply sent by email, quoting Marjan Mohammadi's original post of Fri, Sep 6, 2024, 7:47 PM.)

marjanUofT commented 2 months ago

Thank you, @yangkky, for your quick response. Below is the code for generating mutation lists using the Poisson distribution and applying the "OA_DM_38M" model. This should help anyone looking to achieve similar results:


import numpy as np

# Assumed to be defined elsewhere (not shown in this comment): model, tokenizer,
# device, sequences, mask_sequences, tokenize_sequences, generate_unique_sequences.

def generate_mutation_lists(num_lists=1000, mean_mutations=15, min_pos=21, max_pos=110):
    """Return num_lists (start_list, end_list) pairs of one-residue spans."""
    all_lists = []

    for _ in range(num_lists):
        # Number of mutations for this variant, drawn from Poisson(mean_mutations)
        num_mutations = np.random.poisson(mean_mutations)

        # End positions drawn uniformly from [min_pos, max_pos], with replacement,
        # so the same position can appear more than once
        ends_list = np.random.randint(min_pos, max_pos + 1, num_mutations)
        ends_list.sort()  # Sort the end positions

        # Each span starts one position before its end, clipped to the allowed range
        start_list = np.clip(ends_list - 1, min_pos, max_pos)

        all_lists.append((start_list.tolist(), ends_list.tolist()))

    return all_lists

mutation_lists = generate_mutation_lists()

total_num_gen_seqs = 1000
generated_sequences = {}

for idx, (start_ids, end_ids) in enumerate(mutation_lists[:total_num_gen_seqs]):
    # Wrap in an outer list (presumably one entry per input sequence)
    start_ids = [start_ids]
    end_ids = [end_ids]

    masked_sequences = mask_sequences(sequences, start_ids, end_ids)
    tokenized_sequences = tokenize_sequences(masked_sequences, tokenizer, device)

    new_sequences = generate_unique_sequences(model, tokenized_sequences, start_ids, end_ids, sequences, tokenizer, num_gen_seqs=1)

    generated_sequences[idx] = new_sequences
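
The helpers called in the loop (`mask_sequences`, `tokenize_sequences`, `generate_unique_sequences`) are not shown in this thread. To illustrate how the start and end lists might be consumed, here is a hypothetical stand-in for the masking step only. It assumes 0-based, half-open spans `[start, end)` and `#` as the mask character; both are assumptions to check against the helpers and tokenizer actually in use.

```python
MASK = "#"  # assumed mask character; verify against the tokenizer in use


def mask_spans_sketch(sequences, start_ids, end_ids, mask_char=MASK):
    """Mask half-open spans [start, end) in each sequence (0-based indices).

    Hypothetical stand-in for the thread's mask_sequences helper, whose real
    implementation and indexing convention are not shown.
    """
    masked = []
    for seq, starts, ends in zip(sequences, start_ids, end_ids):
        chars = list(seq)
        for s, e in zip(starts, ends):
            for i in range(s, min(e, len(chars))):
                chars[i] = mask_char
        masked.append("".join(chars))
    return masked


if __name__ == "__main__":
    wt = "ACDEFGHIKLMNPQRSTVWY"
    # Two one-residue spans: positions 2 and 9 are masked.
    print(mask_spans_sketch([wt], [[2, 9]], [[3, 10]]))
    # A span with start == end is empty under this convention and masks nothing.
    print(mask_spans_sketch([wt], [[5]], [[5]]) == [wt])
```

Under this convention, any end position equal to `min_pos` in `generate_mutation_lists` gives a clipped start equal to its end, so that span is empty. If the real helper treats spans as inclusive, the same pair would mask a residue instead, so it is worth confirming which convention applies.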