lindan1128 / Apis-Laboriosa-Project

Codes of paper: Comparative Genomics Reveals Recent Adaptive Evolution in Himalayan Giant Honeybee Apis laboriosa
8 stars 2 forks source link

4-fold degenerate sites #1

Open rillaxy opened 2 years ago

rillaxy commented 2 years ago

Hello, In the paper about A. laboriosa, you estimated divergence time among species based on the concatenated alignments of 4-fold degenerate sites from single-copy genes using MCMCtree in PAML v4.9. But I can't find this step in your command lines. Because I also want to estimated divergence time based on some conserved sequence, so how do you extract 4-fold degenerate sites from the concatenated alignments of single-copy genes ? I'm looking forward to your reply, thanks

lindan1128 commented 2 years ago

Halo @rillaxy , thanks for reading our paper! You can extract the 4dtv in the following way:

Step 1: Define 4DTv from DNA seq

## This code is from: https://www.biostars.org/p/157527/
bases = ['t', 'c', 'a', 'g']
codons = [a+b+c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

def translate(seq):
    seq = seq.lower().replace('\n', '').replace(' ', '')
    peptide = ''

    for i in range(0, len(seq), 3):
        codon = seq[i: i+3]
        amino_acid = codon_table.get(codon, '*')
        if amino_acid != '*':
            peptide += amino_acid
        else:
            break

    return peptide

codon_dict={}
for aa in set(amino_acids):
    temp=[]
    for c in codons:
        if translate(c)==aa:
            temp.append(c)
    if len(temp)==4:
        codon_dict[aa]=temp

print(codon_dict)

Here, note that it misses the Ser, should add by yourself. (I notice this mistake recently, but as the paper has already submitted, I have not revised it.)

Step 2: Search 4DTv in your seqs

from Bio import SeqIO

position = list()

for seq in SeqIO.parse("path_to_your_seq_file.fasta", "fasta"):
    for i in range(0, len(seq), 3):
        codon = seq.seq[i: i+3]
        if any([codon[0:2] == 'GG', codon[1:2] == 'GC', codon[1:2] == 'AC', codon[1:2] == 'GT', codon[1:2] == 'CC']):
            position.append(i + 2)
print(position)

Step 3: Extract 4DTv in your seqs

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

your_seq_list = list()
extracted_seq = list()

for seq in SeqIO.parse("path_to_your_seq_file.fasta", "fasta"):
    if seq.id == 'XXXX':
        for i in range(0, len(sorted(set(position)))):
            your_seq_list.append(seq[i])
    extracted_seq.append(SeqRecord(Seq("".join(XXXX)), id = "XXXX", description = ''))

SeqIO.write(extracted_seq, "path_to_your_extracted_seq_file.fasta", "fasta")
rillaxy commented 2 years ago

Thanks @lindan1128 ,the codon_dict also misses L and R