vtarasv / 3d-prot-dta

3DProtDTA: a deep learning model for drug-target affinity prediction based on residue-level protein graphs
https://doi.org/10.1039/D3RA00281K

alphafold2 kinases #3

Closed speakstone closed 9 months ago

speakstone commented 1 year ago

I noticed that when you used AlphaFold2 to predict the protein structures, you extracted only the kinase domain. Can you explain in detail how to do this? I would like to do the same kind of structure prediction and analysis on other datasets. The research on the role of kinases is very interesting.

vtarasv commented 1 year ago

Sure. Because the sequences used by AlphaFold2 correspond to those documented in UniProt, you can extract domain location data from UniProt using the protein accession identifier (e.g., P00519). The AlphaFold2-generated structure can then be cut precisely at the parsed domain start and end positions. Please try the example below. In this specific case, one of the resulting outputs should be a feature of type domain with the description Protein kinase, spanning 'begin': '242' to 'end': '493'. Note that other annotations for this feature may exist; in our paper we focused on identifying protein kinase, histidine kinase, PI3K/PI4K, PIPK, AGC-kinase, or CBS. To isolate the kinase domain, truncate the structure to residues 242 through 493.

import os
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniprot}"  # XML namespace prefix of UniProt entries
LOG_DIR = "log"


def log_error(filename, *fields):
    # Append one " | "-separated record to a file in LOG_DIR,
    # creating the directory first so that open() cannot fail on it.
    os.makedirs(LOG_DIR, exist_ok=True)
    with open(os.path.join(LOG_DIR, filename), "a") as f:
        print(*fields, sep=" | ", file=f)


def get_xml(upacc):
    # Download and parse the UniProt XML record; return None on any failure.
    url = f'https://www.uniprot.org/uniprot/{upacc}.xml'

    try:
        response = urllib.request.urlopen(url).read()
    except (urllib.error.HTTPError, urllib.error.URLError) as e:
        log_error("err_up_urllib.error.HTTPError|URLError.txt", upacc, e)
        return None

    try:
        return ET.fromstring(response)
    except ET.ParseError as e:
        log_error("err_up_ET.ParseError.txt", upacc, e)
        return None


def parse_feature(child, d, upacc):
    # Add one <feature> element to d as {"description", "begin", "end"},
    # grouped by feature type (e.g. "domain", "chain").
    assert child.tag == NS + "feature"
    type_ = child.attrib["type"]
    description = child.attrib.get("description", "")

    position, begin, end = "", "", ""
    for loc in child.findall(NS + "location"):
        for prop in loc:
            if "position" not in prop.attrib:
                continue
            if prop.tag == NS + "position":
                position = prop.attrib["position"]
            elif prop.tag == NS + "begin":
                begin = prop.attrib["position"]
            elif prop.tag == NS + "end":
                end = prop.attrib["position"]

    if position and not begin and not end:
        # Single-residue feature: treat it as a range of length one.
        begin, end = position, position
    elif not (not position and begin and end):
        # Missing or mixed location data: log it for later inspection.
        log_error("err_up_parse_features_no_or_mix_loc.txt", upacc, type_, description)

    if begin and end:
        d.setdefault(type_, []).append(
            {"description": description, "begin": begin, "end": end}
        )


def parse_upacc(root, upacc):
    # Collect all features of every <entry> in the record.
    d_feat = dict()
    for entry in root:
        for child in entry:
            if child.tag == NS + "feature":
                parse_feature(child, d_feat, upacc)
    return d_feat


upacc = 'P00519'
root = get_xml(upacc)
# get_xml returns None when the download or parsing fails.
data = parse_upacc(root, upacc) if root is not None else {}
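
The truncation step described above can be sketched as follows. This is a minimal illustration, not the code used in the paper: `KINASE_KEYWORDS`, `kinase_ranges`, and `slice_pdb_lines` are hypothetical names, the input is assumed to be the dictionary returned by `parse_upacc` with domains stored under the "domain" key, and the structure is assumed to be a PDB-format file whose residue numbering matches the UniProt sequence.

```python
# Hypothetical keyword list, taken from the annotations named in this reply.
KINASE_KEYWORDS = ("protein kinase", "histidine kinase", "pi3k/pi4k",
                   "pipk", "agc-kinase", "cbs")


def kinase_ranges(features, keywords=KINASE_KEYWORDS):
    """Return (begin, end) integer pairs for kinase-related domain features.

    `features` is assumed to have the shape produced by parse_upacc:
    {"domain": [{"description": ..., "begin": "242", "end": "493"}, ...]}
    """
    ranges = []
    for feat in features.get("domain", []):
        desc = feat["description"].lower()
        # Plain substring matching; review the matches manually.
        if any(keyword in desc for keyword in keywords):
            ranges.append((int(feat["begin"]), int(feat["end"])))
    return ranges


def slice_pdb_lines(pdb_lines, begin, end):
    """Keep ATOM/HETATM records whose residue number lies in [begin, end].

    In the PDB format the residue sequence number occupies columns 23-26,
    which is the slice [22:26] of each record line.
    """
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            if begin <= int(line[22:26]) <= end:
                kept.append(line)
    kept.append("END")
    return kept
```

For P00519, `kinase_ranges(data)` should include the 242 to 493 range mentioned above; that pair can then be passed to `slice_pdb_lines` together with the lines of the structure file.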
speakstone commented 1 year ago

Thank you for such an insightful explanation! The code you supplied shows clearly how to obtain the index ranges of the domains, including the SH3, SH2, and kinase domains. This information is very helpful.

However, I still have one question. Does the method first use AlphaFold2 to predict the structure of the complete protein and then use the domain's index range to slice it? For example, would AlphaFold2 predict the structure for residues 1 to 1000, after which you use UniProt to locate the kinase domain at 242-493 and slice the PDB file accordingly? Clarifying this would greatly help my understanding of the process.

Looking forward to hearing from you soon!

vtarasv commented 1 year ago

Yes, that is correct. You can find AlphaFold-predicted structures on the protein's UniProt entry page or in the AlphaFold database.
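
Fetching a predicted structure from the AlphaFold database might look like the sketch below. The file-naming pattern, the model version number, and the function names `alphafold_pdb_url` and `download_alphafold_pdb` are assumptions for illustration; check the protein's AlphaFold database entry page for the download link that is currently valid.

```python
import urllib.request


def alphafold_pdb_url(upacc, version=4, fragment=1):
    """Build a download URL for a predicted structure in PDB format.

    Assumption: files are named AF-<accession>-F<fragment>-model_v<version>.pdb.
    The version number changes between database releases.
    """
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{upacc}-F{fragment}-model_v{version}.pdb"
    )


def download_alphafold_pdb(upacc, out_path, version=4):
    """Download the predicted structure for one accession to out_path."""
    urllib.request.urlretrieve(alphafold_pdb_url(upacc, version=version), out_path)
    return out_path
```

The downloaded file contains the whole predicted protein, so the kinase domain still has to be cut out using the UniProt positions.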

speakstone commented 1 year ago

One more question: I've noticed that for different PDBs, the source of prediction may vary, including AP2-ASSOCIATED PROTEIN KINASE 1, TYROSINE-PROTEIN KINASE ABL1, TYROSINE-PROTEIN KINASE ABL2, ACTIVIN RECEPTOR TYPE-1, etc. As you previously mentioned, these could be categorized as protein kinases, histidine kinases, PI3K/PI4K, PIPK, AGC-kinase, or CBS. How do you determine which category to use for different proteins? Is the choice based on what UniProt has listed? What happens if there is more than one category available for a given protein? Could you please clarify this process? Thank you!

speakstone commented 1 year ago

I have printed all your current cases. Is it only the 'chain' part of the UniProt result? What if the length of this list exceeds 1?

vtarasv commented 1 year ago

I run a comprehensive search of the protein domain annotations and then narrow the selection to those associated with kinases, based on context. To identify the annotations you need for your set of proteins, you may have to manually inspect the most frequently occurring ones.
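
One way to prepare that manual inspection is to count how often each domain description occurs across the proteins of interest. This is only a sketch: `count_domain_descriptions` is a hypothetical helper, and `features_by_protein` is assumed to map each accession to the dictionary returned by `parse_upacc`.

```python
from collections import Counter


def count_domain_descriptions(features_by_protein, feature_type="domain"):
    """Count in how many proteins each feature description occurs.

    `features_by_protein` maps an accession to a parse_upacc-style dict.
    A description is counted once per protein, even if it is repeated.
    """
    counts = Counter()
    for features in features_by_protein.values():
        seen = {feat["description"] for feat in features.get(feature_type, [])}
        counts.update(seen)
    return counts
```

Calling `count_domain_descriptions(...).most_common(20)` lists the most frequent descriptions, which can then be reviewed by hand to decide which ones denote the domain of interest.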

vtarasv commented 1 year ago

Could you please clarify the question?

speakstone commented 1 year ago

Thank you immensely for your detailed response. I've gained a complete understanding of this aspect of your work, and I must say it's truly commendable. The depth and rigor of your research are evident, and it's genuinely inspiring to see the contributions you've made to the field.