schulter / EMOGI

An explainable multi-omics graph integration method based on graph convolutional networks to predict cancer genes.
GNU General Public License v3.0

"genemap_search_cancer.txt" and "genemap2.txt" #4

Closed rojinsafavi closed 3 years ago

rojinsafavi commented 3 years ago

Could you please elaborate on the difference between "genemap_search_cancer.txt" and "genemap2.txt"? I am not sure where I can find the first one.

import pandas as pd


def get_negative_labels(nodes, positives, ppi_network, min_degree=1, verbose=False):
    if verbose:
        print("{} genes are in network".format(nodes.shape[0]))
    # get rid of the positives (known cancer genes)
    not_positives = nodes[~nodes.Name.isin(positives)]
    if verbose:
        print("{} genes are in network but not in positives (known cancer genes from NCG)".format(not_positives.shape[0]))

    # get rid of OMIM genes associated with cancer
    omim_cancer_genes = pd.read_csv('../../data/pancancer/OMIM/genemap_search_cancer.txt',
                                    sep='\t', comment='#', header=0, skiprows=3)
    # missing entries stay NaN after str.split; nan != nan, so the self-comparison drops them
    sublists = [sublist for sublist in omim_cancer_genes['Gene/Locus'].str.split(',') if sublist == sublist]
    omim_cancer_geneset = [item.strip() for sublist in sublists for item in sublist]
    not_omim_not_pos = not_positives[~not_positives.Name.isin(omim_cancer_geneset)]
    if verbose:
        print("{} genes are also not in OMIM cancer genes".format(not_omim_not_pos.shape[0]))

    # get rid of all the OMIM disease genes
    omim_genes = pd.read_csv('../../data/pancancer/OMIM/genemap2.txt', sep='\t', comment='#', header=None)
    omim_genes.columns = ['Chromosome', 'Genomic Position Start', 'Genomic Position End', 'Cyto Location',
                          'Computed Cyto Location', 'Mim Number', 'Gene Symbol', 'Gene Name',
                          'Approved Symbol', 'Entrez Gene ID', 'Ensembl Gene ID', 'Comments',
                          'Phenotypes', 'Mouse Gene Symbol/ID']

schulter commented 3 years ago

Hi, the main difference is that genemap2.txt contains all OMIM disease genes, while genemap_search_cancer.txt contains only those specifically associated with cancer. I'm not 100% sure that genemap2.txt is a superset of the cancer-specific list (I assume it is, and it should be), so I left both in. I also tried several approaches for stricter and more lenient filtering, so it's possible that genemap_search_cancer.txt could be removed from the code.

I might have a look into this soon but currently I don't have much time.

rojinsafavi commented 3 years ago

Thanks @schulter, I think genemap_search_cancer.txt is a subset of genemap2.txt. I will close the issue now!
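
For anyone who wants to verify the subset relationship rather than assume it, here is a minimal sketch. It assumes both files have already been loaded into DataFrames as in the snippet above, that the cancer list keeps its symbols in 'Gene/Locus', and that the full list keeps them in 'Gene Symbol' as comma-separated strings. The helper names split_symbols and missing_from_superset are hypothetical, and the toy frames below use made-up symbols, not real OMIM data.

```python
import pandas as pd


def split_symbols(series):
    """Flatten a column of comma-separated gene symbols into a set, skipping missing values."""
    symbols = set()
    for entry in series.dropna():
        for item in str(entry).split(','):
            item = item.strip()
            if item:
                symbols.add(item)
    return symbols


def missing_from_superset(cancer_df, all_df,
                          cancer_col='Gene/Locus', all_col='Gene Symbol'):
    """Return cancer-list symbols that do not occur anywhere in the full list."""
    cancer_symbols = split_symbols(cancer_df[cancer_col])
    all_symbols = split_symbols(all_df[all_col])
    return cancer_symbols - all_symbols


# Toy stand-ins for the two OMIM tables (made-up symbols, assumed column names).
cancer_toy = pd.DataFrame({'Gene/Locus': ['GENE_A, GENE_B', None, 'GENE_C']})
all_toy = pd.DataFrame({'Gene Symbol': ['GENE_A, GENE_B, GENE_X', 'GENE_C', 'GENE_D']})

# An empty result means the cancer list is a subset of the full list.
missing_toy = missing_from_superset(cancer_toy, all_toy)
print(sorted(missing_toy))

# A symbol that is absent from the full list is reported back.
extra_toy = pd.DataFrame({'Gene/Locus': ['GENE_A', 'GENE_Z']})
print(sorted(missing_from_superset(extra_toy, all_toy)))
```

If the result is empty on the real files, filtering on genemap2.txt alone should remove the same genes, and the genemap_search_cancer.txt step would be redundant. If it is not empty, the two filters are not interchangeable and both steps matter.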