jrderuiter / pybiomart

A simple pythonic interface to biomart.
MIT License

NaNs cause Entrez IDs to be treated as floats #8

Open samwindels opened 5 years ago

samwindels commented 5 years ago

Hi,

I perform the following call to map two genes to entrez gene ids.

dataset = server.marts['ENSEMBL_MART_ENSEMBL'].datasets['hsapiens_gene_ensembl']
dataset.query(attributes=['ensembl_gene_id', 'entrezgene'],
              filters={'link_ensembl_gene_id': ['ENSG00000285363', 'ENSG00000285114']})

As a result I get:

    Gene stable ID  NCBI gene ID
0  ENSG00000285114       56169.0
1  ENSG00000285363           NaN

Because ENSG00000285363 has no known mapping in NCBI, the entire column is returned as floats. This is troublesome because I can't tell whether ENSG00000285114 maps to 56169 or 561690.

Regards,

Sam

ivirshup commented 5 years ago

@samwindels, just a comment on how the parsing works: ENSG00000285114 maps to 56169. If it mapped to 561690, you'd get a float that looks like 561690.0.

Here's an example:

In [5]: dataset.query(attributes=['ensembl_gene_id', 'entrezgene'], filters={'link_ensembl_gene_id':
   ...:  ['ENSG00000099725','ENSG00000185115', 'ENSG00000285363']})                                 
Out[5]: 
    Gene stable ID  NCBI gene ID
0  ENSG00000099725        5616.0
1  ENSG00000185115       56160.0
2  ENSG00000285363           NaN
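
The same behaviour can be reproduced without querying BioMart. This is a minimal sketch that assumes only pandas. The tab-separated text is hand-written to mirror the output above (gene IDs and values are copied from it) and is not necessarily what pybiomart receives from the server.

```python
import io

import pandas as pd

# Hypothetical tab-separated text mirroring the result above;
# the last row has no NCBI gene ID.
raw = (
    "Gene stable ID\tNCBI gene ID\n"
    "ENSG00000099725\t5616\n"
    "ENSG00000185115\t56160\n"
    "ENSG00000285363\t\n"
)

frame = pd.read_csv(io.StringIO(raw), sep="\t")

# A NumPy integer column cannot hold a missing value,
# so pandas falls back to float64 for the whole column.
ncbi_dtype = str(frame["NCBI gene ID"].dtype)

# The float values are still exact: 5616.0 and 56160.0 are different numbers.
ncbi_values = frame["NCBI gene ID"].tolist()
print(ncbi_dtype, ncbi_values)
```

The trailing `.0` is only how the float is displayed. The digits before the decimal point are the full identifier.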

As a workaround, you should be able to get the correct identifiers (as strings) with:

result["entrezgene"].apply(lambda x: "{:.0f}".format(x))
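
One caveat: formatting with "{:.0f}" turns a missing value into the string "nan". An alternative sketch, assuming pandas 0.24 or later (where the nullable `Int64` dtype exists), is shown here. `result` is a hand-built stand-in for the query result, and the column label follows the output shown above rather than the attribute name.

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame returned by dataset.query(...) in the example above.
result = pd.DataFrame({
    "Gene stable ID": ["ENSG00000099725", "ENSG00000185115", "ENSG00000285363"],
    "NCBI gene ID": [5616.0, 56160.0, np.nan],
})

# Nullable integer dtype: whole-number floats become integers,
# and NaN becomes a missing-value marker instead of forcing floats.
ids_int = result["NCBI gene ID"].astype("Int64")

# String identifiers, leaving missing entries as None instead of "nan".
ids_str = [None if pd.isna(v) else str(int(v)) for v in ids_int]
print(ids_str)
```

Keeping the column as `Int64` is convenient if the IDs are later used for joins, while the string list is closer to the original workaround.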