jrderuiter / pybiomart

A simple pythonic interface to biomart.
MIT License

Error on filter with many items (~100 elements) passed #9

Open ivirshup opened 5 years ago

ivirshup commented 5 years ago

Basically, you can't pass many values to a filter: a list of about 100 IDs fails with an HTTP 502 error, while a short list works. Here's an example, starting with the setup:

from pybiomart import Server
server = Server("www.ensembl.org")
dataset = (server.marts["ENSEMBL_MART_ENSEMBL"]
                 .datasets["hsapiens_gene_ensembl"])
# List of filter values can't be an np.ndarray or pd.Series, so I'm converting to a list
gene_ids = list(
    dataset.query(attributes=["ensembl_gene_id"], use_attr_names=True)["ensembl_gene_id"]
)

Now, making queries with filters:

# This works fine
>>> dataset.query(attributes=["ensembl_gene_id", "hgnc_symbol"], 
                  filters={"ensembl_gene_id": gene_ids[:5]})
    Gene stable ID HGNC symbol
0  ENSG00000000003      TSPAN6
1  ENSG00000000005        TNMD
2  ENSG00000000419        DPM1
3  ENSG00000000457       SCYL3
4  ENSG00000000460    C1orf112

# This throws an error:

>>> dataset.query(attributes=["ensembl_gene_id", "hgnc_symbol"], 
                  filters={"ensembl_gene_id": gene_ids[:100]})
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-23-af811b67d423> in <module>
      1 dataset.query(attributes=["ensembl_gene_id", "hgnc_symbol"], 
----> 2               filters={"ensembl_gene_id": gene_ids[:100]})

~/github/pybiomart/src/pybiomart/dataset.py in query(self, attributes, filters, only_unique, use_attr_names, dtypes)
    287 
    288         # Fetch response.
--> 289         response = self.get(query=ElementTree.tostring(root))
    290 
    291         # Raise exception if an error occurred.

~/github/pybiomart/src/pybiomart/base.py in get(self, **params)
    109             with requests_cache.disabled():
    110                 r = requests.get(self.url, params=params)
--> 111         r.raise_for_status()
    112         return r
    113 

/usr/local/lib/python3.7/site-packages/requests/models.py in raise_for_status(self)
    938 
    939         if http_error_msg:
--> 940             raise HTTPError(http_error_msg, response=self)
    941 
    942     def close(self):

HTTPError: 502 Server Error: Bad Gateway for url: http://www.ensembl.org:80/biomart/martservice?query=%3CQuery+datasetConfigVersion%3D%220.6%22+formatter%3D%22TSV%22+header%3D%221%22+uniqueRows%3D%221%22+virtualSchemaName%3D%22default%22%3E%3CDataset+interface%3D%22default%22+name%3D%22hsapiens_gene_ensembl%22%3E%3CAttribute+name%3D%22ensembl_gene_id%22+%2F%3E%3CAttribute+name%3D%22hgnc_symbol%22+%2F%3E%3CFilter+name%3D%22ensembl_gene_id%22+value%3D%22ENSG00000000003%2CENSG00000000005%2CENSG00000000419%2CENSG00000000457%2CENSG00000000460%2CENSG00000000938%2CENSG00000000971%2CENSG00000001036%2CENSG00000001084%2CENSG00000001167%2CENSG00000001460%2CENSG00000001461%2CENSG00000001497%2CENSG00000001561%2CENSG00000001617%2CENSG00000001626%2CENSG00000001629%2CENSG00000001630%2CENSG00000001631%2CENSG00000002016%2CENSG00000002079%2CENSG00000002330%2CENSG00000002549%2CENSG00000002586%2CENSG00000002587%2CENSG00000002726%2CENSG00000002745%2CENSG00000002746%2CENSG00000002822%2CENSG00000002834%2CENSG00000002919%2CENSG00000002933%2CENSG00000003056%2CENSG00000003096%2CENSG00000003137%2CENSG00000003147%2CENSG00000003249%2CENSG00000003393%2CENSG00000003400%2CENSG00000003402%2CENSG00000003436%2CENSG00000003509%2CENSG00000003756%2CENSG00000003987%2CENSG00000003989%2CENSG00000004059%2CENSG00000004139%2CENSG00000004142%2CENSG00000004399%2CENSG00000004455%2CENSG00000004468%2CENSG00000004478%2CENSG00000004487%2CENSG00000004534%2CENSG00000004660%2CENSG00000004700%2CENSG00000004766%2CENSG00000004776%2CENSG00000004777%2CENSG00000004779%2CENSG00000004799%2CENSG00000004809%2CENSG00000004838%2CENSG00000004846%2CENSG00000004848%2CENSG00000004864%2CENSG00000004866%2CENSG00000004897%2CENSG00000004939%2CENSG00000004948%2CENSG00000004961%2CENSG00000004975%2CENSG00000005001%2CENSG00000005007%2CENSG00000005020%2CENSG00000005022%2CENSG00000005059%2CENSG00000005073%2CENSG00000005075%2CENSG00000005100%2CENSG00000005102%2CENSG00000005108%2CENSG00000005156%2CENSG00000005175%2CENSG00000005187%2CENSG00000005189%2CENSG00000005194%2CENSG00000005206%2CENSG00000005238%2CENSG00000005243%2CENSG00000005249%2CENSG00000005302%2CENSG00000005339%2CENSG00000005379%2CENSG00000005381%2CENSG00000005421%2CENSG00000005436%2CENSG00000005448%2CENSG00000005469%2CENSG00000005471%22+%2F%3E%3C%2FDataset%3E%3C%2FQuery%3E

biomaRt gets around this by automatically splitting a query that is too big into multiple parts and combining the results. I'd be happy to implement this in a PR, but I'd first like to hear that a maintainer is around to merge it.
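A rough sketch of the splitting idea, as a wrapper around any query callable. The names `chunked` and `query_in_chunks` are hypothetical helpers, not part of pybiomart, and the default chunk size of 50 is a guess: the only data points above are that 5 values work and 100 fail.

```python
from typing import Any, Callable, Iterator, List, Sequence


def chunked(values: Sequence[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of `values`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def query_in_chunks(
    query_fn: Callable[..., Any],
    filter_name: str,
    values: Sequence[Any],
    chunk_size: int = 50,  # assumed limit, not a documented one
    **query_kwargs: Any,
) -> List[Any]:
    """Call `query_fn` once per chunk of `values`; return the results in order.

    `query_fn` is expected to accept a `filters` keyword, like the
    `dataset.query` call in the example above.
    """
    return [
        query_fn(filters={filter_name: chunk}, **query_kwargs)
        for chunk in chunked(values, chunk_size)
    ]


# Intended use with the setup above (needs network access, so left as a comment):
#   frames = query_in_chunks(dataset.query, "ensembl_gene_id", gene_ids[:100],
#                            attributes=["ensembl_gene_id", "hgnc_symbol"])
#   result = pd.concat(frames, ignore_index=True)
```

Combining with `pd.concat` assumes each call returns a DataFrame with the same columns, which matches the output shown above. Rows may need de-duplicating afterwards if uniqueness across chunks matters.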

kpj commented 4 years ago

I also came across this problem and implemented a basic solution: https://github.com/kpj/pybiomart

So far it only works if at most one filter is provided, because I wasn't sure how to split multiple long filter lists (can they interact, and if so, how?).
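One possible answer, under an assumption worth checking against the BioMart documentation: if the values within one filter are alternatives (OR) and separate filters must all match (AND), then two long lists can be split independently and every chunk of one queried against every chunk of the other. The union of those results then equals the result of the unsplit query. A minimal sketch, with the hypothetical helper `chunked_filter_combinations` (not part of pybiomart):

```python
from itertools import product
from typing import Any, Dict, Iterator, List, Sequence


def split(values: Sequence[Any], size: int) -> List[List[Any]]:
    """Split `values` into consecutive lists of at most `size` items."""
    return [list(values[i:i + size]) for i in range(0, len(values), size)]


def chunked_filter_combinations(
    filters: Dict[str, Any], chunk_size: int = 50  # chunk size is a guess
) -> Iterator[Dict[str, Any]]:
    """Yield one filter dict per combination of chunks.

    List or tuple values are split into chunks; an empty list is kept as a
    single empty chunk, and any other value is passed through unchanged.
    Assumes filters are AND-ed and the values inside one filter are OR-ed.
    """
    names = list(filters)
    per_filter = []
    for name in names:
        value = filters[name]
        if isinstance(value, (list, tuple)):
            per_filter.append(split(value, chunk_size) or [list(value)])
        else:
            per_filter.append([value])
    for combo in product(*per_filter):
        yield dict(zip(names, combo))
```

Each yielded dict would be passed as `filters=` to a separate query and the results concatenated. The number of requests is the product of the chunk counts, so this only stays practical when at most one or two filters are long.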