saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

[BUG] EnaSearch failing depending on search term ordering #154

Closed by xapple 2 years ago

xapple commented 2 years ago

Describe the bug

EnaSearch sometimes raises a JSONDecodeError, depending on which search terms are entered.

To Reproduce

Here are two examples: one that works and one that doesn't. The search term is 'human cancer' in the first and 'cancer human' in the second.

In [14]: from pysradb.search import EnaSearch
    ...: instance = EnaSearch(query="human cancer")
    ...: instance.search()
    ...: df = instance.get_df()
    ...: print(df)

   study_accession experiment_accession  ... read_count  base_count
0       PRJEB29129           ERX2841282  ...   29093158  4422160016
1       PRJEB29129           ERX2841283  ...   29487672  4482126144
2       PRJEB29129           ERX2841284  ...   33709841  5123895832
3       PRJEB29129           ERX2841285  ...   31243280  4748978560
4       PRJEB29129           ERX2841286  ...   30921754  4700106608
5       PRJEB29129           ERX2841287  ...   31732928  4823405056
6       PRJEB41842           ERX4791105  ...   19954200  1006152240
7       PRJEB41842           ERX4791106  ...   17384713   876537479
8       PRJEB41842           ERX4791107  ...   17246780   869313888
9       PRJEB41842           ERX4791108  ...   15413300   776685200
10      PRJEB41842           ERX4791109  ...   19842185  1000033247
11      PRJEB41842           ERX4791110  ...   16730305   842470774
12      PRJEB41842           ERX4791111  ...   17644294   889720825
13      PRJEB41842           ERX4791112  ...   17741468   893920492
14      PRJEB41842           ERX4791113  ...   18842441   949579581
15      PRJEB41842           ERX4791114  ...   21788717  1097976461
16      PRJEB41842           ERX4791115  ...   19482294   981684676
17      PRJEB41842           ERX4791116  ...   17348059   874294274
18      PRJEB41842           ERX4791117  ...   18328392   923814507
19      PRJEB41842           ERX4791118  ...   18031704   908178450

[20 rows x 15 columns]

In [15]: from pysradb.search import EnaSearch
    ...: instance = EnaSearch(query="cancer human")
    ...: instance.search()
    ...: df = instance.get_df()
    ...: print(df)

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
~/Library/Python/3.9/lib/python/site-packages/requests/models.py in json(self, **kwargs)
    909         try:
--> 910             return complexjson.loads(self.text, **kwargs)
    911         except JSONDecodeError as e:

/usr/local/lib/python3.9/site-packages/simplejson/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, use_decimal, **kw)
    524             and not use_decimal and not kw):
--> 525         return _default_decoder.decode(s)
    526     if cls is None:

/usr/local/lib/python3.9/site-packages/simplejson/decoder.py in decode(self, s, _w, _PY3)
    369             s = str(s, self.encoding)
--> 370         obj, end = self.raw_decode(s)
    371         end = _w(s, end).end()

/usr/local/lib/python3.9/site-packages/simplejson/decoder.py in raw_decode(self, s, idx, _w, _PY3)
    399                 idx += 3
--> 400         return self.scan_once(s, idx=_w(s, idx).end())

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
<ipython-input-15-a323eb3971e7> in <module>
      2
      3 instance = EnaSearch(query="cancer human")
----> 4 instance.search()
      5 df = instance.get_df()
      6 print(df)

~/Library/Python/3.9/lib/python/site-packages/pysradb/search.py in search(self)
   1265             )
   1266             r.raise_for_status()
-> 1267             self._format_result(r.json())
   1268         except requests.exceptions.Timeout:
   1269             sys.exit(f"Connection to the server has timed out. Please retry.")

~/Library/Python/3.9/lib/python/site-packages/requests/models.py in json(self, **kwargs)
    915                 raise RequestsJSONDecodeError(e.message)
    916             else:
--> 917                 raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
    918
    919     @property

JSONDecodeError: [Errno Expecting value] : 0

saketkc commented 2 years ago

Thanks for catching this! I can confirm that this is a bug. cc @bscrow

bscrow commented 2 years ago

I could not reproduce this bug in Python 3.7 (see colab example), where the query "cancer human" returns no search results and prints the expected message:

No results found for the following search query: {'query': 'CANCER HUMAN', 'accession': None, 'organism': None, 'layout': None, 'mbases': None, 'publication_date': None, 'platform': None, 'selection': None, 'source': None, 'strategy': None, 'title': None}

The traceback suggests an error when calling r.json() on an empty query result. Can I check which version of the requests library you are using?
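
To illustrate the failure mode described here, below is a minimal sketch of decoding a response body defensively. It assumes the server sends an empty body when nothing matches, which this thread suggests but does not confirm. `parse_json_or_empty` is a hypothetical helper, not part of pysradb or requests, and it uses only the standard-library json module so it runs without a network connection.

```python
import json
from typing import Any


def parse_json_or_empty(body: str) -> Any:
    """Hypothetical helper: decode a JSON body, treating an empty body as 'no results'."""
    if not body or not body.strip():
        # An empty string is not valid JSON, so json.loads("") would raise
        # a JSONDecodeError ("Expecting value: line 1 column 1 (char 0)").
        return []
    return json.loads(body)


# An empty body is mapped to an empty result list instead of raising.
print(parse_json_or_empty(""))
# A non-empty body is decoded as usual.
print(parse_json_or_empty('[{"study_accession": "PRJEB29129"}]'))
```

With a requests response `r`, the same idea would be to pass `r.text` to such a helper instead of calling `r.json()` directly. Whether that is the right fix for pysradb is an open question in this thread.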

xapple commented 2 years ago

Version '2.27.1', which is the latest at the time of writing. But I'm on Python 3.9.9, so I'm going to guess something changed with the built-in json module between versions.
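
One way to narrow this down is to compare the environments directly. The sketch below collects the Python version and the installed versions of requests and simplejson. simplejson is included because the traceback above passes through simplejson rather than the built-in json module, and as far as I know requests 2.x prefers simplejson when it is importable. `environment_report` is a hypothetical helper name, and it relies on importlib.metadata, so it needs Python 3.8 or newer.

```python
import importlib.util
import platform
from importlib import metadata
from typing import Optional


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> dict:
    """Hypothetical helper: gather the versions relevant to this bug report."""
    return {
        "python": platform.python_version(),
        "requests": installed_version("requests"),
        "simplejson": installed_version("simplejson"),
        # True when `import simplejson` would succeed in this interpreter.
        "simplejson_importable": importlib.util.find_spec("simplejson") is not None,
    }


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Running this in both the failing and the working interpreter would show whether the two setups differ in more than the Python version.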

saketkc commented 2 years ago

I just tried this with Python 3.10.2 and requests 2.27.1 and was unable to replicate as well:

$ python
Python 3.10.2 | packaged by conda-forge | (main, Feb  1 2022, 19:29:00) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pysradb.search import EnaSearch
>>> instance = EnaSearch(query="cancer human")
>>> instance.search()
No results found for the following search query:
 {'query': 'CANCER HUMAN', 'accession': None, 'organism': None, 'layout': None, 'mbases': None, 'publication_date': None, 'platform': None, 'selection': None, 'source': None, 'strategy': None, 'title': None}
>>> df = instance.get_df()
>>> print(df)
Empty DataFrame
Columns: []
Index: []

saketkc commented 2 years ago

Can you try again @xapple? Not sure if it might be due to an intermittent connectivity issue?

xapple commented 2 years ago

I installed and upgraded the latest pythons on my computer:

I can confirm that the problem is present only in the 3.9 series and not in 3.10.

saketkc commented 2 years ago

Indeed, we see this in the tests as well: https://github.com/saketkc/pysradb/runs/5465673995?check_suite_focus=true#step:7:76. But it looks like a recent problem (a couple of weeks old), since the earlier tests pass with 3.9: https://github.com/saketkc/pysradb/runs/5309144937?check_suite_focus=true