saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License
310 stars 51 forks source link

[BUG] ValueError when using SraSearch to query #88

Closed kpj closed 3 years ago

kpj commented 3 years ago

Describe the bug Using SraSearch with verbosity>=2 and a large query raises a ValueError when setting self.df["pmid"] = list(uids) (https://github.com/saketkc/pysradb/blob/c23d4a769543d05a0f002d1b28c985da5963573f/pysradb/search.py#L776) because the size of the underlying dataframe seems to vary.

The following error is raised:

ValueError                                Traceback (most recent call last)
<ipython-input-1-d19c33e66199> in <module>
      1 instance = SraSearch(verbosity=3, return_max=max_query_num, query=query, platform='illumina')
----> 2 instance.search()
      3 df_search = instance.get_df()

/pysradb/search.py in search(self)
    774             self._format_result()
    775             if self.verbosity >= 2:
--> 776                 self.df["pmid"] = list(uids)
    777         except requests.exceptions.Timeout:
    778             sys.exit(f"Connection to the server has timed out. Please retry.")

/pandas/core/frame.py in __setitem__(self, key, value)
   3038         else:
   3039             # set column
-> 3040             self._set_item(key, value)
   3041
   3042     def _setitem_slice(self, key: slice, value):

/pandas/core/frame.py in _set_item(self, key, value)
   3114         """
   3115         self._ensure_valid_index(value)
-> 3116         value = self._sanitize_column(key, value)
   3117         NDFrame._set_item(self, key, value)
   3118

/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3762
   3763             # turn me into an ndarray
-> 3764             value = sanitize_index(value, self.index)
   3765             if not isinstance(value, (np.ndarray, Index)):
   3766                 if isinstance(value, list) and len(value) > 0:

/pandas/core/internals/construction.py in sanitize_index(data, index)
    745     """
    746     if len(data) != len(index):
--> 747         raise ValueError(
    748             "Length of values "
    749             f"({len(data)}) "

ValueError: Length of values (86768) does not match length of index (86721)

Multiple runs yield slightly different error messages:

ValueError: Length of values (86768) does not match length of index (86760)

It seems like the index length is varying for some reason.

To Reproduce Execute the following code:

from pysradb.search import SraSearch

max_query_num = 1_000_000
query = 'txid2697049[Organism:noexp] AND ("filetype cram"[Properties] OR "filetype bam"[Properties] OR "filetype fastq"[Properties])'

instance = SraSearch(verbosity=2, return_max=max_query_num, query=query, platform='illumina')
instance.search()
df_search = instance.get_df()

Desktop:

saketkc commented 3 years ago

Thanks for the bug report @kpj! @bscrow would you be able to take a look at this?

bscrow commented 3 years ago

I wasn't able to reproduce the bug; here's an example: https://colab.research.google.com/drive/15cRhVw7Cy86N4JWWWtHfyJgr0mtwImdO?usp=sharing

After discussing with @saketkc, one possible reason for the bug would be that when the query was being processed, some of the entries were in the process of being submitted to SRA, such that while the pmid is returned in the search result, some of the corresponding metadata is not yet ready and could not be retrieved.

Either way, I think that I could have handled the error in a better way. Will make a pr soon

kpj commented 3 years ago

Thanks a lot for taking a look!

You make an interesting point, I just tried it again and it worked fine. It might indeed be the case that some samples were added during the query.

Maybe it would make sense to retrieve the UIDs after iterating through all accessions instead of doing it before (and making sure they are in the right order, etc).

bscrow commented 3 years ago

UIDs are not found in the full XML file from SRA, so I don't see an efficient way to check that the UIDs are in the right order. I feel that maybe a better idea would be to store the UID list separately (#92 )