saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License
307 stars 50 forks

API key #61

Closed Maarten-vd-Sande closed 4 years ago

Maarten-vd-Sande commented 4 years ago

I decided to just implement the skeleton instead of leaving it up to you; see #60. This still needs the sleep times changed, but I am not sure which sleep is for what, since some are 1/10th, 1/3rd, 1/2 of a second, etc.

# query detailed metadata for a GEO sample, passing an NCBI API key to SRAweb
import pysradb
print(pysradb.__version__)
print(pysradb.SRAweb(api_key="__secret__").sra_metadata("GSM1020644", detailed=True).experiment_alias.values)

works and does not crash (and takes the same amount of time as without a key).

codecov[bot] commented 4 years ago

Codecov Report

Merging #61 into master will decrease coverage by 0.01%. The diff coverage is 66.66%.


@@            Coverage Diff             @@
##           master      #61      +/-   ##
==========================================
- Coverage   42.23%   42.22%   -0.02%     
==========================================
  Files           5        5              
  Lines        1030     1035       +5     
==========================================
+ Hits          435      437       +2     
- Misses        595      598       +3     
Impacted Files Coverage Δ
pysradb/sraweb.py 84.12% <66.66%> (-0.48%) :arrow_down:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 5e54c4e...aeaa494.

saketkc commented 4 years ago

Thanks @Maarten-vd-Sande, this looks great and is very helpful! I have not revisited the sleep time issues for a while since things have looked mostly stable, but this PR should be a good opportunity to do so.

Might be helpful for @bscrow's PR #57 too. I would need some time to review this.

Maarten-vd-Sande commented 4 years ago

Glad it is considered useful :smile:. Let me know if I can help move this forward somehow.

saketkc commented 4 years ago

Do you think we should propagate self.sleep_time in the retry functions? It is not being used anywhere. One idea would be to not have it and leave the retry block as is (it has a sleep of 0.5 seconds by default).
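
For reference, a minimal sketch of what propagating a configurable sleep into the retry logic could look like (the helper name request_with_retry and its signature are illustrative, not pysradb's actual retry code):

import time

def request_with_retry(fetch, max_retries=3, sleep_time=0.5):
    # `sleep_time` defaults to the 0.5 s mentioned above; passing
    # SRAweb's self.sleep_time here is one way to propagate it.
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(sleep_time)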

Maarten-vd-Sande commented 4 years ago

Actually, I think the API key is not really necessary. I only used SRAweb with detailed=True, and I thought it took so "long" because pysradb looked up each sample separately (which it doesn't). I just discovered that the reason it takes so long is the check for the ENA download link.

I changed the time.sleep calls to use self.sleep_time, so at least that is cleaned up. This works, and the small tests I did do not run into API limits. However, the API key is not really necessary, so we can also close this PR and ignore it :)
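
For illustration, the shape of that change, as a sketch rather than the actual diff (the class and method names here are hypothetical):

import time

class SRAwebSketch:
    def __init__(self, sleep_time=0.5):
        # one configurable attribute instead of scattered constants
        self.sleep_time = sleep_time

    def _throttle(self):
        # before: hard-coded calls such as time.sleep(1/3)
        # after: a single attribute controls the pause everywhere
        time.sleep(self.sleep_time)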

bscrow commented 4 years ago

Yep @saketkc, including an API key for the search feature should speed up large queries for SRA metadata, since the current implementation retrieves metadata in batches of 300 entries. I'll test it out for the search module.
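
As a rough back-of-the-envelope illustration (the batch size of 300 comes from the comment above; 3 vs. 10 requests per second are NCBI's documented E-utilities limits without and with an API key; the record count is made up):

# requests needed for a hypothetical 30,000-record query at 300 records per batch
total_records = 30_000
batch_size = 300
n_requests = -(-total_records // batch_size)  # ceiling division -> 100 requests

print(n_requests / 3)   # ~33 s of rate-limit wait without an API key
print(n_requests / 10)  # ~10 s with an API key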

Maarten-vd-Sande commented 4 years ago

Seems like I keep missing stuff in the code :eyes:. Where is this split (batches of 300) made, @bscrow?

bscrow commented 4 years ago

@Maarten-vd-Sande, this split is only implemented in the upcoming pysradb search feature (#57), which differs from the pysradb metadata feature found in pysradb/sraweb.py in that it fetches metadata based on text queries instead of accession numbers.

saketkc commented 4 years ago

I guess this is still useful if someone is interested in trying out the API key, and the current behavior remains unaffected. I might expose this on the command line in the future.

What are your thoughts @Maarten-vd-Sande and @bscrow?

Maarten-vd-Sande commented 4 years ago

It doesn't hurt to have it, although it might be misleading in suggesting that it speeds up pysradb under "normal" usage. If it helps with the search function, then that's nice!

I have to say it requires more thorough testing to see if it hits the API limit, since I haven't really stress-tested it.

saketkc commented 4 years ago

I decided to merge it. In my experience, stress-testing NCBI is hard (very unreliable behavior overall). Since it doesn't change the current behavior and might be useful (hopefully), we can keep it.

saketkc commented 4 years ago

Many thanks for your contribution @Maarten-vd-Sande!

Maarten-vd-Sande commented 3 years ago

Just an FYI, every once in a while I get:

Unable to parse xml: {"error":"API rate limit exceeded","api-key":"131.174.27.98","count":"4","limit":"3"}

Maybe the sleep time should be slightly increased to avoid this.
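
One generic way to guard against that error, as a sketch (fetch_with_backoff and the use of requests are illustrative, not pysradb's implementation):

import time
import requests

def fetch_with_backoff(url, params, sleep_time=0.5, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url, params=params)
        # NCBI returns the JSON error quoted above when the limit is hit
        if "API rate limit exceeded" not in response.text:
            return response
        # back off a little longer on each retry
        time.sleep(sleep_time * (attempt + 1))
    return response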

saketkc commented 3 years ago

Thanks for catching this, @Maarten-vd-Sande. Do you have an example I can test this against?

Maarten-vd-Sande commented 3 years ago

Let me see if I can cook something up :)