saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

Search feature only #57

Closed bscrow closed 3 years ago

bscrow commented 4 years ago

This PR contains the same set of changes as #43, minus the writeup files.

codecov[bot] commented 4 years ago

Codecov Report

Merging #57 (ec8cb49) into master (26395b6) will increase coverage by 12.91%. The diff coverage is 70.22%.


@@             Coverage Diff             @@
##           master      #57       +/-   ##
===========================================
+ Coverage   41.17%   54.08%   +12.91%     
===========================================
  Files           5        7        +2     
  Lines        1059     1773      +714     
===========================================
+ Hits          436      959      +523     
- Misses        623      814      +191     
| Impacted Files | Coverage Δ |
|---|---|
| pysradb/cli.py | 0.00% <0.00%> (ø) |
| pysradb/download.py | 22.22% <20.68%> (-2.78%) ↓ |
| pysradb/search.py | 79.81% <79.81%> (ø) |
| pysradb/exceptions.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 26395b6...ec8cb49.

saketkc commented 4 years ago

This is super awesome @bscrow! Many thanks for your contribution and for the awesome work you have done over GSoC2020! I believe this will be a huge help to a lot of researchers!

I have left some comments, most of them minor. It would be great if they could be addressed. Above all, it is particularly important that we output all the URLs rather than selecting the best one ourselves.

Great work!

cc @mvdbeek @amalthomas111

amalthomas111 commented 3 years ago

When using -g it might be a good idea to have a dynamic naming prefix/suffix for the plots. Could use time stamps. Otherwise, plots would be overwritten.

amalthomas111 commented 3 years ago

For -G, -Y, -Z options it would be great if you could create a file in GitHub or locally which users can refer to, compiling possible options for each of these tags. In the help (-h) options, you can refer to the link of this file or local path.

amalthomas111 commented 3 years ago

pysradb search -q "single-cell RNA-seq" -g -D 01-01-2008:01-10-2020

This command does not work. It gives the error: ValueError: bins must be positive, when an integer

amalthomas111 commented 3 years ago

pysradb search -d geo -q "single-cell RNA-seq" -m 10K -o test_1ksc

The first status bar showed 0/100000 [00:00<?, ?it/s], although I specified 10K, not 100K. When I specified 100, it showed 1K, so the count is off by a factor of 10. I think this happens with -d geo, not with sra. For both -m 100 and -m 10K, I got a connection error: http.client.RemoteDisconnected: Remote end closed connection without response, followed by "During handling of the above exception, another exception occurred:". Is this an NCBI issue?

I am getting connection/operation timeouts for almost all -m values above 100 with both db=geo and db=sra. This needs looking into!

bscrow commented 3 years ago

> pysradb search -q "single-cell RNA-seq" -g -D 01-01-2008:01-10-2020
>
> This command does not work. It gives the error: ValueError: bins must be positive, when an integer

I've resolved this bug, as well as a related one that occurs when the query returns entries without a base count: since none of the entries contains base count information, the number of bins computed for the base count histogram is not positive, which raises the error.
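For readers unfamiliar with this error: NumPy's histogram routine raises "bins must be positive, when an integer" when it is handed a bin count of zero or less. The following is a minimal sketch of one way to guard against that, not the fix applied in this PR. The helper name `histogram_bin_count`, the square-root rule, and the cap of 20 bins are all assumptions made for illustration.

```python
import math


def histogram_bin_count(values, max_bins=20):
    """Return a positive bin count, or None when there is nothing to plot.

    Hypothetical helper: the name, the square-root rule and the max_bins
    cap are illustrative assumptions, not pysradb behaviour.
    """
    # Drop missing entries (None or NaN), e.g. records without a base count.
    cleaned = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not cleaned:
        # No usable data: signal the caller to skip the histogram entirely
        # instead of passing a non-positive bin count to the plotting code.
        return None
    # Square-root rule, clamped to the range [1, max_bins].
    return max(1, min(max_bins, math.isqrt(len(cleaned))))
```

A caller would skip the plot when the helper returns `None` and otherwise pass the returned value as the `bins` argument.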

> When using -g it might be a good idea to have a dynamic naming prefix/suffix for the plots. Could use time stamps. Otherwise, plots would be overwritten.

Thanks for the suggestion! I've implemented it in the new commit
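As an illustration of the suggestion, here is a minimal sketch of timestamp-prefixed plot names. It does not reproduce the naming scheme used in the commit; the function name `timestamped_plot_path`, the timestamp format, and the default `svg` extension are assumptions.

```python
from datetime import datetime
from pathlib import Path


def timestamped_plot_path(name, out_dir=".", now=None, ext="svg"):
    """Build a plot path whose file name starts with a timestamp.

    Hypothetical helper: the naming scheme is an assumption, chosen so that
    repeated runs do not overwrite each other's plots.
    """
    # Allow the caller to inject a fixed time, which makes the helper testable.
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return Path(out_dir) / f"{stamp}_{name}.{ext}"
```

Two runs within the same second would still collide under this scheme; appending a counter or a short random suffix would cover that case.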

bscrow commented 3 years ago

> pysradb search -d geo -q "single-cell RNA-seq" -m 10K -o test_1ksc
>
> The first status bar showed 0/100000 [00:00<?, ?it/s], although I specified 10K, not 100K. When I specified 100, it showed 1K, so the count is off by a factor of 10. I think this happens with -d geo, not with sra. For both -m 100 and -m 10K, I got a connection error: http.client.RemoteDisconnected: Remote end closed connection without response, followed by "During handling of the above exception, another exception occurred:". Is this an NCBI issue?
>
> I am getting connection/operation timeouts for almost all -m values above 100 with both db=geo and db=sra. This needs looking into!

I've fixed the issue of 10 times the requested number of entries being retrieved from SRA.

As for the connection error, I couldn't replicate it on my side except by running another pysradb search operation from the same IP address while the above process was running, in which case NCBI terminated my connection for exceeding their API limit. Could you check whether this was the case when you tested pysradb search?
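If dropped connections caused by rate limiting turn out to be the problem, one common mitigation is to retry with exponential backoff. The sketch below shows that pattern in isolation and is not part of this PR. The function name `call_with_backoff` and its parameters are assumptions, and code built on an HTTP library may need to catch that library's own exception types instead.

```python
import http.client
import time


def call_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on dropped connections with exponential backoff.

    Hypothetical helper: name, defaults and the set of caught exceptions are
    illustrative assumptions.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except (http.client.RemoteDisconnected, ConnectionError, TimeoutError):
            if attempt == retries:
                # Out of retries: let the last error propagate to the caller.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so that the delays can be checked without actually waiting.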

bscrow commented 3 years ago

> For -G, -Y, -Z options it would be great if you could create a file in GitHub or locally which users can refer to, compiling possible options for each of these tags. In the help (-h) options, you can refer to the link of this file or local path.

I've added a short guide for these tags as well as for querying GEO DataSets. It can be accessed from the command line using pysradb search --geo-info or by calling GeoSearch.info() in Python.

21e1c1f

saketkc commented 3 years ago

@bscrow a failing example: https://colab.research.google.com/drive/1hN6m7kJ4Xpflvde3wK12Ubzu_Aq3x3qX?usp=sharing

bscrow commented 3 years ago

> @bscrow a failing example: https://colab.research.google.com/drive/1hN6m7kJ4Xpflvde3wK12Ubzu_Aq3x3qX?usp=sharing

I've added a check for no search results in 797da82, which should resolve the error message. To generate statistics, however, instance.search() must be called first to retrieve the search results.

https://colab.research.google.com/drive/1pCmfj-uUDpnBFCXCZoiBw-k82Pi12Otu?usp=sharing
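The call order described above can be sketched with a stand-in class. `ToySearch` and its `statistics()` method are hypothetical and do not reflect pysradb's actual classes; the sketch only illustrates the two guards: `search()` must run first, and an empty result set is reported, not summarised.

```python
class ToySearch:
    """Hypothetical stand-in for a search object; not pysradb's API."""

    def __init__(self, records):
        self._source = list(records)
        self.results = None  # Stays None until search() has been called.

    def search(self):
        # A real implementation would query a remote database here.
        self.results = list(self._source)

    def statistics(self):
        if self.results is None:
            raise RuntimeError("call search() before requesting statistics")
        if not self.results:
            # Guard for the empty case, so no summary is computed on nothing.
            return "No results found; nothing to summarise."
        return f"{len(self.results)} result(s) found."
```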

bscrow commented 3 years ago

My updated documentation for pysradb search is in pull request #51; a live version can currently be found at https://bscrow.github.io/pysradb/commands/search.html