saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

[GSoC2020] Basic search feature #43

Closed bscrow closed 4 years ago

bscrow commented 4 years ago

Implemented the search feature for phase 1 of GSoC

codecov[bot] commented 4 years ago

Codecov Report

Merging #43 into master will increase coverage by 12.81%. The diff coverage is 71.84%.

@@             Coverage Diff             @@
##           master      #43       +/-   ##
===========================================
+ Coverage   41.93%   54.74%   +12.81%     
===========================================
  Files           5        7        +2     
  Lines        1023     1706      +683     
===========================================
+ Hits          429      934      +505     
- Misses        594      772      +178     
Impacted Files          Coverage Δ
pysradb/cli.py          0.00% <0.00%> (ø)
pysradb/download.py     20.25% <12.50%> (-4.75%)
pysradb/search.py       81.29% <81.29%> (ø)
pysradb/exceptions.py   100.00% <100.00%> (ø)
pysradb/sraweb.py       83.25% <0.00%> (-1.33%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update da62643...25f4c13.

saketkc commented 4 years ago

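Thanks @bscrow for the latest updates! It seems to be progressing pretty well so far!

Just a couple of things:

  • Any search query (verbose or otherwise) should always output study_accession as the first column, since it is the most useful info
  • I tried the sra_geo mode:
pysradb search -q "ferret" --max 1000 --db sra_geo 

however we should expect ~650 results. It is possible you are still working on it, so let me know when it is ready for review.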

Great work so far!

saketkc commented 4 years ago

Also, once it is ready, it would be worthwhile to check that all the tests pass. It seems we have some failing tests in test_search.py.

bscrow commented 4 years ago

> Thanks @bscrow for the latest updates! It seems to be progressing pretty well so far!
>
> Just a couple of things:
>
>   • Any search query (verbose or otherwise) should always output study_accession as the first column, since it is the most useful info
>   • I tried the sra_geo mode:
> pysradb search -q "ferret" --max 1000 --db sra_geo 
>
> however we should expect ~650 results. It is possible you are still working on it, so let me know when it is ready for review.
>
> Great work so far!

I have just debugged GeoSearch, so it should work as intended now. I'll update the detailed documentation after finishing with the tests. But essentially:

Running pysradb search -q "ferret" --max 1000 --db sra_geo sends this query to SRA: 'ferret AND sra gds[Filter]'. The sra gds[Filter] clause ensures that the entries in the results can also be found in GEO DataSets.
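As a rough illustration of the query described above, here is a minimal sketch of how the filtered term could be assembled into an NCBI ESearch request. This is not the code in this PR: the helper names `build_sra_geo_term` and `build_esearch_url` are hypothetical, and the sketch only builds the URL without sending it.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_geo_term(query):
    """Hypothetical helper: restrict an SRA query to entries also in GEO DataSets."""
    return f"{query} AND sra gds[Filter]"


def build_esearch_url(db, term, retmax):
    """Hypothetical helper: assemble an ESearch request URL without sending it."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


term = build_sra_geo_term("ferret")
url = build_esearch_url("sra", term, 1000)
print(term)
print(url)
```

The filter is appended with AND, so it narrows the user's query rather than replacing it.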

To query GEO DataSets instead, you can run pysradb search --geo-query ferret --max 1000 --db sra_geo. This sends the query ferret AND gds sra[Filter] to GEO DataSets. The GDS uids from the response are then converted to "related" SRA uids via ELink. This produces the same results as the "Find related data" feature on the website (shown below).

[image: "Find related data" feature on the GEO DataSets website]

Website result: [image]

pysradb search result: [image]
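The two-step GEO DataSets flow described above (search GDS, then convert uids with ELink) could be sketched roughly as follows. The function names are hypothetical and the uids are made-up placeholders, not real search results; the sketch only constructs the two E-utilities request URLs and does not reflect how this PR implements GeoSearch.

```python
from urllib.parse import urlencode

# Base URL of the public NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_geo_term(query):
    """Hypothetical helper: restrict a GEO DataSets query to entries linked to SRA."""
    return f"{query} AND gds sra[Filter]"


def build_gds_search_url(query, retmax):
    """Step 1: ESearch against the gds database."""
    params = {
        "db": "gds",
        "term": build_geo_term(query),
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_gds_to_sra_link_url(gds_uids):
    """Step 2: ELink from gds uids to related sra uids."""
    params = {
        "dbfrom": "gds",
        "db": "sra",
        "id": ",".join(gds_uids),
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


search_url = build_gds_search_url("ferret", 1000)
# Placeholder uids for illustration only; real ones come from the step 1 response.
link_url = build_gds_to_sra_link_url(["111", "222"])
print(search_url)
print(link_url)
```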

However, I feel that my current implementation of GeoSearch may not be optimal.

I have noticed that ELink doesn't seem to retrieve the exact corresponding entries in SRA. For example, GSE142617/SRP238838, or the 6 experiments that it encompasses, appears in the above search on GEO DataSets but doesn't show up among the entries after the ELink conversion.

On the other hand, it is possible to find SRA entries corresponding to GEO DataSets search results by downloading the summaries of both search results and then matching accession numbers between them, but I can't think of a very efficient way of doing this for queries such as "e coli", which yield many search results on both APIs.
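One possible way to keep the matching step itself cheap is to index one set of summaries by accession and look the other set up in that index, which takes time linear in the number of records. The sketch below assumes each summary has already been parsed into a dict; the field names (`gse`, `sra_study`, `study_accession`, `experiment_accession`) and the sample records are hypothetical, apart from the GSE142617/SRP238838 pair mentioned above. It does not reduce the cost of downloading the summaries.

```python
def index_by_accession(records, key):
    """Group records by the accession stored under `key`."""
    index = {}
    for record in records:
        index.setdefault(record[key], []).append(record)
    return index


def match_summaries(gds_summaries, sra_summaries):
    """Map each GSE to the SRA experiments that share its study accession."""
    sra_by_study = index_by_accession(sra_summaries, "study_accession")
    matches = {}
    for gds in gds_summaries:
        srp = gds.get("sra_study")
        if srp in sra_by_study:
            matches[gds["gse"]] = [
                record["experiment_accession"] for record in sra_by_study[srp]
            ]
    return matches


# Made-up records for illustration; the experiment accessions are placeholders.
gds_summaries = [
    {"gse": "GSE142617", "sra_study": "SRP238838"},
    {"gse": "GSE000000", "sra_study": "SRP000000"},
]
sra_summaries = [
    {"study_accession": "SRP238838", "experiment_accession": "SRX0000001"},
    {"study_accession": "SRP238838", "experiment_accession": "SRX0000002"},
    {"study_accession": "SRP999999", "experiment_accession": "SRX0000003"},
]

matches = match_summaries(gds_summaries, sra_summaries)
print(matches)
```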

saketkc commented 4 years ago

Hi @bscrow, can you rebase with master (resolving the conflicts)?

Any updates based on our previous discussion?

saketkc commented 4 years ago

I forgot to mention earlier, but we also want to support https://github.com/saketkc/pysradb/issues/38. Do you have a notebook for this?

saketkc commented 4 years ago

Hi @bscrow, would you be able to create a new PR (from a new branch) that is similar to this PR but without any writeup sections? I have reviewed it and it looks good so far; I will make the small fixes on my end.

Planning to merge it in the coming week. Thanks!

bscrow commented 4 years ago

No problems! I've created the new PR: #57

saketkc commented 4 years ago

Awesome, thanks a lot @bscrow! Closing in favor of #57