Open aleferna12 opened 2 years ago
Hi @aleferna12 that is correct I believe, and it would be an amazing contribution!!
Ill be glad to fork the repo and work on it then! In which submodule would this feature fit best, @wasade? Maybe a new one should be created for it (and other database fetching code)?
@aleferna12 that's wonderful, thank you so much! How does skbio.db
sound?
...sorry, accidentally clicked the wrong button
@aleferna12 that's wonderful, thank you so much! How does
skbio.db
sound?
Sounds good to me. I'll update this tread once the fork is ready to be merged then
Wonderful, thank you!
So, I've been working on this and the code is mostly complete and already functional. However, there are a few things I'm indecisive about and would like some input on before I waste a ton of time implementing:
Bellow is an example of the implementation of ESearch that shows why I'm unsure of how to approach this matter:
def esearch(db,
term,
usehystory=None,
webenv=None,
query_key=None,
retstart=0,
retmax=20,
rettype="uilist",
retmode="xml",
sort=None,
field=None,
idtype=None,
datetype=None,
reldate=None,
mindate=None,
maxdate=None,
parse_mode="utf8",
max_tries=2,
interval=5):
"""Placeholder for an absurd amount of documentation copied from the Entrez website describing each parameter"""
parameters = {
"db": db,
"term": term,
"usehystory": usehystory,
"webenv": webenv,
"query_key": query_key,
"retstart": retstart,
"retmax": retmax,
"rettype": rettype,
"retmode": retmode,
"sort": sort,
"field": field,
"idtype": idtype,
"datetype": datetype,
"reldate": reldate,
"mindate": mindate,
"maxdate": maxdate
}
# Only this bit of code would actually be required if we omitted the parameters and used **kwargs instead
if type(mindate) != type(maxdate):
raise ValueError("'mindate' and 'maxdate' parameters must be either both defined or both"
"undefined")
return _make_entrez_request("esearch.fcgi?",
parameters,
parse_mode=parse_mode,
max_tries=max_tries,
interval=interval)
What do you reckon @wasade? Eager to hear your opinions on this.
That's great @aleferna12! I think it would be better to enumerate the arguments rather than kwargs
, with reasonable defaults if listed in Entrez, and leave as None
otherwise (unless there are realistic defaults)? I realize this is more typing, but as you note, it the kwargs
is a bit more cryptic to track down what needs to be done.
If pulling the text for the arguments from another source, would it be possible to make sure the attribution is explicit in the Notes
section, and to include a References
block for users to find more detail?
Nice! I agree. I think that to avoid confusion it would be best to default parameters to None
unless they are explicitly defined in the API. I'll make sure to include links to the Entrez docs in each function and in the module docstring as well.
Thank you!
Hey! How are we going to deploy tests for this module? I currently have basic tests that compare the request url resulting from a function call to an expected url taken from entrez docs examples. More testing than that would have to be performed online by actually making sending requests to the server I suppose... Is this a problem?
If the requests are small, and effectively assured to remain valid, and the suite issues a relatively small number of them then I think it is fine. Another thing done in this space is to use patch
or mock
from the unittest
module.
Hey, sorry to bother again. Ive resumed working on this issue and just want to confirm that the tests I have developed are significant since I didnt have much prior experience with the unittest module.
This is the approach I came up with (the implementation of some functions was omitted for clarity).
class TestOffline(unittest.TestCase):
"""Set os tests that can be performed offline.
These tests ensure a correspondence between the URL generated by the entrez functions and the
expected tested URL. They also test if `entrez._parse_response` can correctly convert between
types (doesn't test if the content of the response is correct though).
"""
def test_einfo_no_args(self):
# Because the default arguments must be included in the URL, a parameterized URL is
# inevitable even if we call 'entrez.einfo()' with no arguments
expected_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=xml"
with urlopen_patch() as patch:
result = entrez.einfo()
self.assertEqual(expected_url, get_url_from_patch(patch))
self.assertIsInstance(result, str)
class TestOnline(unittest.TestCase):
"""Set of tests that can only be performed online.
These tests compare the output of an actual call of entrez functions against pre-obtained files
that represent the expected result. Both the retrieval of text data and further parsing by the
`entrez._parse_response` are tested.
Presumes there is access to the internet and that the Entrez API is stable enough to always
return the same response (it should be).
"""
def test_einfo_retmode_json(self):
einfo_result = entrez.einfo(retmode='json', parse_mode='json')
with open(get_data_path("einfo_retmode_json.json")) as file:
expected_result = json.load(file)
self.assertEqual(einfo_result, expected_result)
What do you think about this? It's the last thing on the checklist before I submit the pull request.
Thanks! I think this makes sense. For the online tests, it would be advantageous to test specific fields that should never change, rather than testing for exact equality on the assumption the external resource may itself fluctuate in ways that we aren't concerned about
Hey! Sorry that I haven't updated this in a while. I submitted the pull request just now, if anything needs revisions let me know.
Hey guys. I have recently started to adapt a bunch of code that my lab has developed to make use of skbio instead of our own bioinformatics library. One of the features we use a lot in our projects is an implementation of the Entrez API. Would I be correct to understand that accession to this API is not yet implemented in skbio? Would it be of interest to the project to add this feature?
Why the feature would be useful:
Besides, this feature would be unusualy simple to implement, as there really is not much code to it (our whole docummented implementation is less than 200 lines of code).