Entrez database fetching

aleferna12 commented 2 years ago

Hey guys. I have recently started to adapt a bunch of code that my lab has developed to make use of skbio instead of our own bioinformatics library. One of the features we use a lot in our projects is an implementation of the Entrez API. Would I be correct to understand that accession to this API is not yet implemented in skbio? Would it be of interest to the project to add this feature?

Why the feature would be useful:

The Entrez API is the main interface of access to the widely used NCBI sequence databases
Often fetching resources from online databases is much easier than manually downloading everything
NCBI servers are very robust and can be used reliably by applications, meaning that many scikit-bio based frameworks and aplications could benefit from it

Besides, this feature would be unusualy simple to implement, as there really is not much code to it (our whole docummented implementation is less than 200 lines of code).

wasade commented 2 years ago

Hi @aleferna12 that is correct I believe, and it would be an amazing contribution!!

aleferna12 commented 2 years ago

Ill be glad to fork the repo and work on it then! In which submodule would this feature fit best, @wasade? Maybe a new one should be created for it (and other database fetching code)?

wasade commented 2 years ago

@aleferna12 that's wonderful, thank you so much! How does skbio.db sound?

wasade commented 2 years ago

...sorry, accidentally clicked the wrong button

aleferna12 commented 2 years ago

@aleferna12 that's wonderful, thank you so much! How does skbio.db sound?

Sounds good to me. I'll update this tread once the fork is ready to be merged then

wasade commented 2 years ago

Wonderful, thank you!

aleferna12 commented 2 years ago

So, I've been working on this and the code is mostly complete and already functional. However, there are a few things I'm indecisive about and would like some input on before I waste a ton of time implementing:

The Entrez API docs don't define explicit default values for most parameters. Usually the parameters in the API are either set to a specific value or ommited (which results in the default behavior). For an example of this, see the "version" parameter of EInfo. Right now, I'm implementing any undefined defaults as "None", but maybe a type such as "EntrezDefault" would be better? Or maybe we should use a dummy guessed value such as "1.0" (in the example of EInfo "version") as to differentiate the cases in which the default value of the Entrez API is omitted from the cases in which there is no default value (such as in EInfo's "db" parameter).
This is also related to point 1. Maybe we shouldn't make every parameter from the API explicit? And instead just use **kwargs to allow the user to pass the desired arguments to the calls. I actually don't like this approach because it seems much more cryptic and lazy, but it is what other libraries such as Biopython have preferred. I've no issues implementing every parameter from each function of the API in python but I would like to know we are in the same page regarding this matter as not to waste my time.

Bellow is an example of the implementation of ESearch that shows why I'm unsure of how to approach this matter:

def esearch(db,
            term,
            usehystory=None,
            webenv=None,
            query_key=None,
            retstart=0,
            retmax=20,
            rettype="uilist",
            retmode="xml",
            sort=None,
            field=None,
            idtype=None,
            datetype=None,
            reldate=None,
            mindate=None,
            maxdate=None,
            parse_mode="utf8",
            max_tries=2,
            interval=5):
    """Placeholder for an absurd amount of documentation copied from the Entrez website describing each parameter"""
    parameters = {
        "db": db,
        "term": term,
        "usehystory": usehystory,
        "webenv": webenv,
        "query_key": query_key,
        "retstart": retstart,
        "retmax": retmax,
        "rettype": rettype,
        "retmode": retmode,
        "sort": sort,
        "field": field,
        "idtype": idtype,
        "datetype": datetype,
        "reldate": reldate,
        "mindate": mindate,
        "maxdate": maxdate
    }
    # Only this bit of code would actually be required if we omitted the parameters and used **kwargs instead  
    if type(mindate) != type(maxdate):
        raise ValueError("'mindate' and 'maxdate' parameters must be either both defined or both"
                         "undefined")
    return _make_entrez_request("esearch.fcgi?",
                                parameters,
                                parse_mode=parse_mode,
                                max_tries=max_tries,
                                interval=interval)

What do you reckon @wasade? Eager to hear your opinions on this.

wasade commented 2 years ago

That's great @aleferna12! I think it would be better to enumerate the arguments rather than kwargs, with reasonable defaults if listed in Entrez, and leave as None otherwise (unless there are realistic defaults)? I realize this is more typing, but as you note, it the kwargs is a bit more cryptic to track down what needs to be done.

If pulling the text for the arguments from another source, would it be possible to make sure the attribution is explicit in the Notes section, and to include a References block for users to find more detail?

aleferna12 commented 2 years ago

Nice! I agree. I think that to avoid confusion it would be best to default parameters to None unless they are explicitly defined in the API. I'll make sure to include links to the Entrez docs in each function and in the module docstring as well.

wasade commented 2 years ago

Thank you!

aleferna12 commented 2 years ago

Hey! How are we going to deploy tests for this module? I currently have basic tests that compare the request url resulting from a function call to an expected url taken from entrez docs examples. More testing than that would have to be performed online by actually making sending requests to the server I suppose... Is this a problem?

wasade commented 2 years ago

If the requests are small, and effectively assured to remain valid, and the suite issues a relatively small number of them then I think it is fine. Another thing done in this space is to use patch or mock from the unittest module.

aleferna12 commented 2 years ago

Hey, sorry to bother again. Ive resumed working on this issue and just want to confirm that the tests I have developed are significant since I didnt have much prior experience with the unittest module.

This is the approach I came up with (the implementation of some functions was omitted for clarity).

class TestOffline(unittest.TestCase):
    """Set os tests that can be performed offline.

    These tests ensure a correspondence between the URL generated by the entrez functions and the
    expected tested URL. They also test if `entrez._parse_response` can correctly convert between
    types (doesn't test if the content of the response is correct though).
    """
    def test_einfo_no_args(self):
        # Because the default arguments must be included in the URL, a parameterized URL is
        # inevitable even if we call 'entrez.einfo()' with no arguments
        expected_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=xml"

        with urlopen_patch() as patch:
            result = entrez.einfo()

        self.assertEqual(expected_url, get_url_from_patch(patch))
        self.assertIsInstance(result, str)

class TestOnline(unittest.TestCase):
    """Set of tests that can only be performed online.

    These tests compare the output of an actual call of entrez functions against pre-obtained files
    that represent the expected result. Both the retrieval of text data and further parsing by the
    `entrez._parse_response` are tested.

    Presumes there is access to the internet and that the Entrez API is stable enough to always
    return the same response (it should be).
    """
    def test_einfo_retmode_json(self):
        einfo_result = entrez.einfo(retmode='json', parse_mode='json')
        with open(get_data_path("einfo_retmode_json.json")) as file:
            expected_result = json.load(file)

        self.assertEqual(einfo_result, expected_result)

What do you think about this? It's the last thing on the checklist before I submit the pull request.

wasade commented 2 years ago

Thanks! I think this makes sense. For the online tests, it would be advantageous to test specific fields that should never change, rather than testing for exact equality on the assumption the external resource may itself fluctuate in ways that we aren't concerned about

aleferna12 commented 2 years ago

Hey! Sorry that I haven't updated this in a while. I submitted the pull request just now, if anything needs revisions let me know.

scikit-bio / scikit-bio

Entrez database fetching #1799

Why the feature would be useful: