ropensci / software-review

rOpenSci Software Peer Review.

chemspiderapi #329

Closed RaoulWolf closed 2 years ago

RaoulWolf commented 4 years ago

Submitting Author Name: Raoul Wolf
Submitting Author Github Handle: @RaoulWolf
Repository: https://github.com/RaoulWolf/chemspiderapi
Version submitted: 0.0.2 2021-03-15
Editor: @jooolia
Reviewers: @rajarshi, @yufree, @data-datum

Due date for @rajarshi: 2021-03-15
Due date for @yufree: 2021-03-15
Due date for @data-datum: 2021-03-15

Archive: TBD
Version accepted: TBD


Package: chemspiderapi
Type: Package
Title: R Wrapper for ChemSpider's API Services
Version: 0.0.2
Authors@R: person("Raoul", "Wolf", email = "raoul.wolf@niva.no", role = c("aut", "cre"))
Description: ChemSpider has announced a fundamental change to the syntax of their API services in late 2018.
    This package provides convenience wrappers for the new API functionalities of ChemSpider, as well as complementary functions.
License: MIT + file LICENSE
Depends: R (>= 3.5.0)
Imports: 
    curl, 
    jsonlite
Suggests: 
    covr,
    keyring,
    knitr,
    magick,
    memoise,
    ratelimitr,
    rmarkdown,
    testthat
URL: https://github.com/RaoulWolf/chemspiderapi
Encoding: UTF-8
LazyData: true
ByteCompile: true
RoxygenNote: 6.1.1
VignetteBuilder: knitr

Scope

The chemspiderapi package is an easy-to-use R interface to all new ChemSpider API functionalities, as introduced in ChemSpider's complete redesign of its API structure in late 2018.

Researchers and citizen scientists who work on anything chemistry-related and need to routinely query against any of ChemSpider's API services.

The rOpenSci-maintained package webchem aims at offering (currently outdated / not working) functionality for accessing ChemSpider's API services. However, the redesign of ChemSpider's API structures in late 2018 broke all available functionalities.

Pre-submission enquiry: https://github.com/ropensci/software-review/issues/294 @melvidoni

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

JOSS Options
- [x] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [x] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)

MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

annakrystalli commented 4 years ago

Hello @RaoulWolf, just in the process of assigning an editor. In the meantime, just checking, did you mean to submit your fork of the repository?

RaoulWolf commented 4 years ago

Hi @annakrystalli, yes I submitted my fork on purpose. I couldn't get an OK to incorporate AppVeyor, CodeCov and Travis CI on the other repository, so I forked it into my own account and added the functionality.

annakrystalli commented 4 years ago

OK, that's fine. I was just wondering whether further development would be done in the upstream fork or yours?

Also, we require test coverage of at least 75% before review. Your codecov badge indicates only 10% test coverage. Is there a reason for that?

RaoulWolf commented 4 years ago

Further development would be at the upstream fork, yes. It looks like I might be able to get Circle CI and CodeCov up and running on the upstream fork. Would that be preferable then?

Coverage is an issue indeed, but comes with the nature of the package. I've increased the coverage considerably (>30%), but I'm not sure how to increase it further. The main issue is that I do not want the tests to run the actual API queries because there's a rate limit on the queries. Any recommendations are more than welcome :)

EDIT: the upstream fork (https://github.com/NIVANorge/chemspiderapi) is now updated with Circle CI and CodeCov. Just let me know if you want me to change the details in the OP.

annakrystalli commented 4 years ago

Hey @RaoulWolf.

Thanks for the updates and your efforts with improving test coverage!

Regarding testing the API, have you had a look at package httptest? This should help with what you're trying to achieve.

RaoulWolf commented 4 years ago

Hey @annakrystalli,

thanks for the heads up! I'll give httptest a try asap :)

sckott commented 4 years ago

@RaoulWolf Another option for caching tests (not doing real requests) is the vcr package, but it will only work if you use crul or httr instead of curl

RaoulWolf commented 4 years ago

Thanks @sckott for the heads up on vcr.

As you indicated, vcr (currently?) only works for crul or httr; also httptest does not work for curl. The reason I chose curl for the package was the relative flexibility over httr when assembling headers and data fields. In some instances I was not able to set up working API calls using httr.

I will take a further look into crul - of course I would prefer a solution where httptest or vcr would offer support for curl 😃

In the meantime, the upstream fork at NIVANorge (https://github.com/NIVANorge/chemspiderapi) has support for AppVeyor, CircleCi, Travis CI, and CodeCov. I assume it's fair game to change the repo address in the OP?

sckott commented 4 years ago

yeah curl is a great choice. jeroen is considering integration for webmockr (and therefore vcr) https://github.com/jeroen/curl/pull/174 but not sure if it will happen

RaoulWolf commented 4 years ago

Unfortunately I did not get crul to work for all possible functionalities; I have thus decided to stick with curl.

Otherwise I was able to bump up the test coverage to over 40% without using the actual API query functionalities 👍 I'm still eager to improve coverage, but I'm not sure how to at this moment...? Any help/recommendation is highly appreciated!

I will also update the OP to link to the upstream repository (https://github.com/NIVANorge/chemspiderapi).

noamross commented 4 years ago

There's also https://github.com/nealrichardson/httptest

maelle commented 4 years ago

:wave: @RaoulWolf! any update?

maelle commented 4 years ago

:wave: @RaoulWolf! any update?

maelle commented 4 years ago

@RaoulWolf, you can ask any question reg testing on https://discuss.ropensci.org :-)

noamross commented 4 years ago

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.

Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

stitam commented 4 years ago

Thank you @RaoulWolf for your work on chemspiderapi! I am one of the maintainers of webchem. I would like to note that since September 2019 webchem can also access the new ChemSpider web service (https://github.com/ropensci/webchem/issues/149), so I think there might be a package overlap issue (https://devguide.ropensci.org/policies.html#overlap). We would be very happy if this could be resolved, please contact us if you wish to discuss it in more detail.

annakrystalli commented 4 years ago

⚠️⚠️⚠️⚠️⚠️ In the interest of reducing load on reviewers and editors as we manage the
COVID-19 crisis, rOpenSci new submissions for software peer review are paused.

In this period new submissions will not be handled, nor new reviewers assigned.
Reviews and responses to reviews will be handled on a 'best effort' basis, but
no follow-up reminders will be sent. Other rOpenSci community activities continue.

Please check back here again after 25 May when we will be announcing plans to slowly start back up.

We express our continued great appreciation for the work of our authors and reviewers.
Stay healthy and take care of one another.

The rOpenSci Editorial Board ⚠️⚠️⚠️⚠️⚠️

geanders commented 4 years ago

@RaoulWolf: I'm taking a look through to catch up on any package reviews that have stalled, and I see that we are waiting to get an update from you regarding increased code coverage for tests in this package before we proceed. Our policy is to close package review issues one year after the author's last answer, so I wanted to check to see if you had any updates?

maelle commented 3 years ago

It's been more than one year after the author's last answer, so closing.

RaoulWolf commented 3 years ago

@maelle Sincerest apologies for the late reply - I haven't received any notifications regarding this submission until now! All answers since November 2019 are new to me... Oh my!

I am currently on vacation but I'm still very keen on pursuing chemspiderapi further. We have been using it routinely in our work, and last time I checked it provided more functionality than what was (is?) available over at webchem.

I haven't checked the progress with regard to possible API testing (as I haven't received notifications... again apologies!), but if you'd be willing to re-open this submission I'd be happy to do so.

maelle commented 3 years ago

Oh, in this case yep I'll re-open the issue!

stitam commented 3 years ago

Hi @RaoulWolf, I am a maintainer of webchem. Please note webchem is in heavy development, we have had three releases in the past 12 months including v1.0.0. The ChemSpider API has been fixed for about a year now and I think your package has significant overlap with webchem, only webchem provides access to other webservices as well. A paper using the fixed ChemSpider API in webchem has recently been published (https://www.jstatsoft.org/article/view/v093i13). Happy to discuss.

maelle commented 3 years ago

:wave: @RaoulWolf! Seeing that webchem wraps the ChemSpider API again I am a bit wary of overlap. Is there anything webchem does not support?

RaoulWolf commented 3 years ago

@maelle thank you for re-opening the issue! Appreciated

@stitam very valid point. I'll get back to this in about a week from now. Thanks for the heads-up! (and for the record, I'm a webchem fan 😉)

maelle commented 3 years ago

:wave: @RaoulWolf

RaoulWolf commented 3 years ago

Lengthy post ahead, so watch out!

I took some time to compare what webchem offers with what is offered in chemspiderapi. The first step to address the potentially big overlap is to compare the available functionality, starting with everything ChemSpider offers (https://developer.rsc.org/docs/compounds-v1-trial/1/overview), and comparing the availability between chemspiderapi and webchem (I tried to be as thorough as possible, but please forgive me if I overlooked functionalities within webchem!):

FILTERING

| ChemSpider compound API | chemspiderapi wrapper | webchem wrapper | Description |
| --- | --- | --- | --- |
| filter-element-post | post_element() | ? | Search based on an element |
| filter-formula-batch-post | post_formula_batch() | ? | Batch search based on formulas |
| filter-formula-batch-queryId-results-get | get_formula_batch_queryId_results() | ? | Results for formula batch search |
| filter-formula-batch-queryId-status-get | get_formula_batch_queryId_status() | ? | Status for formula batch search |
| filter-formula-post | post_formula() | cs_formula_csid() | Search based on formula |
| filter-inchi-post | post_inchi() | cs_inchi_csid() | Search based on InChI string |
| filter-inchikey-post | post_inchikey() | cs_inchikey_csid() | Search based on InChIKey |
| filter-intrinsicproperty-post | post_intrinsicproperty() | ? | Search based on intrinsic property |
| filter-mass-batch-post | post_mass_batch() | ? | Batch search based on masses |
| filter-mass-batch-queryId-results-get | get_mass_batch_queryId_results() | ? | Results for mass batch search |
| filter-mass-batch-queryId-status-get | get_mass_batch_queryId_status() | ? | Status for mass batch search |
| filter-mass-post | post_mass() | ? | Search based on mass |
| filter-name-post | post_name() | cs_name_csid() | Search based on name |
| filter-queryId-results-get | get_queryId_results() | cs_query_csid() | Results for standard search |
| filter-queryId-results-sdf-get | get_queryId_results_sdf() | ? | SDF results for standard search |
| filter-queryId-status-get | get_queryId_status() | ? | Status for standard search |
| filter-smiles-post | post_smiles() | cs_smiles_csid() | Search based on SMILES string |

LOOKUPS

| ChemSpider compound API | chemspiderapi wrapper | webchem wrapper | Description |
| --- | --- | --- | --- |
| lookups-datasources-get | get_datasources() | cs_datasources() | A list of all available data sources |

RECORDS

| ChemSpider compound API | chemspiderapi wrapper | webchem wrapper | Description |
| --- | --- | --- | --- |
| records-batch-post | post_batch() | ? | Data for multiple ChemSpider IDs |
| records-recordId-details-get | get_recordId_details() | cs_compinfo()/cs_extcompinfo() | Data for a ChemSpider ID |
| records-recordId-externalreferences-get | get_recordId_externalreferences() | ? | External references for a ChemSpider ID |
| records-recordId-image-get | get_recordId_image() | cs_img() | PNG image for a ChemSpider ID |
| records-recordId-mol-get | get_recordId_mol() | ? | MOL file for a ChemSpider ID |

TOOLS

| ChemSpider compound API | chemspiderapi wrapper | webchem wrapper | Description |
| --- | --- | --- | --- |
| tools-convert-post | post_convert() | cs_convert()/cs_convert_multiple() | Conversion between chemical annotations |
| tools-validate-inchikey-post | post_validate_inchikey() | ? | Validation of an InChIKey |

While there (quite unsurprisingly) is overlap, chemspiderapi offers direct access to all API functionalities of ChemSpider.

Another major difference between the solution offered by chemspiderapi and webchem is the workflow. It's really two different philosophies in my view, so there's no right or wrong. chemspiderapi is very explicit (and maybe lengthy) in every step, but "forces" users to familiarize themselves with the desired API use workflow of ChemSpider; additionally, chemspiderapi offers several vignettes to guide users on how to store their API token(s), rate-limit queries, memoise queries, and save .mol, .sdf, or .png files. webchem has an elegant "under the hood" approach to the ChemSpider API functionalities it offers.
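To illustrate what memoising a query means in this context, here is a minimal, dependency-free sketch in base R. It is not taken from the chemspiderapi vignettes (which, judging by the Suggests field, rely on {memoise}); `memoise_query()` and `fake_query()` are hypothetical names, and the "query" is a stand-in that never touches the network.

```r
# Wrap a query function so that repeated calls with the same key reuse the
# stored result instead of spending another API call.
memoise_query <- function(query_fn) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    key_chr <- as.character(key)
    if (!exists(key_chr, envir = cache, inherits = FALSE)) {
      assign(key_chr, query_fn(key), envir = cache)
    }
    get(key_chr, envir = cache, inherits = FALSE)
  }
}

# Hypothetical stand-in for a real query; it only counts how often it runs.
n_calls <- 0
fake_query <- function(record_id) {
  n_calls <<- n_calls + 1
  paste("details for record", record_id)
}

cached_query <- memoise_query(fake_query)
first <- cached_query(2157)
second <- cached_query(2157)  # served from the cache, no second "API call"
```

After both calls `n_calls` is still 1, which is the point when every real query counts against a quota.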

Maybe less interesting, but also worth noting, is the unsurprising difference in dependencies. chemspiderapi has two dependencies, and webchem has 13 dependencies. chemspiderapi was designed to have as few dependencies as possible, and both dependencies (curl and jsonlite) are well-established and introduce no other dependencies.

While for most users the functionalities within webchem are likely enough, I still think chemspiderapi provides a more "total" ChemSpider API experience, for anyone wishing to go down this rabbit hole.

Let's discuss!

EDIT I forgot to mention that chemspiderapi also has 13 different checking functions to validate inputs (and in some cases outputs), to avoid unnecessary queries against the API. I personally found this extremely useful, e.g., when not wasting my quota on 5'000 queries because I accidentally used the wrong column of a data.frame as input.
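As a rough idea of what such a pre-flight check can look like, the sketch below validates the shape of a standard InChIKey (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, one uppercase letter) before any request would be sent. The function name `check_inchikey_sketch()` and the error messages are assumptions for illustration and are not the package's actual checking functions.

```r
# Reject anything that is not a single, well-formed InChIKey string.
check_inchikey_sketch <- function(x) {
  if (!is.character(x) || length(x) != 1L || is.na(x)) {
    stop("The InChIKey must be a single character string.", call. = FALSE)
  }
  pattern <- "^[A-Z]{14}-[A-Z]{10}-[A-Z]$"
  if (!grepl(pattern, x)) {
    stop("The input does not look like a standard InChIKey.", call. = FALSE)
  }
  invisible(TRUE)
}

# Aspirin's InChIKey passes the check.
check_inchikey_sketch("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")

# A value from the wrong data.frame column fails before any quota is spent.
result <- tryCatch(
  check_inchikey_sketch(5000),
  error = function(e) conditionMessage(e)
)
```

Failing locally like this is cheap, whereas a malformed request would still count against the API quota.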

EDIT 2 Added additional columns with a very brief description of each functionality.

maelle commented 3 years ago

Thanks a lot for your detailed answer! Could you please add a column or sentences explaining to humans what e.g. records-batch-post does i.e. the type of functionalities your package adds? I have the impression your package allows for posting information, what are the use cases for that? And what about the functionalities it adds that have nothing to do with posting?

Thanks again!

RaoulWolf commented 3 years ago

Right, apologies for not providing enough context!

The different post_*() and get_*() functionalities are named so from the HTTP methods (POST and GET, respectively).

In simple terms, all post_*() functions "upload" information into ChemSpider's API services (and usually get a query ID in return), and all get_*() functionalities "download" information from ChemSpider's API services.

In the example case you mentioned (records-batch-post), a list of up to 100 ChemSpider record IDs is POST-ed to ChemSpider's API services, and a query ID is returned.
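Because the batch endpoint accepts at most 100 record IDs per request, a longer list has to be split first. The following base R sketch shows one way to do that; `chunk_record_ids()` is a hypothetical helper, not a function of chemspiderapi, and it only prepares the batches without sending anything.

```r
# Split a vector of record IDs into batches of at most `max_per_request`.
chunk_record_ids <- function(record_ids, max_per_request = 100L) {
  stopifnot(length(record_ids) > 0L, max_per_request >= 1L)
  group <- ceiling(seq_along(record_ids) / max_per_request)
  unname(split(record_ids, group))
}

# 250 IDs become three batches: two full ones and a remainder.
batches <- chunk_record_ids(1:250)
batch_sizes <- lengths(batches)
```

Each element of `batches` could then be handed to a single POST request.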

I hope this answers your question!

maelle commented 3 years ago

Thanks, and in what kinds of workflows do you need to post information?

@stitam could you please confirm the question marks in the comparison table, i.e. that webchem does not provide that functionality?

RaoulWolf commented 3 years ago

The principal workflows for the filtering functionalities all follow the same pattern:

This is also mentioned in the README of chemspiderapi. The lookups, records and tools functionalities work directly, i.e., without the (manual) three step process.
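Going by the endpoint names in the comparison table (a `post` call, a `status` call, and a `results` call), the manual three step process can be sketched as a small polling loop. This is only an illustration: `run_filter_query()` and the three stand-in functions are hypothetical, the real chemspiderapi signatures may differ, and the status strings "Processing" and "Complete" are assumptions.

```r
# Generic three-step pattern: submit, poll until complete, fetch results.
run_filter_query <- function(post_fn, status_fn, results_fn,
                             max_tries = 10L, wait_seconds = 1) {
  query_id <- post_fn()                    # step 1: POST the query, get an ID
  for (i in seq_len(max_tries)) {
    status <- status_fn(query_id)          # step 2: GET the query status
    if (identical(status, "Complete")) {
      return(results_fn(query_id))         # step 3: GET the results
    }
    Sys.sleep(wait_seconds)
  }
  stop("Query ", query_id, " did not complete after ", max_tries, " tries.")
}

# Stand-ins that mimic an API which needs three status polls to finish.
polls <- 0
fake_post <- function() "abc-123"
fake_status <- function(query_id) {
  polls <<- polls + 1
  if (polls < 3) "Processing" else "Complete"
}
fake_results <- function(query_id) c(2157L, 5234L)

csids <- run_filter_query(fake_post, fake_status, fake_results,
                          wait_seconds = 0)
```

Passing the three steps in as functions keeps the loop testable without a network connection or an API key.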

stitam commented 3 years ago

Hi All! @RaoulWolf thanks for your detailed answer to @maelle's questions.

@RaoulWolf is right in saying that the two packages follow different philosophies and support different types of workflows. My understanding is that chemspiderapi aims to develop an R function for each ChemSpider API. On the other hand, webchem focuses on the user experience, and aims to distance the user from the API, so instead of being "process" focused it is more "outcome" focused. I agree with @RaoulWolf in that there is no good or bad, these are two different approaches.

It follows from the difference in philosophies that webchem doesn't implement all APIs. The beginning of a webchem workflow is that a user is looking for a "specific" compound or set of compounds. Therefore we skipped a few APIs like filter-mass-post where the user would specify a molecular weight range and the API (actually 2-3 APIs called after each other) would return a list of CSIDs for compounds with molecular weights falling in that range. At this point we didn't find a good enough use case for that functionality and our users haven't asked for it either.

We also didn't implement functions that process batch requests. We do see the benefit of batch requests, but not all requests have a batch alternative at the moment, so it was difficult to bind them all into a simple and easy to use user facing function. We did implement the non-batch alternatives for these queries however. ChemSpider APIs seem to be under development so once batch is available for all the queries we need (it's actually filter-name-batch-post we are missing) then we'll add the batch options as well to make our queries more efficient.

Finally a few notes on the table above, in terms of user experience, in webchem all ChemSpider APIs that ultimately return CSIDs are actually pooled into get_csid(), that is the exported function. Also get_queryId_status is called within the functions to query the status before requesting the response itself. Other than these, yes @maelle, I can confirm the question marks, those APIs are currently not implemented in webchem so there is no overlap there at the moment.

jooolia commented 3 years ago

Editor checks:

Editor comments:

Hi @RaoulWolf

Thank you for all of the comments and answers to our questions that you have provided. We will move ahead with the onboarding process for this package. I will be the editor handling the process going forward.

── GP chemspiderapi 

It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 41% of code lines are covered by test cases.

    R/FILTERING-get_formula_batch_queryId_results.R:29:NA
    R/FILTERING-get_formula_batch_queryId_results.R:31:NA
    R/FILTERING-get_formula_batch_queryId_results.R:33:NA
    R/FILTERING-get_formula_batch_queryId_results.R:35:NA
    R/FILTERING-get_formula_batch_queryId_results.R:37:NA
    ... and 414 more lines

- [x]  add a "BugReports" field to description

✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines shorter than 80 characters

R/CHECKING-check_apikey.R:8:1
R/CHECKING-check_apikey.R:12:1
R/CHECKING-check_apikey.R:16:1
R/CHECKING-check_complexity.R:8:1
R/CHECKING-check_elements.R:12:1
... and 683 more lines

- [ ] check code line length (where possible)
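For reference, the line-length finding can be reproduced locally with a few lines of base R. The sketch below is not how goodpractice computes its report; `find_long_lines()` is a hypothetical helper that simply lists lines wider than 80 characters in one file.

```r
# Report the line numbers and widths of lines longer than `max_width`.
find_long_lines <- function(path, max_width = 80L) {
  lines <- readLines(path, warn = FALSE)
  too_long <- which(nchar(lines, type = "chars") > max_width)
  data.frame(line = too_long, width = nchar(lines[too_long], type = "chars"))
}

# Demonstration on a temporary file with one 81-character line.
example_file <- tempfile(fileext = ".R")
writeLines(c("x <- 1", strrep("a", 81), "y <- 2"), example_file)
long_lines <- find_long_lines(example_file)
```

Applied over `list.files("R", full.names = TRUE)`, the same helper would give a per-file overview.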

I will look for package reviewers once these issues have been addressed. Please let me know if you have any questions or if I can clarify anything. 

Thanks! Julia

RaoulWolf commented 3 years ago

Hi @jooolia, thanks for the update! I'll try to fix the easier issues (bug reports and code line length) this week, and tackle code coverage next week. I haven't checked presser yet, thanks for the heads up! I'll let you know once the issues are addressed.

Thanks for the effort! Raoul

RaoulWolf commented 3 years ago

The BugReports field was added to the description, the README now mentions {webchem}, and I tried to minimize the line lengths as much as possible.

All changes are live at https://github.com/NIVANorge/chemspiderapi

Code coverage hasn't been addressed yet.

jooolia commented 3 years ago

Thanks for keeping us posted on your updates @RaoulWolf!

RaoulWolf commented 3 years ago

A small update regarding test coverage.

I tried implementing API tests with both {httptest} and {vcr}, but with no luck. {vcr} does not seem to support {curl} just yet (the introduction hints at future support), and {httptest} simply runs the actual queries without mocking.

I am now taking a look at {presser} and will let you know how far I get.

jooolia commented 3 years ago

Thanks for the update @RaoulWolf. If you get really stuck let us know.

maelle commented 3 years ago

{presser} supports curl indeed and that's the only HTTP testing package that does. :-)

RaoulWolf commented 3 years ago

Small update: I finally managed to wrap my head around {presser} and the first tests are running (see here, line 77). I'll now try to extend the testing functionality to the other functions. Baby steps 😃

Meanwhile, I've also added GitHub Actions as CI (trying to substitute CircleCI and Travis CI). Unfortunately the {curl} requirement keeps breaking the installation on Linux machines. My initial approach was adding the following chunk to R-CMD-check.yaml:

      - name: Install libcurl
        if: runner.os == 'Linux'
        run: |
          sudo apt-get install -y libcurl4-openssl-dev

The test then fails when trying to install {chemspiderapi}:

Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/home/runner/work/_temp/Library/curl/libs/curl.so':
  /usr/lib/x86_64-linux-gnu/libcurl.so.4: version `CURL_OPENSSL_3' not found (required by /home/runner/work/_temp/Library/curl/libs/curl.so)

I also tried with libcurl4-gnutls-dev, but no luck either. Any tips warmly welcome!

maelle commented 3 years ago

Untested tip, I'd look at what other curl dependencies have in their workflows e.g. https://github.com/r-lib/httr/blob/cb4e20c9e0b38c0c020a8756db8db7a882288eaf/.github/workflows/R-CMD-check.yaml#L60

RaoulWolf commented 3 years ago

Turns out the "trick" is to not use Ubuntu 20.xx at all, but good ol' 16.04. Now all tests pass 👍

jooolia commented 3 years ago

Thanks for keeping us updated @RaoulWolf. Have you run into any other roadblocks or is it looking like {presser} will help you with the testing?

RaoulWolf commented 3 years ago

Hi @jooolia, it first looked like {presser} would help, and I can definitely set up mock APIs with it (hurray!). But I still struggle to increase coverage. I suppose it's because the API request itself happens within {chemspiderapi} functions, but I'm not sure. An example can be seen here, line 77. The tests pass, but the coverage is not increased, which is slightly frustrating. On the positive side I now have over 500 tests for the package that pass 😅

For the CI I was wondering if GitHub actions would be enough? I guess there's no need for Travis, AppVeyor or CircleCI?

maelle commented 3 years ago

Commenting on this again, hope it's fine. The presser package now has a function with which you don't need to do setup/teardown

jooolia commented 3 years ago

Hi @RaoulWolf, Yes the CI using GitHub actions should be sufficient as you have it implemented.

I am also trying to wrap my head around webfakes (previously presser, now at https://r-lib.github.io/webfakes/), so I cannot help a lot with that aspect (but I will try, as I am curious about how to do this sort of testing). However, one thing I see when I run covr::report() is that many arguments in the functions are not exercised by the tests, so those lines are never run and thus not covered. The same is true of the API testing: since none of the functions from the package are called in the mock API tests, the coverage does not increase.

I will try to look a bit more at webfakes over the next few days and see if I can help.

@maelle your inputs and wisdom are always welcome. :)

jooolia commented 3 years ago

(Just putting this here for reference, there is a nice book by @maelle about http testing: https://books.ropensci.org/http-testing/packages-for-http-testing.html)

maelle commented 3 years ago

Thanks @jooolia :blue_heart: Demo of webfakes at https://github.com/ropensci-books/http-testing/pull/47 with demos of other packages.

So it seems the low coverage is not a webfakes problem, correct? I'm happy to help if needed.

jooolia commented 3 years ago

Wonderful @maelle! I think the demos will be very helpful! Thanks for pointing us to this material.

RaoulWolf commented 3 years ago

Hi @jooolia and @maelle, very cool to see {presser} become {webfakes}! I'll keep an eye out on the book 😃

I'll try a few more variants to increase coverage with {webfakes} over the weekend; maybe there's a way I haven't thought about...

RaoulWolf commented 3 years ago

I have now tried several other ways to write tests so they can run with "mock" APIs using {webfakes}, but I simply cannot find a way to emulate a (hardcoded URL) API from outside the function which defines the API itself.

All other tests work well, but this seems very complicated to put in place at this point. I looked into {webchem}'s approach to this, and it seems they - apart from the very reasonable skip_on_cran() and skip_on_ci() controls - actually run real queries.

Given my current subscription with ChemSpider I'd very much like to avoid running a lot of queries every time the package is updated/revised.

How do we proceed now? I'm very much open to giving {webfakes} another try with some help, but otherwise I seem to be stuck in a dead-end when it comes to increasing test coverage.
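One common way around a hardcoded URL is to let the package read its base URL from a setting that tests can override, so that the package's own functions run (and count towards coverage) while the request goes to a local mock server such as one provided by {webfakes}. The sketch below shows only the URL part in base R. The environment variable `CHEMSPIDER_BASE_URL`, the helper names, and the default URL and path are assumptions for illustration, not the package's current code.

```r
# Base URL used for all requests; tests can override it via an env variable.
chemspider_base_url <- function() {
  Sys.getenv("CHEMSPIDER_BASE_URL", unset = "https://api.rsc.org/compounds/v1")
}

# Hypothetical URL builder that a query function would call internally.
build_details_url <- function(record_id) {
  paste0(chemspider_base_url(), "/records/", record_id, "/details")
}

# Normal use: the request would go to the real service.
default_url <- build_details_url(2157)

# In a test: point the same code at a local mock server, then clean up.
Sys.setenv(CHEMSPIDER_BASE_URL = "http://127.0.0.1:3000")
mock_url <- build_details_url(2157)
Sys.unsetenv("CHEMSPIDER_BASE_URL")
```

With this indirection the test calls the exported function as a user would, and only the destination of the request changes.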