Closed whedon closed 4 years ago
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @jessicaaustin, @evanleeturner it looks like you're currently assigned to review this paper :tada:.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that with GitHub's default behaviour you will receive notifications (emails) for all reviews 😿
To fix this do the following two things:
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Reference check summary:
OK DOIs
- 10.1016/j.mio.2014.09.001 is OK
MISSING DOIs
- None
INVALID DOIs
- http://dx.doi.org/10.13155/33951 is INVALID because of 'https://doi.org/' prefix
Hi everyone. It's three weeks since the start of the review. Would you have any update for me? Thanks.
I'll just do a specific ping to reviewers @jessicaaustin and @evanleeturner. When do you think you'll be able to work on your reviews? Thanks.
I am still working on this review, thank you.
Thanks for the reminder. I have read all the instructions and made some progress on my review before getting pulled into other tasks. I will pick it up again this week and submit by the end of the week.
I do not believe this is a conflict of interest as outlined in the JOSS guidelines, but I do want to state that I am a maintainer of a similar open-source QC library, https://github.com/ioos/ioos_qc .
I agree 100% with the author's philosophy that QC procedures should be organized and distributed as code packages in order to be widely and consistently used. I also agree with the goal of creating a framework that can run tests from multiple different QC test schemes (GTSPP, ARGO, QARTOD, etc) since none of those schemes alone will completely satisfy testing requirements in all situations. I think the integration with oceansdb to run Climatology tests and Location at Sea tests is particularly exciting. I hope publication in JOSS will bring this library to the attention of more groups and encourage further use and collaboration to flesh this out.
The library takes in data as a numpy array, so in theory the data could come from any source or format. However, you must create a "data object" to wrap your data. Given that there are other mature and well-used libraries with data representations (xarray, pandas, netcdf, etc), why not leverage one of those? Or at least make it easy to use CoTeDe if your data is already in one of those common formats. I created a notebook to try to understand how to run QC tests using the library (https://github.com/castelao/CoTeDe/pull/42), and I was able to fairly easily load a netcdf file using xarray and run tests, but I got an error when using a pandas dataframe and it was not clear how to debug it. I think a little documentation could go a long way here, to help people get started.
Items marked BLOCKER below should be resolved before acceptance. Everything else is a recommendation.
Paper:
Documentation:
Installation
Contributing
Performance
@jessicaaustin , thank you for your review. I'll respond to it ASAP.
I do not have a conflict of interest in this review. I do maintain an internal (non-public) code repository, @wdft-coastal, that contains some functions similar to those presented in the current code, but they are specific to my dataset and organization.
I am extremely pleased with the premise of this package for organizing QC procedures and being able to run multiple tests. This is a high priority item for my organization and my current data sets, as our own QAQC methods often can conflict with other agencies that follow different QAQC procedures. In the last few months I have been preparing code that can mimic QARTOD style tests on our own data internally, but I wanted the ability to apply alternative test schemes and to skip tests that are not appropriate for our data. Having a maintained package that accomplishes this task is not only helpful, but it also greatly increases data confidence, since the QC methods are now maintained by other groups through a transparent mechanism. I fully expect to be implementing this package into our own website and data products at TWDB.
I share reviewer one’s sentiment about the philosophy of leveraging other data libraries such as netCDF or pandas. Although our own internal datasets are SQL we do make use of pandas where possible for data manipulation and in the future we will likely be forced to move to netCDF to process very large datasets for QAQC. This functionality is not a show-stopper for the package, but it is a severe limitation for its acceptance in the community.
Items marked BLOCKER below should be resolved before acceptance. Everything else is a recommendation.
Paper
Documentation:
Installation
Contributing
Performance
@evanleeturner, thanks for the review. I already started to work on it.
Since you mentioned a potential interest in using it yourself, you might appreciate how you can customize CoTeDe's configuration. In summary, you can create a setup of tests to apply and save it for later use. I do that a lot myself, so different regions or sensors have their own sequences of procedures to apply.
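The save-and-reuse workflow described above can be sketched as follows. The variable names, test names, and thresholds here are purely illustrative assumptions and do not reflect CoTeDe's exact configuration schema — check CoTeDe's documentation for the real format:

```python
import json

# Hypothetical QC setup: the test names and thresholds below are
# illustrative, not CoTeDe's exact schema.
qc_setup = {
    "sea_water_temperature": {
        "global_range": {"minval": -2.5, "maxval": 40.0},
        "spike": {"threshold": 2.0},
    }
}

# Save the setup so the same sequence of tests can be reused later,
# e.g. one file per region or per sensor type.
with open("my_region_qc.json", "w") as f:
    json.dump(qc_setup, f, indent=2)

# Reload it when processing data from that region.
with open("my_region_qc.json") as f:
    cfg = json.load(f)
```

One file per region or sensor keeps each pipeline's sequence of procedures explicit and versionable.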
@whedon generate pdf
Hi @evanleeturner , you might be able to edit the top post, and if so, could you click on the review items that you agree with, please? That would help me to write my response. Thanks!
Response to @jessicaaustin
Conflict of interest
I do not believe this is a conflict of interest as outlined in the JOSS guidelines, but I do want to state that I am a maintainer of a similar open-source QC library, https://github.com/ioos/ioos_qc .
Overall thoughts:
I agree 100% with the author's philosophy that QC procedures should be organized and distributed as code packages in order to be widely and consistently used. I also agree with the goal of creating a framework that can run tests from multiple different QC test schemes (GTSPP, ARGO, QARTOD, etc) since none of those schemes alone will completely satisfy testing requirements in all situations. I think the integration with oceansdb to run Climatology tests and Location at Sea tests is particularly exciting. I hope publication in JOSS will bring this library to the attention of more groups and encourage further use and collaboration to flesh this out.
Thanks! As a side note, the package OceansDB was originally bundled with CoTeDe as a single package (since the first generation, ~2006), but as CoTeDe grew larger and I realized that some people could use OceansDB outside CoTeDe’s scope, I split it out as an independent package (~2015).
The library takes in data as a numpy array, so in theory the data could come from any source or format. However you must create a "data object" to wrap your data. Given that there are other mature and well-used libraries with data representations (xarray, pandas, netcdf, etc), why not leverage one of those? Or at least make it easy to use CoTeDe if your data is already in one of those common formats. I created a notebook to try to understand how to run QC tests using the library ( castelao/CoTeDe#42), and I was able to fairly easily load a netcdf file using xarray and run tests, but I got an error when using a pandas dataframe and it was not clear how to debug it. I think a little documentation could go a long way here, to help people get started.
Thanks for sharing your opinion on using xarray or pandas. It’s now clear to me that I need to clarify this in the documentation. This will be addressed at https://cotede.readthedocs.io/en/dev/data_model.html
In summary, I believe that the data model adopted by CoTeDe does not limit the users, but the other way around: it gives more freedom. The package xarray, which was initially based on the netCDF structure, satisfies the data model expected by CoTeDe, thus most of CoTeDe’s functionalities can be used without any change, see the example. Since xarray can read netCDF directly and pandas objects can be converted into xarray objects, all those inputs can be used by CoTeDe, plus any other format that satisfies CoTeDe’s minimalist data model. CoTeDe has been used with SQL databases and MatLab-based pipelines, just two examples where fewer requirements make a difference.
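To illustrate how small such a minimalist data model can be, the sketch below builds a duck-typed profile object in plain Python. The attribute and method names here are assumptions for illustration only; CoTeDe's data-model documentation defines the actual protocol it expects:

```python
# A minimal duck-typed "data object": anything exposing variable access
# by name plus a metadata mapping. Any source -- netCDF, SQL, MatLab
# export -- can be wrapped this way without extra dependencies.
class MinimalProfile:
    def __init__(self, data, attrs=None):
        self.data = data              # e.g. {"PRES": [...], "TEMP": [...]}
        self.attrs = attrs or {}      # e.g. {"latitude": ..., "datetime": ...}

    def __getitem__(self, key):
        return self.data[key]

    def keys(self):
        return self.data.keys()

profile = MinimalProfile(
    {"PRES": [2.0, 6.0, 10.0], "TEMP": [25.3, 25.2, 25.1]},
    attrs={"latitude": -38.0, "longitude": -62.5},
)
```

An xarray `Dataset` already satisfies this kind of mapping-plus-attrs interface, which is why it can typically be passed through without wrapping.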
Review
Items marked BLOCKER below should be resolved before acceptance. Everything else is a recommendation.
Paper:
* Overall this paper is clear and easy to read, and does a good job of describing the motivation and overall functionality of the software
* **Highly recommended:** It would be great to list similar automated QC packages and describe how CoTeDe compares. For example, CoTeDe is more generic than ioos_qc, which depends on pandas/xarray and assumes netCDF integration, and CoTeDe allows integration with a wide range of QC test suites, whereas ioos_qc (currently) only focuses on QARTOD.
* I do understand that this takes time, so I don't want it to block acceptance here. But this is the first question I had when I looked at this library, and would love to see it in the README someday.
I see your point and I will consider that. The brief comparison that you suggested above could be a good start, but what should be said about the similarities between the two packages? For instance, what should we say about the QC configuration, for which ioos_qc seems to have followed quite a similar approach to the one proposed by CoTeDe? Would you agree? On a side note, since this is a highly recommended suggestion, I was expecting that ioos_qc would adopt the same policy, but I could not find any reference to CoTeDe in the ioos_qc documentation.
* Highly recommended, if possible: Include a list of groups actively using this library, to give us an idea of how widely used it is. (this could also be added to the code documentation)
Thanks, that’s a good idea. However, I believe that the choice to be listed belongs to the users. I’ll create a designated space for that on CoTeDe’s website and encourage the users to add themselves to the list.
* Recommended: Describe how to pronounce CoTeDe (though I suppose leaving some amount of mystery is the python way :-) )
At some point I removed that from the documentation, but I’ll place it back.
Documentation:
* **BLOCKER:** The important classes and most of the test methods have docstrings, which is great. But I was not able to find any generated API docs under https://cotede.readthedocs.io/ . [castelao/CoTeDe#41](https://github.com/castelao/CoTeDe/issues/41)
I have expanded the documentation (https://cotede.readthedocs.io/) to include the API reference for the public resources, i.e. everything that a new user would need to know to use CoTeDe. I’m also expanding the notebooks with practical examples to better illustrate how to use the package. I intend to extend the API reference with internal resources that might be useful for developers (https://cotede.readthedocs.io/en/latest/api.html).
Installation
* I was able to install and use the software using the instructions in the repo
* Nice to have: [castelao/CoTeDe#39](https://github.com/castelao/CoTeDe/issues/39)
Thanks for the suggestion. I understand the benefits of conda-forge and I’ll definitely add CoTeDe to it in the future, but it is not the priority at the moment. I noticed that ioos_qc uses pip in its notebooks, which is yet another example of how pip is convenient and functional.
Contributing
* I was able to install dependencies and run tests locally using the instructions in the repo
* Recommended: [castelao/CoTeDe#38](https://github.com/castelao/CoTeDe/pull/38)
* Recommended: [castelao/CoTeDe#40](https://github.com/castelao/CoTeDe/issues/40)
I removed the recommendation on flake8. I do not want to turn away any potential contribution just due to formatting. I’ve been cleaning, refactoring, and formatting the code and I will continue doing so.
Performance
* I did not find any mention of performance testing or guarantees in the code or documentation
* Highly recommended you add performance tests. During development of https://github.com/ioos/ioos_qc we found that small tweaks to our testing algorithms could lead to severely degraded performance on large datasets. Recommend you add these tests to the travis-ci build and add it to the checklist of items to review if someone submits a PR.
Thank you, that is a very good idea. I've always assumed that by using Python it was implied that speed was not a priority, but there is nothing to lose in having an idea of the relative cost among the different tests. I’ll implement that at the first opportunity. Note that one of the consistency tests guarantees that the ProfileQC object is serializable, so that it can be transported by multiprocessing. On large-scale databases I have been running CoTeDe in parallel, a functionality that I would like to move to the public repository at the first opportunity.
As a note, when I read "performance" for a QC procedure, the first thing that comes to my mind is not speed but the skill of identifying bad measurements without mistakes. For that, I’ve been adopting consistency tests since the early stages of CoTeDe.
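One common way to quantify that notion of skill is a confusion-matrix summary of flags against reference labels. The sketch below is a generic illustration, not code from CoTeDe:

```python
def qc_skill(flagged, truth):
    """Compare QC flags against reference labels (True = bad value).

    Returns the true-positive rate (fraction of bad values caught)
    and the false-positive rate (fraction of good values wrongly
    flagged) -- a simple summary of a QC test's skill.
    """
    tp = sum(f and t for f, t in zip(flagged, truth))
    fp = sum(f and not t for f, t in zip(flagged, truth))
    n_bad = sum(truth)
    n_good = len(truth) - n_bad
    return tp / n_bad, fp / n_good

# Toy example: 2 truly bad points out of 6; the test catches both
# but also flags one good point.
flagged = [False, True, False, True, True, False]
truth   = [False, True, False, False, True, False]
tpr, fpr = qc_skill(flagged, truth)  # -> (1.0, 0.25)
```

Tracking these two rates over reference datasets is one way to make "performance as skill" a testable, regression-checked property.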
Response to @evanleeturner
Conflict of interest
I do not have a conflict of interest in this review. I do maintain an internal (non-public) code repository, @wdft-coastal, that contains some functions similar to those presented in the current code, but they are specific to my dataset and organization.
Overall thoughts:
I am extremely pleased with the premise of this package for organizing QC procedures and being able to run multiple tests. This is a high priority item for my organization and my current data sets, as our own QAQC methods often can conflict with other agencies that follow different QAQC procedures. In the last few months I have been preparing code that can mimic QARTOD style tests on our own data internally, but I wanted the ability to apply alternative test schemes and to skip tests that are not appropriate for our data. Having a maintained package that accomplishes this task is not only helpful, but it also greatly increases data confidence, since the QC methods are now maintained by other groups through a transparent mechanism. I fully expect to be implementing this package into our own website and data products at TWDB.
Thanks, I’m glad to hear that we have the same vision on shared public QC procedures and the importance of giving the operator the freedom to decide which tests to apply, including the tuning of each test. Please don’t hesitate to open issues with requests or discussions that could help you implement CoTeDe in your project. My vision for the QC configuration was for cases exactly like yours. I’ll improve the documentation on that subject.
I share reviewer one’s sentiment about the philosophy of leveraging other data libraries such as netCDF or pandas. Although our own internal datasets are SQL we do make use of pandas where possible for data manipulation and in the future we will likely be forced to move to netCDF to process very large datasets for QAQC. This functionality is not a show-stopper for the package, but it is a severe limitation for its acceptance in the community.
Thanks for sharing your opinion on using netCDF or pandas. It’s now clear to me that I need to clarify this in the documentation. This will be addressed at https://cotede.readthedocs.io/en/dev/data_model.html In summary, I believe that the data model adopted by CoTeDe does not limit the users, but the other way around: it gives more freedom. The package xarray, which was initially based on the netCDF structure, satisfies the data model expected by CoTeDe, thus most of CoTeDe’s functionalities can be used without any change, see the example. Since xarray can read netCDF directly and pandas objects can be converted into xarray objects, all those inputs can be used by CoTeDe as it is now, plus any other format that satisfies CoTeDe’s minimalist data model. I also use CoTeDe on a PostgreSQL DB with more than a TB of data, and instead of pandas.read_sql I use psycopg2 directly. I know that CoTeDe has also been used in a MatLab data pipeline, another example where fewer dependencies facilitate its use, especially for MatLab users without Python experience. I believe that this is a case where less is more, and for CoTeDe there is no need to force dependencies on more packages.
Review
Items marked BLOCKER below should be resolved before acceptance. Everything else is a recommendation.
Paper
* The paper itself was succinct, well-written, and very appropriate to the code.
* **Recommended:** Please consider adding the functionality of swapping methods for calculating seawater: [castelao/CoTeDe#43](https://github.com/castelao/CoTeDe/issues/43)
Issue #43 raises a very good point. We will certainly find derived variables (salinity, density, sound speed, etc.) estimated with different standards (especially in historical data), which may affect inter-comparisons. That is indeed important, but it is beyond CoTeDe’s goal, which is to evaluate the quality of a dataset. Currently, the only place where I use GSW is to estimate sea water density, when density is not directly available, to evaluate inversions in the water column. For that purpose, I argue that we should use the best and most recent technique. I do see your point that in some situations it might be useful, if not necessary, to employ the old relations for a correct comparison, but I think that this should be done outside CoTeDe, by the operator, and then used as input for CoTeDe. Note that I’m strongly considering adopting your suggestion in another package, PySeabird, which is meant to parse and estimate derived variables from CTDs.
* **Recommended:** I mirror reviewer one in that I don’t know how to pronounce CoTeDe! 😊
At some point, I removed that from the documentation, but I’ll put it back.
Documentation:
* **BLOCKER**: Please also complete the issue that was brought by reviewer one concerning the API docs as this would also be extremely helpful for me to implement this code: https://cotede.readthedocs.io/ . [castelao/CoTeDe#41](https://github.com/castelao/CoTeDe/issues/41)
I have expanded the documentation (https://cotede.readthedocs.io/) to include the API reference for the public resources, i.e. everything that a new user would need to know to use CoTeDe. I’m also expanding the notebooks with practical examples to better illustrate how to use the package. I intend to extend the API reference with internal resources, which might be useful for developers (https://cotede.readthedocs.io/en/dev/api.html).
Installation
* I was able to get the package running.
Contributing
* I was able to run the tests.
Performance
* There were no performance tests or statements regarding performance.
* I did not have an issue with performance in my own tests, but I also have very small datasets at the moment. Additionally, if I were to run this package in a production environment, the performance hit would be negated because our data would be checked as it is inserted into our database, which would be a heftier cost of CPU cycles. If I were going to run these tests on the entire database, I would do so offline or during non-peak hours on the production systems in any case.
Thanks for raising that point. I’m adding a section about performance to the documentation (https://cotede.readthedocs.io/en/dev/performance.html). I agree with you: QC in the insertion phase of the pipeline could cause a bottleneck and propagate serious issues. Something I have done before was to implement a parallel process that would run the QC on the data already inserted in the DB, on demand, triggered by new inserts. The system that I developed for AOML, back in 2006, loaded the Python QC tests as a PostgreSQL procedural language, so it was an internal procedure with an internal trigger, per profile (again, easier if it requires fewer external packages). That probably results in the best speed for an SQL DB, but it is not necessarily the easiest approach to maintain.
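The decoupled, trigger-style design described above can be sketched in a few lines of plain Python. The queue, the function names, and the placeholder range check are all illustrative assumptions, not CoTeDe or PostgreSQL code:

```python
from queue import Queue

# Sketch of decoupling QC from ingestion: inserts only enqueue profile
# ids, and a separate worker runs QC on already-stored data, so the
# insert path never blocks on QC.
pending = Queue()
database = {}          # stands in for the real DB
qc_flags = {}

def insert_profile(profile_id, values):
    """Fast path: store the data and schedule QC, nothing more."""
    database[profile_id] = values
    pending.put(profile_id)

def qc_worker():
    """Slow path: drain the queue and flag each stored profile."""
    while not pending.empty():
        pid = pending.get()
        values = database[pid]
        # Trivial placeholder range check in place of a full QC suite.
        qc_flags[pid] = [(-2.5 <= v <= 40.0) for v in values]

insert_profile("p1", [25.3, 25.2, 99.9])
qc_worker()  # qc_flags["p1"] -> [True, True, False]
```

In a real deployment the worker would run as a separate process (or as an in-database procedure with a trigger, as described above), but the split between a fast insert path and an asynchronous QC path is the same.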
@kthyng , could you confirm the next step, please? Is it correct that the reviewers need to checkmark each box in the very first post before we move to the next step? Thanks!
:wave: Hey @castelao...
Letting you know, @kthyng is currently OOO until Sunday, March 15th 2020. :heart:
So I can't seem to edit the first post. When I click on the ... my only options are "copy link" and "quote reply"
@whedon re-invite @evanleeturner as reviewer
The reviewer already has a pending invite.
@evanleeturner please accept the invite by clicking this link: https://github.com/openjournals/joss-reviews/invitations
I just did. It said the invitation has expired.
👋 @evanleeturner - you shouldn't edit the first comment, but you should check boxes in it. If you have comments, add them as new comments
@openjournals/dev - can you check on the issue @evanleeturner is having?
It won't let me change the box when I click it - nothing happens.
that's a permission problem since you are not in the group - @openjournals/dev will need to fix it since the re-invite didn't work for some reason
@danielskatz , thanks!
Dear authors and reviewers
We wanted to notify you that in light of the current COVID-19 pandemic, JOSS has decided to suspend submission of new manuscripts and to handle existing manuscripts (such as this one) on a "best efforts basis". We understand that you may need to attend to more pressing issues than completing a review or updating a repository in response to a review. If this is the case, a quick note indicating that you need to put a "pause" on your involvement with a review would be appreciated but is not required.
Thanks in advance for your understanding.
Arfon Smith, Editor in Chief, on behalf of the JOSS editorial team.
@whedon re-invite @evanleeturner as reviewer
OK, the reviewer has been re-invited.
@evanleeturner please accept the invite by clicking this link: https://github.com/openjournals/joss-reviews/invitations
note that I tried that a few hours ago and it didn't seem to work...
Whedon's invite functionality is definitely creating the invite correctly:
@evanleeturner - are you sure you're logged in with the correct GitHub account when clicking the link?
@castelao
Thanks for sharing your opinion on using xarray or pandas. It’s now clear to me that I need to clarify this in the documentation. This will be addressed at https://cotede.readthedocs.io/en/dev/data_model.html
In summary, I believe that the data model adopted by CoTeDe does not limit the users, but the other way around: it gives more freedom. The package xarray, which was initially based on the netCDF structure, satisfies the data model expected by CoTeDe, thus most of CoTeDe’s functionalities can be used without any change, see the example. Since xarray can read netCDF directly and pandas objects can be converted into xarray objects, all those inputs can be used by CoTeDe, plus any other format that satisfies CoTeDe’s minimalist data model. CoTeDe has been used with SQL databases and MatLab-based pipelines, just two examples where fewer requirements make a difference.
Yes, I agree that CoTeDe's model is flexible and I don't think you should change it. All I am saying is that many people will come into this with their data model already in xarray/pandas/netcdf/etc and having some example code to copy/paste to get them started quickly will be a huge help. Probably some unit tests too, to be safe. Basically, since those data models are widely used in the python community, it should be seamless to use CoTeDe if your data is already in that format. Sorry I don't think I made this clear in my initial review.
- Highly recommended: It would be great to list similar automated QC packages and describe how CoTeDe compares. For example, CoTeDe is more generic than ioos_qc, which depends on pandas/xarray and assumes netCDF integration, and CoTeDe allows integration with a wide range of QC test suites, whereas ioos_qc (currently) only focuses on QARTOD.
- I do understand that this takes time, so I don't want it to block acceptance here. But this is the first question I had when I looked at this library, and would love to see it in the README someday.
I see your point and I will consider that. The brief comparison that you suggested above could be a good start, but what should be said about the similarities between the two packages? For instance, what should we say about the QC configuration, for which ioos_qc seems to have followed quite a similar approach to the one proposed by CoTeDe? Would you agree? On a side note, since this is a highly recommended suggestion, I was expecting that ioos_qc would adopt the same policy, but I could not find any reference to CoTeDe in the ioos_qc documentation.
Yes, like I said, this is an involved process and cannot be done quickly. Maybe you could put a placeholder in your documentation and fill it in as you (or other contributors) get to it? Even just a list of other packages you think are comparable would be useful.
You're right, ioos_qc does not do this, but one of the review points above for submission to JOSS mentions, "State of the field: Do the authors describe how this software compares to other commonly-used packages?" so that's where this is coming from.
Documentation:
* **BLOCKER:** The important classes and most of the test methods have docstrings, which is great. But I was not able to find any generated API docs under https://cotede.readthedocs.io/ . [castelao/CoTeDe#41](https://github.com/castelao/CoTeDe/issues/41)
I have expanded the documentation (https://cotede.readthedocs.io/) to include the API reference for the public resources, i.e. everything that a new user would need to know to use CoTeDe. I’m also expanding the notebooks with practical examples to better illustrate how to use the package. I intend to extend the API reference with internal resources that might be useful for developers (https://cotede.readthedocs.io/en/latest/api.html).
https://cotede.readthedocs.io/en/latest/api.html does not work for me. Is this still in progress?
- Highly recommended you add performance tests. During development of https://github.com/ioos/ioos_qc we found that small tweaks to our testing algorithms could lead to severely degraded performance on large datasets. Recommend you add these tests to the travis-ci build and add it to the checklist of items to review if someone submits a PR.
Thank you, that is a very good idea. I've always assumed that by using Python it was implied that speed was not a priority
IMHO people aren't going to be too worried about a fraction of second difference, but if it suddenly goes from <1s to evaluate a dataset to 20s, you will want to know what commit caused that regression. This is exactly the scenario that happened with ioos_qc, and it's this sort of catastrophic issue that I'm talking about.
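A minimal regression guard of the kind described here could look like the sketch below. The toy spike test and the generous time budget are assumptions for illustration; the point is catching order-of-magnitude slowdowns in CI, not micro-benchmarking:

```python
import timeit

def spike_check(values, threshold=2.0):
    """Toy spike test standing in for a real QC routine: flag a point
    when it deviates too far from the mean of its two neighbours."""
    flags = [False] * len(values)
    for i in range(1, len(values) - 1):
        spike = abs(values[i] - (values[i - 1] + values[i + 1]) / 2.0)
        flags[i] = spike > threshold
    return flags

data = [float(i % 7) for i in range(10_000)]
elapsed = timeit.timeit(lambda: spike_check(data), number=10)

# Guard against order-of-magnitude regressions, not micro-variations:
# the budget is deliberately generous so CI noise does not trip it.
assert elapsed < 5.0, f"spike_check regressed: {elapsed:.2f}s for 10 runs"
```

Running this as part of the test suite (e.g. in the travis-ci build) would flag the kind of sub-second-to-20s regression described above at the offending commit.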
@jessicaaustin , thanks for the fast response. Yes, my bad. The latest version of the documentation is on the dev branch:
Ok great thanks. I'm going to check off the rest of the items on my checklist; at this point I think this is ready to accept.
Looks like I was able to check off the first items...
I am ready to accept this paper!
Thanks everyone so much for your patience! Everything is crazy crazy here for Covid-19 and I'm working remote today.. so want to get this accepted and done while I am online!
Looks like I was able to check off the first items...
@evanleeturner - thanks for the update. Just verifying that you've completed your review? If so, could you update the checklist accordingly (if permissions are fixed for you now).
Yes, I am finished. When I click one box it sits there and locks the others - so just keeping diligent here and checking all of them as they open up for me.
* Highly recommended, if possible: Include a list of groups actively using this library, to give us an idea of how widely used it is. (this could also be added to the code documentation)
Thanks, that’s a good idea. However, I believe that the choice to be listed belongs to the users. I’ll create a designated space for that on CoTeDe’s website and encourage the users to add themselves to the list.
As listed on the JOSS website, authors should in their paper: "Mention (if applicable) a representative set of past or ongoing research projects using the software and recent scholarly publications enabled by it."
@castelao I see that the boxes are checked off and both reviewers have stated that they are ready to accept. Please address my note immediately above.
Also, there are two PRs that are open above. Can you and @jessicaaustin state if there are any lingering issues associated with them?
So long as https://github.com/castelao/CoTeDe/issues/41 is resolved and those API docs are available at https://cotede.readthedocs.io/en/latest/ , I'm ok with accepting. My other PRs/issues are nice-to-have.
@whedon generate pdf
Thanks to everyone involved in this process, I do appreciate your time on this.
@kthyng I believe that I addressed everything. What is the next step? Shall I do a new release and obtain a new DOI? Thanks,
@whedon check references
Reference check summary:
OK DOIs
- 10.13155/33951 is OK
- 10.1016/j.mio.2014.09.001 is OK
MISSING DOIs
- None
INVALID DOIs
- None
@castelao Your references need some tweaks. I put some in here which you can merge if you approve. However, I see that the "note" category from "manual" entries is not coming through. Can you play around with these entries to get that information in?
@whedon check references
Reference check summary:
OK DOIs
- 10.13155/33951 is OK
- 10.1016/j.mio.2014.09.001 is OK
MISSING DOIs
- None
INVALID DOIs
- None
Submitting author: @castelao (Guilherme Castelao)
Repository: https://github.com/castelao/CoTeDe
Version: v0.21.3
Editor: @kthyng
Reviewers: @jessicaaustin, @evanleeturner
Archive: 10.5281/zenodo.3733959
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@jessicaaustin & @evanleeturner, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @kthyng know.
✨ Please try and complete your review in the next two weeks ✨
Review checklist for @jessicaaustin
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
Review checklist for @evanleeturner
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper