[REVIEW]: datacleanbot: an automated data cleaning tool

whedon commented 5 years ago

Submitting author: @Ji-Zhang (Ji Zhang)
Repository: https://github.com/Ji-Zhang/datacleanbot
Version: v0.4
Editor: @arfon
Reviewers: @kellieotto, @jjmcnelis
Archive: Pending

Status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737"><img src="http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737/status.svg)](http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@kellieotto and @jjmcnelis, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist, please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @usethedata know.

✨ Please try and complete your review in the next two weeks ✨

Review checklist for @kellieotto

Conflict of interest

[x] As the reviewer I confirm that I have read the JOSS conflict of interest policy and that there are no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: Does the release version given match the GitHub release (v0.4)?
[x] Authorship: Has the submitting author (@Ji-Zhang) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[ ] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

Review checklist for @jjmcnelis

Conflict of interest

[x] As the reviewer I confirm that I have read the JOSS conflict of interest policy and that there are no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: Does the release version given match the GitHub release (v0.4)?
[x] Authorship: Has the submitting author (@Ji-Zhang) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[ ] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[ ] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

whedon commented 5 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @kellieotto it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:


You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications


For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 5 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 5 years ago

:point_right: Check article proof :page_facing_up: :point_left:

usethedata commented 4 years ago

@kellieotto Just checking in (I've been off-line working a major proposal and getting my house on the market to sell). Any questions?

kellieotto commented 4 years ago

Thanks for checking in. I plan to get to it this week, no questions!


kellieotto commented 4 years ago

I've created issues in the repo for the problems I encountered. The biggest issue is that I'm not able to get the examples to run. Maybe I'm doing something silly on my end, but I think there's a dependency management issue here.

Ji-Zhang commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

Attempting PDF compilation. Reticulating splines etc...

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

usethedata commented 4 years ago

@whedon add @jjmcnelis as reviewer

whedon commented 4 years ago

OK, @jjmcnelis is now a reviewer

jjmcnelis commented 4 years ago

I apologize for the delay. I will complete my initial review by this Wednesday.

EDIT (2019-08-28) @Ji-Zhang I don't have experience working with several of datacleanbot's dependencies (I'll ask @usethedata to clarify if this disqualifies me as a reviewer). These hangups could be familiar for Python users in your domain. So I'll wait for your comment before opening issues on the following points:

Getting started in fresh Python 3 environments:
- I don't have R installed on my Windows system. Pip exits with an error after it fails to find my R path. Is rpy2 essential to the core functionality of datacleanbot?
- I successfully installed with pip in a virtual environment on ubuntu. The dependencies of the bayesian submodule aren't covered in datacleanbot's setup.py. Is this deliberate? I can't work through the examples on the readthedocs page without manually installing.
After applying the multiple imputation approach, the autoclean example (https://datacleanbot.readthedocs.io/en/latest/Example_autoclean.html) attempts to unpickle a remote file located inside your github repository:

metalearner = joblib.load(urlopen("https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/metalearner_rf.pkl?raw=true"))

I haven't dug into your code, but this seems like a questionable practice (predict_best_anomaly_algorithm: line 1033 in https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/dataclean.py). Is it common for machine learning software to access remote resources like this?

There's another piece to this that I'd like you to clarify. From my understanding of the pickle module, pickled object(s) must be unpickled in a Python environment with compatible versions of the dependent software. E.g. In the autoclean example, I couldn't advance past the routine above (predict_best_anomaly_algorithm) because my version of sklearn is incompatible:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1344, in autoclean
    Xy_cleaned = handle_outlier(features_new, Xy_filled)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1285, in handle_outlier
    best = predict_best_anomaly_algorithm(X, y)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1049, in predict_best_anomaly_algorithm
    metalearner = joblib.load(urlopen("https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/metalearner_rf.pkl?raw=true"))
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 588, in load
    obj = _unpickle(fobj)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python3.6/pickle.py", line 1392, in find_class
    return getattr(sys.modules[module], name)
AttributeError: module 'sklearn.utils.deprecation' has no attribute 'DeprecationDict'

Please correct me if I've misinterpreted the traceback.
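The compatibility concern above can be illustrated with a small sketch. It stores the version of the library that produced a pickled object next to the object itself, and refuses to unpickle the payload when the installed version differs. The helper names `dump_with_version` and `load_checked`, the stand-in model dictionary, and the version strings are all hypothetical; in practice the version would come from something like `sklearn.__version__`. Note also that unpickling data fetched from a remote location can execute arbitrary code, so a check like this addresses compatibility only, not trust.

```python
import io
import pickle


def dump_with_version(obj, fileobj, lib_version):
    """Write obj together with the version of the library that produced it."""
    bundle = {
        "lib_version": lib_version,
        # Pickle the payload separately so it stays opaque bytes until
        # the version check has passed.
        "payload": pickle.dumps(obj),
    }
    pickle.dump(bundle, fileobj)


def load_checked(fileobj, current_version):
    """Unpickle the payload only if the recorded version matches."""
    bundle = pickle.load(fileobj)
    if bundle["lib_version"] != current_version:
        raise RuntimeError(
            "model was saved with version %s but %s is installed"
            % (bundle["lib_version"], current_version)
        )
    return pickle.loads(bundle["payload"])


# Round trip with a stand-in object and hypothetical version strings.
buffer = io.BytesIO()
dump_with_version({"n_estimators": 100}, buffer, lib_version="0.20.3")
buffer.seek(0)
model = load_checked(buffer, current_version="0.20.3")
```

With a check of this kind, a version mismatch would surface as one explicit error message instead of an AttributeError deep inside the unpickler.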

setup.py indicates that datacleanbot is intended for use with Python 3. Have you tested or otherwise confirmed incompatibility with Python 2? I don't see any mention of this on the readthedocs page.

Overall, I think this will be a good fit for JOSS. Looking forward to your comment.

kthyng commented 4 years ago

Hi @Ji-Zhang how are things going on your end for your JOSS submission?

Ji-Zhang commented 4 years ago

Hi, I am sorry I missed @jjmcnelis's comments. I have been snowed under recently. I will work on them next week and also make the unit tests available. Sorry for the delay.

Ji-Zhang commented 4 years ago

Hi @jjmcnelis, I apologize for the delay in answering your questions.

The following are my answers:

Getting started in fresh Python 3 environments:
- rpy2 is needed for the data type discovery functionality, which is based on the Bayesian model. Without rpy2, the Bayesian model cannot run.
- Only openml is meant to be installed manually. The other omissions are not deliberate. Could you please let me know which submodules you needed to install manually? Thanks!
The remote file metalearner is a pre-trained machine learning model, which is used to predict the optimal algorithm. However, the metalearner pickle file cannot be packaged together with the rest of the code through pip, so I used this workaround. The sklearn incompatibility is related to GridSearchCV, which I used to train the metalearner model. GridSearchCV had a DeprecationDict in it, but that no longer exists in sklearn. I tried to retrain the model with the newest sklearn version, but that doesn't seem to work.
Some dependencies of datacleanbot only support Python 3, so I assume Python 2 won't work. I will add a note on this to the documentation.

Please feel free to let me know if you have more questions. Again, sorry for the delay.
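For reference, setuptools does offer a `package_data` option for shipping non-code files inside a package, and a `python_requires` option for declaring supported Python versions. The sketch below shows how the points discussed above might be expressed. It is an illustration under assumptions, not the project's actual setup.py: the keyword values are hypothetical, and the file is assumed to live inside the `datacleanbot` package directory, as the URL in the traceback suggests.

```python
import pkgutil

# Hypothetical keyword arguments that a setup.py could pass on as
# setuptools.setup(**SETUP_KWARGS). Values are illustrative only.
SETUP_KWARGS = {
    "name": "datacleanbot",
    "packages": ["datacleanbot"],
    # Ship the pre-trained model file inside the installed package.
    "package_data": {"datacleanbot": ["metalearner_rf.pkl"]},
    # Declare Python 3 only, so pip refuses to install on Python 2.
    "python_requires": ">=3",
    # Ideally pinned to the scikit-learn version used to train the model.
    "install_requires": ["scikit-learn"],
}


def read_bundled_model_bytes():
    """Return the raw bytes of the bundled model file.

    Works only once the package has been installed with the file included;
    pkgutil.get_data returns None if the package cannot be located.
    """
    return pkgutil.get_data("datacleanbot", "metalearner_rf.pkl")
```

Bundling the file this way would remove the network access at prediction time, at the cost of a larger package. It would not by itself resolve the sklearn version mismatch.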

kthyng commented 4 years ago

@jjmcnelis How has @Ji-Zhang responded to your concerns?

usethedata commented 4 years ago

I've pinged @jjmcnelis out of band.

jjmcnelis commented 4 years ago

@kthyng @Ji-Zhang I'm sorry for holding you up. Thanks for the thoughtful responses to my questions. I'm satisfied with points 1 and 2. Point 3 needs addressing.

1. I raised the point about rpy2 in the interest of usability. I don't use R often enough to justify having it installed on my workstation. I'd look into implementing the functionality provided by rpy2 in Python in a future release so that your software will be usable by the widest audience. But this isn't a JOSS requirement as far as I know.
2. Since my initial review I've come across a few other examples of machine learning resources retrieved remotely when a model runs, so I guess it's reasonably common.
3. Dependency issues are many, on my end. I haven't been able to run the examples in three separate attempts, on three separate machines. And I get a number of different tracebacks that bear no resemblance to each other. Did you hear from @kellieotto about their dependency issues (#issuecomment-519314651)?
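A lighter-weight alternative to a full reimplementation of the rpy2 functionality would be to treat rpy2 as an optional dependency and enable the Bayesian feature only when it is present. The sketch below is one possible shape for that, not how datacleanbot is organized: the function names and feature names are hypothetical, and only `importlib.util.find_spec` is a real standard-library call.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_features():
    """List the features usable in the current environment.

    The feature names are hypothetical placeholders.
    """
    features = ["missing_values", "outliers"]
    # Offer the R-backed feature only when rpy2 is importable.
    if is_installed("rpy2"):
        features.append("bayesian_type_discovery")
    return features
```

Users without R would then still be able to install and run the remaining features.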

usethedata commented 4 years ago

@Ji-Zhang what are your thoughts about the question in point 3 above?

Ji-Zhang commented 4 years ago

@usethedata I am working on simplifying the package so that its dependencies are less heavy. The light version should be released before the weekend.

Ji-Zhang commented 4 years ago

@jjmcnelis @kellieotto I have deprecated the feature that detects feature data types using the Bayesian method, as it required too many dependencies. I created a new, clean virtual environment to test the package and it works well. I hope it works on your end as well. Thank you.

danielskatz commented 4 years ago

Hi - As the JOSS Associate Editor-in-Chief on duty this week, I'm going through some submissions with little recent action, and I can't quite tell what is going on in this submission. @usethedata - can you help me understand what is the next action?