openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
697 stars 36 forks source link

[REVIEW]: datacleanbot: an automated data cleaning tool #1608

Closed whedon closed 3 years ago

whedon commented 5 years ago

Submitting author: @Ji-Zhang (Ji Zhang) Repository: https://github.com/Ji-Zhang/datacleanbot Version: v0.4 Editor: @arfon Reviewers: @kellieotto, @jjmcnelis Archive: Pending

Status

status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737"><img src="http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737/status.svg)](http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@kellieotto and @jjmcnelis , please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @usethedata know.

Please try and complete your review in the next two weeks

Review checklist for @kellieotto

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @jjmcnelis

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 5 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @kellieotto it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 5 years ago
Attempting PDF compilation. Reticulating splines etc...
whedon commented 5 years ago

:point_right: Check article proof :page_facing_up: :point_left:

usethedata commented 4 years ago

@kellieotto Just checking in (I've been off-line working a major proposal and getting my house on the market to sell). Any questions?

kellieotto commented 4 years ago

Thanks for checking in. I plan to get to it this week, no questions!

On Sun, Aug 4, 2019 at 3:40 PM Bruce Wilson notifications@github.com wrote:

@kellieotto https://github.com/kellieotto Just checking in (I've been off-line working a major proposal and getting my house on the market to sell). Any questions?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/1608?email_source=notifications&email_token=AB46QQV6EBRA6UTMAIWEKH3QC5LGPA5CNFSM4IHXWRN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3QLC3Y#issuecomment-518041967, or mute the thread https://github.com/notifications/unsubscribe-auth/AB46QQWTLG5ITOXFYSBE2HLQC5LGPANCNFSM4IHXWRNQ .

-- Kellie Ottoboni Ph.D. Statistics '19, University of California, Berkeley Fellow at Berkeley Institute for Data Science

Mobile: (650) 520-5056 Website: www.kellieottoboni.com

kellieotto commented 4 years ago

I've created issues in the repo for the problems I encountered. The biggest issue is that I'm not able to get the examples to run. Maybe I'm doing something silly on my end, but I think there's a dependency management issue here.

Ji-Zhang commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago
Attempting PDF compilation. Reticulating splines etc...
whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

usethedata commented 4 years ago

@whedon add @jjmcnelis as reviewer

whedon commented 4 years ago

OK, @jjmcnelis is now a reviewer

jjmcnelis commented 4 years ago

I apologize for the delay. I will complete my initial review by this Wednesday.

EDIT (2019-08-28) @Ji-Zhang I don't have experience working with several of datacleanbot's dependencies (I'll ask @usethedata to clarify if this disqualifies me as a reviewer). These hangups could be familiar for Python users in your domain. So I'll wait for your comment before opening issues on the following points:

  1. Getting started in fresh Python 3 environments:

    • I don't have R installed on my Windows system. Pip exits with an error after it fails to find my R path. Is rpy2 essential to the core functionality of datacleanbot?
    • I successfully installed with pip in a virtual environment on ubuntu. The dependencies of the bayesian submodule aren't covered in datacleanbot's setup.py. Is this deliberate? I can't work through the examples on the readthedocs page without manually installing.
  2. After applying the multiple imputation approach, the autoclean example (https://datacleanbot.readthedocs.io/en/latest/Example_autoclean.html) attempts to unpickle a remote file located inside your github repository:

metalearner = joblib.load(urlopen("https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/metalearner_rf.pkl?raw=true"))

I haven't dug into your code, but this seems like a questionable practice (predict_best_anomaly_algorithm: line 1033 in https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/dataclean.py). Is it common for machine learning software to access remote resources like this?

There's another piece to this that I'd like you to clarify. From my understanding of the pickle module, pickled object(s) must be unpickled in a Python environment with compatible versions of the dependent software. E.g. In the autoclean example, I couldn't advance past the routine above (predict_best_anomaly_algorithm) because my version of sklearn is incompatible:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1344, in autoclean
    Xy_cleaned = handle_outlier(features_new, Xy_filled)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1285, in handle_outlier
    best = predict_best_anomaly_algorithm(X, y)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1049, in predict_best_anomaly_algorithm
    metalearner = joblib.load(urlopen("https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/metalearner_rf.pkl?raw=true"))
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 588, in load
    obj = _unpickle(fobj)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python3.6/pickle.py", line 1392, in find_class
    return getattr(sys.modules[module], name)
AttributeError: module 'sklearn.utils.deprecation' has no attribute 'DeprecationDict'

Please correct me if I've misinterpreted the traceback.

  1. setup.py indicates that datacleanbot is intended for use with Python 3. Have you tested or otherwise confirmed incompatibility with Python 2? I don't see any mention this on the readthedocs page.

Overall, I think this will be a good fit for JOSS. Looking forward to your comment.

kthyng commented 4 years ago

Hi @Ji-Zhang how are things going on your end for your JOSS submission?

Ji-Zhang commented 4 years ago

Hi, I am sorry I missed @jjmcnelis 's comments. I am kind of snowed under recently. I will work on that next week and also make the unit test available. Sorry for the delay.

Ji-Zhang commented 4 years ago

Hi @jjmcnelis I apologize for the delay to answer your questions.

The following are my answers:

  1. Getting started in fresh Python 3 environments:

    • rpy2 is related to the data type discovery functionality, which is based on the Bayesian model. Without rpy2, the Bayesian model can not run successfully.
    • Only openml is recommended to be installed manually. The others are not deliberate. Could you please let me know which submodels you need to install manually? Thanks!
  2. The remote file metalearner is a pre-trained machine learning model. The prediction of the optimal algorithm involves this model. However, the pickle file metalearner cannot be packaged together through pip. So I took this way around. The sklearn incompatible problem is related to GridSearchCV which I used to train the metalearner model. GridSearch had a DeprecationDict in it, but that doesn't exist in sklearn any more. I tried to retrain the model with the newest sklearn version but it doesn't seem work.

  3. Some dependencies of datacleanbot only support Python3 so I assume Python2 won't work. I will add this note to the document.

Please feel free to let me know if you have more questions. Again, sorry for the delay.

kthyng commented 4 years ago

@jjmcnelis How has @Ji-Zhang responded to your concerns?

usethedata commented 4 years ago

I've pinged @jjmcnelis out of band.

jjmcnelis commented 4 years ago

@kthyng @Ji-Zhang I'm sorry for holding you up. Thanks for the thoughtful responses to my questions. I'm satisfied with points 1 and 2. Point 3 needs addressing.

  1. I raised the point about rpy2 in the interest of usability. I don't use R often enough to justify having it installed on my workstation. I'd look into implementing the functionality provided by rpy2 in Python in a future release so that your software will be usable by the widest audience. But this isn't a JOSS requirement as far as I know.

  2. Since my initial review I've come across a few other examples of machine learning resources retrieved remotely when a model runs, so I guess it's reasonably common.

  3. Dependency issues are many, on my end. I haven't been able to run the examples in three separate attempts, on three separate machines. And I get a number of different tracebacks that bear no resemblance to each other. Did you hear from @kellieotto about their dependency issues (#issuecomment-519314651)?

usethedata commented 4 years ago

@Ji-Zhang what are your thoughts about the question in point 3 above?

Ji-Zhang commented 4 years ago

@usethedata I am working on simplifying the package such that the dependencies can become not this heavy. The light version should have been released before the weekend.

Ji-Zhang commented 4 years ago

@jjmcnelis @kellieotto I have deprecated the feature of detecting feature data type using the Bayesian method as too many dependencies there. I created a new clean virtual environment to test it and it works well. I hope it can work on your end successfully as well . Thank you.

danielskatz commented 4 years ago

Hi - As the JOSS Associate Editor-in-Chief on duty this week, I'm going through some submissions with little recent action, and I can't quite tell what is going on in this submission. @usethedata - can you help me understand what is the next action?

arfon commented 4 years ago

@whedon assign @arfon as editor

arfon commented 4 years ago

:wave: all, I'm going to step in here and try and wrap up this submission on behalf of @usethedata.

arfon commented 4 years ago

:wave: @kellieotto, @jjmcnelis, in their most recent comment it sounds like @Ji-Zhang has simplified some of the requirements here.

Do you think you could both take another look at this software again and confirm you're able to install it now? If at all possible, I'd love to get this review wrapped up in the next couple of weeks.

kellieotto commented 4 years ago

@arfon Sorry for dropping the ball on this! I just went ahead and tried installing again, and ran into another dependency issue (see linked issue)

arfon commented 4 years ago

@arfon Sorry for dropping the ball on this! I just went ahead and tried installing again, and ran into another dependency issue (see linked issue)

Thanks @kellieotto. @Ji-Zhang - can you see if you can figure out what's going on in https://github.com/Ji-Zhang/datacleanbot/issues/10 ?

Ji-Zhang commented 4 years ago

Hi @arfon and @kellieotto , thank you for spotting the bug. This bug is now fixed. Please update to the latest version of datacleanbot and feel free to let me know if any questions. Thank you in advance.

arfon commented 4 years ago

:wave: @kellieotto - please come and take another look at this submission when you get a chance.

kellieotto commented 4 years ago

@arfon I followed up with @Ji-Zhang in Ji-Zhang/datacleanbot#10. Still having import issues.

arfon commented 4 years ago

@arfon I followed up with @Ji-Zhang in Ji-Zhang/datacleanbot#10. Still having import issues.

OK, thanks for checking back in on this.

arfon commented 4 years ago

Dear authors and reviewers

We wanted to notify you that in light of the current COVID-19 pandemic, JOSS has decided to suspend submission of new manuscripts and to handle existing manuscripts (such as this one) on a "best efforts basis". We understand that you may need to attend to more pressing issues than completing a review or updating a repository in response to a review. If this is the case, a quick note indicating that you need to put a "pause" on your involvement with a review would be appreciated but is not required.

Thanks in advance for your understanding.

Arfon Smith, Editor in Chief, on behalf of the JOSS editorial team.

arfon commented 4 years ago

:wave: @kellieotto, @jjmcnelis - how are you getting along with your reviews? I realize these are challenging times for many, so please don't take this enquiry as anything other than a friendly enquiry to see how things are going here.

@kellieotto - it looks like @Ji-Zhang responded to your dependency issues in https://github.com/Ji-Zhang/datacleanbot/issues/10 .

kellieotto commented 4 years ago

@arfon thanks for the reminder. I followed up with another dependency error in that issue.

arfon commented 4 years ago

👋 @Ji-Zhang - I think the next action is on you here to help @kellieotto with their dependency issues.

Ji-Zhang commented 4 years ago

@Ji-Zhang - I think the next action is on you here to help @kellieotto with their dependency issues.

Thank you for your kind reminder. Working on it.

kellieotto commented 4 years ago

@arfon I just followed up in the issue with @Ji-Zhang -- still having the same error as before.

Ji-Zhang commented 4 years ago

@kellieotto Sorry for this inconvenience. Could you try it again, please?

labarba commented 3 years ago

👋 hi everybody!

Looks like this review was progressing, a dependency issue was flagged by @kellieotto, and @Ji-Zhang made some changes, and is asking for the reviewer to try again. Could we have an update from those involved? If you need time, for whatever reason, just let use know—of course we understand that times are tough.

kellieotto commented 3 years ago

@labarba @Ji-Zhang I think I should step down as reviewer here. I realize my turnaround time has been very slow. I'm not sure whether it's the package or my installation of Python + associated packages, but I just can't get it to install and build correctly. I'm very sorry for the time wasted.

Ji-Zhang commented 3 years ago

Hi @kellieotto , I think it is because some dependent packages are outdated. I fixed it by updating the setup file. Run pip install datacleanbot in a newly created environment, the example should work. Thanks for all the time you have devoted so far.

arfon commented 3 years ago

Hi @Ji-Zhang. This review has been going on for a very long time (more than 1 year) and both reviewers have struggled with this software. As a result, I've just taken a deeper look at this submission and I have to say that I think this software is still well below the standard that is allowable in JOSS. For example:

Given the age of this submission, and the state of this software currently I'm going to reject this submission. If you are able to substantially improve the quality of this software then we would welcome a resubmission to JOSS at some point in the future.

@kellieotto, @jjmcnelis - thanks for your time here. I'm sorry your work here didn't result in a paper for the author.

arfon commented 3 years ago

@whedon reject

whedon commented 3 years ago

Paper rejected.