openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
694 stars 36 forks source link

[REVIEW]: sandpyper: A Python package for UAV-SfM beach volumetric and behavioural analysis #3666

Closed whedon closed 2 years ago

whedon commented 2 years ago

Submitting author: @npucino (Nicolas Pucino) Repository: https://github.com/npucino/sandpyper Version: v1.3.3 Editor: @crvernon Reviewer: @dbuscombe-usgs, @chrisleaman Archive: 10.5281/zenodo.5565487

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/ff00f33cd9152b1834727ffb8be4122a"><img src="https://joss.theoj.org/papers/ff00f33cd9152b1834727ffb8be4122a/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/ff00f33cd9152b1834727ffb8be4122a/status.svg)](https://joss.theoj.org/papers/ff00f33cd9152b1834727ffb8be4122a)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@dbuscombe-usgs & @chrisleaman, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @crvernon know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Review checklist for @dbuscombe-usgs

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @chrisleaman

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 2 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @dbuscombe-usgs, @chrisleaman it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1111/0031-868x.00152 is OK
- 10.1371/journal.pone.0118571 is OK
- 10.1890/10-1510.1 is OK
- 10.1038/s41598-021-83477-6 is OK
- 10.1016/j.earscirev.2017.04.011 is OK
- 10.1016/j.ecolecon.2006.10.022 is OK
- 10.1016/j.ocecoaman.2018.01.001 is OK
- 10.1016/j.isprsjprs.2015.02.009 is OK
- 10.21105/joss.01890 is OK

MISSING DOIs

- None

INVALID DOIs

- None
whedon commented 2 years ago

Wordcount for paper.md is 1349

whedon commented 2 years ago
Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.95 s (41.0 files/s, 54866.5 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           6           1709           1175           4856
Markdown                        17            285              0            655
Jupyter Notebook                 7              0          42186            625
TeX                              1              9              0            404
YAML                             6             25             10            191
JavaScript                       1              1              0             19
HTML                             1              2              0              9
-------------------------------------------------------------------------------
SUM:                            39           2031          43371           6759
-------------------------------------------------------------------------------

Statistical information for the repository 'a855e74a3b68ce3aa0fa37a1' was
gathered on 2021/08/28.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Nicolas Pucino                 187         18307          11042           94.22
npucino                         17          1078            724            5.78

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Nicolas Pucino             7760           42.4          0.4                5.66
whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

crvernon commented 2 years ago

@dbuscombe-usgs @chrisleaman @npucino :wave: the review takes place in this issue. Thanks!

crvernon commented 2 years ago

Also, please don't forget to add a link to this review issue in any issues or pull requests you may generate in the https://github.com/npucino/sandpyper repository. This will help everyone have a single point of reference.

npucino commented 2 years ago

Thank you guys for your time! Cheers

crvernon commented 2 years ago

:mega: Mid-week rally! Just checking in to see how things are going @dbuscombe-usgs and @chrisleaman ? Don't hesitate to reach out if you have any questions.

👏 Keep up the good work!

chrisleaman commented 2 years ago

Thanks @crvernon, I've had a read of the paper and just summarizing my thoughts. Also planning to run the package on some of my own data to test it out 😄

dbuscombe-usgs commented 2 years ago

@crvernon @npucino I am having a lot of difficulty installing this package using the provided instructions. I have a detailed issue here

I have a lot of experience with conda and my perspective is that is this a suboptimal way to install a package with all of these very complicated dependencies.

dbuscombe-usgs commented 2 years ago

I made my own yml file installer to deal with the issues of the mixed conda/pip installation, that works a lot better and in fewer lines of code. and far fewer instructions, see here However, the __init__ script for the sandpyper pip package has broken syntax that means I get errors . So, I should now be able to run this software when @npucino gets a chance to revisit that.

Installation is typically the most difficult thing to get right with these types of python packages. I expect it will be plain sailing after that!

npucino commented 2 years ago

@dbuscombe-usgs thanks for the detailed issue report, I will fix this Monday. I tested the package installation and unit tested the code in github action and assumed it worked out in the real world. Also I noticed that for some reason your pip install sandpyper downloaded a very old version 0.0.2 (with that typo in the init file), while the latest one in pip is 1.0.0 which is drastically different. I think it is becasue the pip install GDAL confilcts made pip to roll back to a previous version where the GDAL doesn't break the imports. I ll look into that!

dbuscombe-usgs commented 2 years ago

Great, thanks for the quick response @npucino - that sounds like a good plan. I'll check back in Monday and try to help troubleshoot. I'm pretty good at finding conda and pip issues in other people's code, so don't take it personally (I also really hate the python ecosystem for its crummy package managers!)

in the yml file I made, it does successfully create all the dependencies in conda using conda-forge. then it is looking for a version of sandpyper on pip that is compatible, That is version 0.0.2. That would explain why, I think

but it does prove that a) you can use conda entirely to set up the dependencies in a conda environment without being overly prescriptive about specific versions of packages, and that b) its all the pip-only dependencies that are breaking the install. Its not clear why, for example, why the requirements.txt in the pypi package lists the same libraries as you suggest installing into the conda environment ....

my general recommendation here is that conda environments as a rule want to install conda packages, not pip ones. It works, but only if you allow conda to take precedence for as many of the dependencies as you can, and dont cross-post dependencies between conda and pip, you'll likely have fewer install dependencies

I can also try installing on linux and report back what I find on the issues tab on the repo page

npucino commented 2 years ago

No no I do not take it personally @dbuscombe-usgs, please be as critic as you can I need it to grow and get better at this and improve that package! No mercy is good science. I think I will have to change the installation procedure and move it all to conda rather than pip. CHeers

crvernon commented 2 years ago

Thanks @dbuscombe-usgs and @npucino it looks like you are working towards a solution. Feel free to reach out if you have any questions.

whedon commented 2 years ago

:wave: @chrisleaman, please update us on how your review is going (this is an automated reminder).

whedon commented 2 years ago

:wave: @dbuscombe-usgs, please update us on how your review is going (this is an automated reminder).

chrisleaman commented 2 years ago

👋 @chrisleaman, please update us on how your review is going (this is an automated reminder).

Sorted out package installation issues, working through example notebooks at the moment.

dbuscombe-usgs commented 2 years ago

I will try again to install and run the tests / notebooks, so I can complete the 'functionality' section of the checklist, otherwise I have read through the paper and documentation and they are both good as-is

npucino commented 2 years ago

Thx @chrisleaman and @dbuscombe-usgs for the reviews. I am currently working on:

  1. investigate the bugs reported by @chrisleaman
  2. move the package into conda-forge as a conda recipe

So, maybe @dbuscombe-usgs you can wait until I fix those issues before running trough the code. Just a suggestion.

CHeers!

dbuscombe-usgs commented 2 years ago

Yes, I can certainly wait. Thanks!

chrisleaman commented 2 years ago

@crvernon - Just an update, I'm happy with code and documentation as @npucino has addressed the issues I've raised. I have a couple of minor suggestions for the manuscript which I will finalise and create an issue for by the end of the week.

npucino commented 2 years ago

@whedon set v1.2.0 as version

whedon commented 2 years ago

I'm sorry @npucino, I'm afraid I can't do that. That's something only editors are allowed to do.

crvernon commented 2 years ago

:mega: Mid-week rally! Just checking in to see how things are going.

:open_hands: Thanks @chrisleaman for the notification! Glad to see things are progressing well.

@dbuscombe-usgs and @npucino it looks like you both are still working out some bugs so keep up the good work!

Have a great and productive rest of the week!

npucino commented 2 years ago

Hi @crvernon , I packaged the code in conda and I already pull request its recipe to the team: https://github.com/conda-forge/staged-recipes/pull/16228

I am just waiting to be added to the conda server and then test with @dbuscombe-usgs if everything works well with this strategy! I do not know how long this usually take, hopefully not too much.

Cheers! Nick

dbuscombe-usgs commented 2 years ago

I'm on vacation for a week then I can get to finalizing my review. Apologies, I left the computer at home ...

npucino commented 2 years ago

I am having an hard time relaxing the requirements for Sandpyper right now.

Basically, as per conda request and @dbuscombe-usgs request I was trying to use the most recent versions of Sandpyper dependencies rather than providing strict versions of those. Apart from minor changes to the code (CRS definition and some minor Pysal import changes), the code should run smoothly with no issues. Note that the tests and Jupyter Notebooks that I created are based on the data I generated with Sandpyper within the environment with strict dependencies requirements.

Issue

When I create a fresh new environment with no dependencies versions specified except for Python = 3.8, I have (for now) one issue that I cannot understand where it comes from.

In the unsupervised labelling procedure, within my get_sil_location, I use Scikit-Learn KMeans and Silhouette score functions to cluster my points and to compute silhouette scores. I set an integer as random state so that I obtain always the same results. The expected results must be obtained in order to run the tests and to follow along the notebooks I provided to see how Sandpyper works. The code works, but with the relaxed environment the labels KMeans assign are slightly different, which affects the silhouette scores, the suboptimal number of clusters and eventually the labels the points will be assigned to. In order to run tests and provide users with actionable Jupyter Notebooks, these little changes would require me to go through the dataset again, create new polygons of corrections and re-identify the classes of all the test surveys, which takes quite a bit of time and I ideally would like no to go through that route at this stage if possible. So I looked at the changelog of sklearn (changes from 0,24 to 1.0 versions) and some minor changes occured in the KMeans algorithm, but those seems to be related to performances rather than random state. I also tested environments completely relaxed except for sklearn==0.24.1 which is the one I used for the test data, but the inconsistencies are still there.

My main suspect was the random state but it looks like that is not the case. I don't understand what is causing to change the output from the get_sil_location function, despite it is working as expected. If I use the strict environment, everything works as expected.

Question

As @chrisleaman already proved, Sandpyper is working as expected and passing tests as is (hosted in PyPI with strict dependencies requirements). Is hosting Sandpyper in Anaconda a must do for this package to be published in JOSS? It is of course very important to relax the requirements and upload the package to conda and I really aim to do it in the near future. But if this is not super strictly necessary in the context of publishing in JOSS, I would rather go through the process in a later stage.

Cheers! Nick

chrisleaman commented 2 years ago

Re https://github.com/openjournals/joss-reviews/issues/3666#issuecomment-927454043, I'm fine for this JOSS review if sandpyper is not on conda.

We should still confirm that @dbuscombe-usgs is able to get it installed and running in his environment or if there are still other issues.

npucino commented 2 years ago

Re #3666 (comment), [...] We should still confirm that @dbuscombe-usgs is able to get it installed and running in his environment or if there are still other issues.

Absolutely. Thanks.

dbuscombe-usgs commented 2 years ago

I am back from vacation and will be ready to tackle this tomorrow. Apologies for the delay

crvernon commented 2 years ago

@npucino - to address your earlier question of:

Is hosting Sandpyper in Anaconda a must do for this package to be published in JOSS?

The answer from my perspective is: no. Having this package able to be installed and function given the documentation and claims that you provide is sufficient. I agree that this would be a nice addition in the future to extend the usability of your package; though it may not be in the current scope of your development.

dbuscombe-usgs commented 2 years ago
  1. I have removed the previous sandpyper_env conda environment I made before
  2. I don't see any instructions for a conda package installation on https://github.com/npucino/sandpyper, so I'm not really sure whether that is yet an option (it seemed like it was, from this thread, but now I'm not so sure). I was expecting the README to be updated with instructions for how you would like to me to install the sanpyper conda-forge package.
  3. However, I have now successfully install using the provided instructions on the README (using the suboptimal legacy versions of various packages) and that has given no errors on installation. Progress!
  4. The installation process is still really messy - for many of the reasons I previously stated - there are several packages that get installed, and reinstalled. Some get reinstalled several times! Using a mixture of pip and conda to specify a mixture of major, minor, and incremental versions of programs is kinda sloppy, and I refuse to accept this is absolutely necessary, but it WORKS (now), so I'll let it slide!

I agree with @crvernon and @chrisleaman that a conda solution shouldnt be a requirement for JOSS, however I will say that scikit-learn is the major ML library in python, and using a version that is several versions behind the main trunk (we are at 1.0 now) is storing up problems for the future, IMHO, if this is to be maintained for several years. I anticipate you'll be fielding a lot of questions about installation going forward, if the toolbox proves popular (I hope so). However, over time hopefully a solution is provided. I took a quick look at the changelogs and API references for the KMeans functions over time and, as @npucino noted, there doesn't seem to be anything related to random seeds or state, so it is puzzling!

To run the notebooks., the instructions are inadequate for a novice user, who needs to be told they need to download the github package (like I said before, why go to the bother of making a pip package when the user has to download the code anyway?). A few sentences here in the README explaining to the potential user would go a long way

In the first notebook, "1 - Introduction and data preparation", you don't explain where 'demo_data.rar' is located. Its not a huge deal, because I discovered you actually don't need it, but I did waste about 10 minutes looking for it! I looked at the paper dataset, logging into the propeller site as-per the 'data availability' statement, but I dont quickly see how to download that data file. In fact, it is not at all clear how to download any data from the propeller site...

I am able to run all the provided examples. The toolbox is fast, and very useful. The methods work as-described, and the statements made about the software in the paper and supporting info are accurate. Good job!

I'll complete the checklist and APPROVE this for JOSS. I trust the level of documentation for installation and implementation will improve with a growing user base

npucino commented 2 years ago

Hi @dbuscombe-usgs , thanks for your feedback! Let me address your valuable points one at the time.

Conda-forge package

One of the major request from yourself and conda-forge reviewers is to relax the requirements for Sandpyper. If I relax the requirements basically telling conda to install the latest versions of packages which are compatible with the user active environment, I get that weird behaviour that I explained in my last comment https://github.com/openjournals/joss-reviews/issues/3666#issuecomment-927454043. This doesn't mean the code doesn't work, it just means that because small differences in the labels assigned by KMeans, I have to re-build the test data from scratch (correction polygons and class dictionaries). This requires some spare time from my side which I am more than willing to spend in my evenings every now and then rather than full time right now. Because it is my priority to make this package better by relaxing requirements and using conda instead of pip.

mixture of pip and conda

I think I see what you mean, when you install the package with pip in the conda environment Sandpyper automatically installs (again) the packages and uses pip. Is this the issue you are referring to?

Because if the conda environment sandpyper_envis created by copying and pasting the requirements in the README, when you install sandpyper 1.2.0 (latest) using pip, you should have already all the right packages installed and this should only be a kind of check. If for some reason your environment has some required packages that differ from the versions specified in both the conda create environment ...(README) and in the requirements.txt, then it uses pip to install those. And that is causing issues.

In my case though this doesn't happen. See the image below. After I created the sandpyper_env (in this case renamed sandpyper_env2), I install sandpyper using pip and this is what I mean by "check".

image

But if that is not what happens to you, there is something weird going on. I am looking into it right now.

To run the notebooks., the instructions are inadequate for a novice user

You are right, I hadn't thought about this! Actually this is a really good point and I would love to hear your suggestion on how to solve this. Even once the package will be in conda, to run the notebooks users will need to download the repo anyway to get the test_data.rar (which by the way I forgot to include it in its compressed format in the repo). I have 2 ideas:

  1. Host the test_data.rar somewhere else easy to download
  2. Use binder to run the notebooks online

option 1 is easy and I can do it quickly. Option 2 is much better but I must first learn how to use binder and then do it, which might take a bit. But looking at this seems easy. Anyway, I now update the README and the notebooks to explaining how to get the test_data.rar.

In the first notebook, "1 - Introduction and data preparation", you don't explain where 'demo_data.rar' is located.

Sorry, definitely something I will explain at the beginning of the notebook. As per the propeller platform, the data is freely accessible but not downloadable at the moment. You can access it, visualize it, make measurements and see changes in all 20 sites in the past 3 years but as of today, it is not downloadable. I think it will eventually be but I have to check with my supervisor. I see the confusion here, in the paper there is written freely accessible which can be misleading.

More than 220 3D datasets are already freely accessible to anyone via a user friendly web-platform to share and communicate information, promote coastal awareness, build knowledge and further increase the impact of our efforts.

I update the README and the notebook and find a way to make the test_data.rarat least downloadable. Thanks again! Nick

dbuscombe-usgs commented 2 years ago

Awesome, thanks @npucino for that quick and thorough response. You've totally cleared up any remaining confusion I had over the KMeans issue.

Thanks for updating the README and notebook

On the mixture of pip and conda, I do get the same output as you, and I do now understand the need for the specific versions and the pip package. But ...the way you mix conda and pip is not optimal: a conda yml file would allow you to combine all requirements for conda and pip into a one-line installation (conda env create -f sandpyper.yml). You would list all the conda and pip (for now, just sandpyper) dependencies together and it would replace the two separate commands you have currently 1) conda create --name sandpyper_env python=3.8 geopandas=0.8.2 matplotlib=3.3.4 numpy=1.20.1 pandas=1.2.2 tqdm=4.56.2 pysal=2.1 rasterio=1.2.0 richdem=0.3.4 scikit-image=0.18.1 scikit-learn=0.24.1 scipy=1.6.0 seaborn=0.11.1 tqdm=4.56.2 pooch=1.4.0 fuzzywuzzy and 2) pip install sandpyper . It would be a much cleaner installation, because conda would not need to search and resolve everything twice, making it faster, and also probably requiring much less disk space (for conda caching). Most importantly, in my experience it is less likely to break this way.

I provided an example you could modify by hard coding in the version numbers. I'm happy to help you test it

crvernon commented 2 years ago

@dbuscombe-usgs and @npucino - I noticed that all tasks have been checked. Are you still collaborating on fixes for this review or have all of your inquiries been addressed @dbuscombe-usgs ?

Also, I noticed that the following related issues are still open:

dbuscombe-usgs commented 2 years ago

Hi @cvernon, I am satisfied - I approve this submission

I will still work with @npucino to help test remaining issues with conda packaging

crvernon commented 2 years ago

Thanks @dbuscombe-usgs I really appreciate your thorough review and willingness to volunteer so much of your time to make this the best product it can be!

npucino commented 2 years ago

Hi guys, thanks again for you extremely valuable comments and contributions!

Yesterday I was trying the .yml option with different degrees of version requirements and still getting some tests to fail. Right now I am also testing different versions of requirements to see if I can relax at least part of the requirements and isolate the issue.

But, using the strict version requirements, tests do not fail so for now, I will stick with this sub-optimal solution.

By the way @dbuscombe-usgs, I addedtest_data.rar in the repo and updated README and notebooks with direct link for downloading the data. I am now trying the last few versions of yml and the I update the README so that users will use the conda env create -f sandpyper_env.yml option.

I never learnt so much in a review process. Thanks a lot @dbuscombe-usgs @chrisleaman @crvernon . Nick

npucino commented 2 years ago

Sorry, last quick update!

Actually, I am close to nail down this weird random generator issue. Apparently something happened between numpy 1.20 and 1.21 (release note) with their default random number generator. I really start to think that is numpy that causes my tests to fail behind the curtains, affecting sklearn KMeans function, because I noticed that if I use numpy 1.20 it works, if I use 1.21 tests do not pass. Therefore, the only requirement I might need is numpy 1.20 but I still have to try it. And, numpy is so fundamental that it is not a long-term solution at all, but, I will get there!

Cheers!

crvernon commented 2 years ago

@npucino no problem at all. Let me know when you have the package in the condition that you want it published in and I will proceed with my tasks.

crvernon commented 2 years ago

@npucino just curious to see if you are ready to move forward with this or if you are still making some final modifications?

npucino commented 2 years ago

Hi @crvernon , sorry my late reply! I am now fixing one last bit and then I am happy for you to proceed with the reviewing process. I am working at it right now, I might be done today, I let you know here very soon, thanks!

npucino commented 2 years ago

@crvernon All good! Ready to move forward. THanks!

crvernon commented 2 years ago

@whedon generate pdf

whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

crvernon commented 2 years ago

@whedon check references

whedon commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1111/0031-868x.00152 is OK
- 10.1371/journal.pone.0118571 is OK
- 10.1890/10-1510.1 is OK
- 10.1038/s41598-021-83477-6 is OK
- 10.1016/j.earscirev.2017.04.011 is OK
- 10.1016/j.ecolecon.2006.10.022 is OK
- 10.1016/j.ocecoaman.2018.01.001 is OK
- 10.1016/j.isprsjprs.2015.02.009 is OK
- 10.21105/joss.01890 is OK
- 10.1038/sdata.2016.24 is OK
- 10.3390/data4020073 is OK
- 10.3390/ijgi8060267 is OK
- 10.1016/j.geomorph.2016.02.025 is OK

MISSING DOIs

- None

INVALID DOIs

- None
crvernon commented 2 years ago

@npucino - very nice paper! The following is a list of a few items to address in your paper specifically before we move forward:

![My figure caption \label{fig:fig1}](my_figure.png)

We find in \autoref{fig:fig1} that show something...

❗ Let me know when these items are taken care of and we will move on. Thanks and keep up the good work!