JOSS Review Feedback by Nick

Here is my feedback to improve the PyForestScan software for https://github.com/openjournals/joss-reviews/issues/7314.

General checks

[ ] Contribution and authorship: Has the submitting author (@iosefa) made major contributions to the software? Does the full list of paper authors seem appropriate and complete? The main author is the only contributor in the source code of this package based on the commits history. However, the co-authors might have helped the development in early stages before using git to track changes. As it stands, without a few lines of explanation of their roles, I assume the other co-authors had mentoring, reviewing and funding acquisition roles. In JOSS, these roles do not qualify as authorship (link) and should not be included in this submission. Please clarify the co-authors role or do not include them.
[ ] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines While this package does simplify several tasks (reading, cleaning, classifying, filtering, digital terrain and canopy height model creations and more) by acting as a wrapper around PDAL—which can be challenging to master due to its syntax and pipeline structure—it does not introduce any substantial new functionality except the computation of the metrics. For example, in the benchmarking section of the documentation, claims such as, “PyForestScan is designed for high performance and memory efficiency, ensuring it can handle large-scale point cloud datasets effectively,” could equally apply to PDAL itself, as PyForestScan primarily repackages PDAL’s capabilities rather than adding novel performance enhancements. In my opinion, this package is indeed valuable for those looking to streamline tasks that might otherwise require working with PDAL’s more complex syntax with the added bonus of including few key tree metrics. However, since it functions as a wrapper rather than a significant software innovation, I believe it does not meet the minimum requirements for publication as original software in JOSS.

Functionality

[ ] Installation: Installation works well, however, as you provide a Jupyter Notebook as a demo, you could include in the quickstart guide a section on how to install jupyter notebook (conda install jupyter) and run the demo with your test data.
[ ] Functionality: In the demo notebook, there is an error importing plot3d simply due to this functionality being removed due to dependencies problem (commit 249ca40). You just forgot to remove it from the demo notebook. [OPTIONAL]_ However, I think it is worthwhile to have a 3D visualization capability, so if you can fix the dependencies problem you had, it would be cool to re-integrate this function and provide an extra cell in the demo notebook where you showcase plot_3d function and output. Just a suggestion, not required.
[ ] Performance: The benchmark provided in the documentation appears to compare PDAL with leafRrather than showcasing any substantial improvements introduced by this package itself. This because the bulk of the processing time is spent in reading the point clouds rather than computing the metrics, I believe. As a consequence, there doesn’t seem to be a specific contribution from the authors that significantly speeds up the PAI computation beyond what PDAL (repackaged into PyForestScan) already offers.

Documentation

[ ] Example usage: As you have prepared a demo notebook, please mention the demo notebook from the documentation.

Software paper

[ ] References: Citation of PDAL is missing and of uttermost importance given this package almost entirely relies on PDAL.

For more detailed paper feedback, refer to this issue I opened.

@npucino Thank you again for these very detailed comments. They have helped to improve the repository, the library, and the paper.

I will continue my response to your comments (as I did in #13):

The main author is the only contributor in the source code of this package based on the commits history. However, the co-authors might have helped the development in early stages before using git to track changes. As it stands, without a few lines of explanation of their roles, I assume the other co-authors had mentoring, reviewing and funding acquisition roles. In JOSS, these roles do not qualify as authorship (link) and should not be included in this submission. Please clarify the co-authors role or do not include them.

We added a Contributions section to the manuscript. iosefa did contribute a significant majority of the source code, but this was mostly done together with benleamon -- either through pair programming or discussion. benleamon presented the first version of the software to FOSS4G Asia and, while he has not contributed as much code, he has had an important impact on the develop of this library. For this reason, I think that it is important that he remains an author. Ryan Perroy provided guidance on the direction of the project during its later stages, and helped in the writing of the manuscript. However, seeing that this may not qualify, we have agreed to remove him as an author and instead include him in the acknowledgements in recognition of his contributions and support.

While this package does simplify several tasks (reading, cleaning, classifying, filtering, digital terrain and canopy height model creations and more) by acting as a wrapper around PDAL—which can be challenging to master due to its syntax and pipeline structure—it does not introduce any substantial new functionality except the computation of the metrics. For example, in the benchmarking section of the documentation, claims such as, “PyForestScan is designed for high performance and memory efficiency, ensuring it can handle large-scale point cloud datasets effectively,” could equally apply to PDAL itself, as PyForestScan primarily repackages PDAL’s capabilities rather than adding novel performance enhancements. In my opinion, this package is indeed valuable for those looking to streamline tasks that might otherwise require working with PDAL’s more complex syntax with the added bonus of including few key tree metrics. However, since it functions as a wrapper rather than a significant software innovation, I believe it does not meet the minimum requirements for publication as original software in JOSS.

Thank you for raising this concern. We have re-organized the manuscript so that the emphasis is really on the tools that the library provides to calculate key forest structural metrics, rather than any performance gains that are now apparent due to the use of PDAL for IO operations. Metrics like PAI, PAD, FHD, and canopy height, are critical and fundamental measures of forest structure and are very important in many studies in ecology (not only forest ecology, but also the ecology of birds and other biological communities) and land cover dynamics. To our knowledge, their are no other Python libraries that calculate all of these metrics. As is now explained in the manuscript, having these in the Python ecosystem is important because Python use in ecological analysis is only increasing, and there are several other datasets and platforms that utilize Python, such as Google Earth Engine and the GEDI dataset which also include measures of PAI, PAD, FHD, and canopy height. That being said, we also feel that it is important that our library makes use of existing libraries like PDAL, and data formats like EPT, to enhance performance. PDAL and Entwine are incredibly powerful and important tools, and we acknowledge that our work would not be as great if it were not for many of the functions they provide. We hope that our changes to both the manuscript and the documentation of the library itself provide the credit they deserve to allow other scientists and software engineers like ourselves to build from their work to create new and more specialized tools for science and research. That said, there are also performance gains in our library. Though this is not the focus, the calculations of PAI in PyForestScan are roughly 20 times faster than the calculations of LAI (which is functionally equivalent) in existing libraries to calculate this metric in the R programming language. This has a pretty big impact when calculating PAI at very large landscape scales.

Installation: Installation works well, however, as you provide a Jupyter Notebook as a demo, you could include in the quickstart guide a section on how to install jupyter notebook (conda install jupyter) and run the demo with your test data.

Great suggestion! We made substantial revisions to the documentation, and are continuing to improve this. We have included a short guide for Jupyter.

Functionality: In the demo notebook, there is an error importing plot_3d simply due to this functionality being removed due to dependencies problem (commit 249ca40). You just forgot to remove it from the demo notebook. [OPTIONAL] However, I think it is worthwhile to have a 3D visualization capability, so if you can fix the dependencies problem you had, it would be cool to re-integrate this function and provide an extra cell in the demo notebook where you showcase plot_3d function and output. Just a suggestion, not required.

We have removed this function and dependency (for now!). We are planning to add this back, as I agree that it is cool and an important feature to have. Hopefully in the next major release!

Performance: The benchmark provided in the documentation appears to compare PDAL with leafR rather than showcasing any substantial improvements introduced by this package itself. This because the bulk of the processing time is spent in reading the point clouds rather than computing the metrics, I believe. As a consequence, there doesn’t seem to be a specific contribution from the authors that significantly speeds up the PAI computation beyond what PDAL (repackaged into PyForestScan) already offers.

The comparison with leafR is actually only between the functionality to calculate PAI, and not any of the PDAL dependent functions. That is to say, the benchmarks and comparison is made after reading the pointclouds. PDAL is primarily used to read, write, and preprocess the point cloud data. It is not used in the voxelization and metric calculation, which is the core contribution of the library. In the benchmarks, if we were to include the time it takes to read the data and calculate the metrics, the difference in time between our library and existing solutions would be much larger. Instead, we tried to make it as fair as possible. I hope this clarifies things!

Documentation Example usage: As you have prepared a demo notebook, please mention the demo notebook from the documentation.

The documentation has been enhanced significantly and now has all (we added more) the jupyter notebooks.

Software paper References: Citation of PDAL is missing and of uttermost importance given this package almost entirely relies on PDAL.

Thank you! This was important to add and we add added it both to the manuscript and the package documentation.

We hope that our revisions have helped to clarify the importance of this library and its contributions to science. We are planning to continue building the software to incorporate additional forest metrics -- like canopy cover, and other features such as tree segmentation.

iosefa / PyForestScan