comments to N. Brown's paper

Elchin commented 2 years ago

Overall, this is timely and important software that still requires a lot of work due to the heterogeneity of permafrost datasets. However, it is the first substantial attempt in the right direction. Standardization of data collection and processing procedures is required to make this software universal. I have some comments about permafrost data:

1. Permafrost data are not limited to ground temperature. They also include active layer thickness, soil moisture, snow depth, air temperature, organic layer thickness, etc. (Wang et al., 2018), as well as airborne and remotely sensed data (Clayton et al., 2021).
2. Data gap-filling: What gap-filling methods are available in pandas? How can they be adapted by tsp? What is needed to develop robust gap-filling methods?
3. QA/QC checking is an important part of data pre-processing. Can this software be used to help standardize the data QA/QC process?

References

Clayton, L. K., et al.: Environ. Res. Lett., 16, 055028, 2021.

Wang, K., Jafarov, E., Overeem, I., Romanovsky, V., Schaefer, K., Clow, G., Urban, F., Cable, W., Piper, M., Schwalm, C., Zhang, T., Kholodov, A., Sousanes, P., Loso, M., and Hill, K.: A synthesis dataset of permafrost-affected soil thermal conditions for Alaska, USA, Earth Syst. Sci. Data, 10, 2311–2328, https://doi.org/10.5194/essd-10-2311-2018, 2018.

nicholas512 commented 2 years ago

Hi @Elchin. Responses to your comments are below. Where applicable I’ve made notes of any immediate actions taken in each subsection. If you feel that additional, specific changes should be made to the library or paper at this point, please let me know. I’d also be happy to continue the discussions on gap-filling, QA/QC and how these could become more standardized in the community.

1: Variety of Permafrost Data

Response

It is certainly important to recognize the variety of data that are relevant to permafrost science. The tsp package has a narrow scope: it is currently designed only for ground temperature time series data, although it could be adapted fairly easily to any time series observations at multiple depths (such as soil moisture).

I think there’s merit in developing similar shared software tools to handle other kinds of permafrost data such as those you’ve mentioned. But it makes sense to me for these to be separate libraries.

Action taken

An acknowledgement of the importance of other kinds of permafrost data has been added to the paper in commit 2950e4de using the ESSD reference provided.

2: Gap filling

Response

Within pandas, I’m only aware of the fillna and resample methods for gap-filling, although there are surely other libraries for gap-filling time series data.

I suppose there are two components to gap filling: Filling in any missing time periods with NA values and then imputing data. Certainly, it would be easy to provide a method for the former if the target sampling frequency were provided (although this may change over the course of data collection). Handling the latter requires more consideration of how the data will be used. Simple back- or forward-filling provided in pandas is suitable near or below the depth of zero annual amplitude but could be wildly inappropriate in the active layer. At shallower depths, linear interpolation may be suitable but only for short temporal gaps.
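The two components described above can be sketched with plain pandas. This is only an illustration under stated assumptions, not part of tsp: the function names `insert_missing_timesteps` and `interpolate_short_gaps`, the daily target frequency, and the `max_gap` cut-off are all hypothetical, and the data frame is assumed to have a DatetimeIndex with one column per depth.

```python
import numpy as np
import pandas as pd


def insert_missing_timesteps(df: pd.DataFrame, freq: str) -> pd.DataFrame:
    """Reindex onto a regular time grid so missing periods appear as NaN rows."""
    full_index = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    return df.reindex(full_index)


def interpolate_short_gaps(df: pd.DataFrame, max_gap: int) -> pd.DataFrame:
    """Linearly interpolate in time, keeping gaps longer than max_gap rows as NaN."""
    filled = df.interpolate(method="time", limit_area="inside")
    for col in df.columns:
        is_na = df[col].isna()
        # Label each run of consecutive missing / non-missing values.
        run_id = is_na.astype(int).diff().ne(0).cumsum()
        run_len = is_na.groupby(run_id).transform("size")
        # Undo the interpolation wherever the original gap was too long.
        too_long = is_na & (run_len > max_gap)
        filled.loc[too_long, col] = np.nan
    return filled


# Hypothetical daily record at 0.5 m depth with a 1-day and a 4-day gap.
times = pd.to_datetime(
    ["2021-01-01", "2021-01-02", "2021-01-04", "2021-01-05", "2021-01-10"]
)
raw = pd.DataFrame({0.5: [1.0, 2.0, 4.0, 5.0, 10.0]}, index=times)

regular = insert_missing_timesteps(raw, freq="D")
filled = interpolate_short_gaps(regular, max_gap=2)
```

Here the one-day gap is filled by linear interpolation while the four-day gap stays missing. A suitable `max_gap` would likely depend on depth, which this sketch does not attempt to model.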

My intuition is that robust automated methods for imputing ground temperature data would need to use all available information from the dataset: depths, times, and other values. This will require careful consideration and testing. However, as discussed below, the tsp package provides a starting point that can help standardize methods across the community regardless of source data format.

3: QA/QC

Response

One of the motivations for the tsp package was indeed the need for rapid, effective QA/QC of both sensor and model data. Too often, researchers re-write code for either reading their data into data frames or for making generic plots for visual inspection (often the first and only check performed on a dataset). This version of tsp already streamlines these two phases of the QA/QC process.

The next step in standardizing QA/QC should be the development of a set of generic, open-source functions for identifying suspicious or implausible data. These should be as formatting-agnostic as possible, following the example of the plotting functions in tsp. Some initial thought has been put into this in a separate data cleaning library, which may be integrated into tsp in the future. Some of these checks are simple and already widely used (e.g. upper and lower cut-off thresholds). More sophisticated QA/QC checks that consider depth and rates of change may also be worth investigating (some of which are described here).
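As a rough illustration of what such generic checks could look like, the sketch below flags values outside cut-off thresholds and values that change faster than a chosen rate. It is not taken from tsp or from the data cleaning library mentioned above. The function names, the thresholds (-60 and 40 °C, 1 °C per hour), and the example data are assumptions made for the example, which expects a DatetimeIndex and one column per depth.

```python
import pandas as pd


def flag_out_of_range(df: pd.DataFrame, lower: float, upper: float) -> pd.DataFrame:
    """True where a value falls outside the plausible range [lower, upper]."""
    return (df < lower) | (df > upper)


def flag_rapid_change(df: pd.DataFrame, max_rate_per_hour: float) -> pd.DataFrame:
    """True where the change since the previous reading exceeds a rate limit."""
    # Elapsed time between consecutive readings, in hours.
    hours = df.index.to_series().diff().dt.total_seconds() / 3600.0
    rate = df.diff().div(hours, axis=0).abs()
    return rate > max_rate_per_hour


# Hypothetical hourly temperatures (degrees C) at 0.5 m and 5.0 m depth.
times = pd.date_range("2021-07-01", periods=5, freq="60min")
temps = pd.DataFrame(
    {
        0.5: [-2.0, -1.9, 85.0, -1.8, -1.7],  # one implausible spike
        5.0: [-3.0, -3.0, -3.0, -1.0, -3.0],  # one abrupt jump at depth
    },
    index=times,
)

out_of_range = flag_out_of_range(temps, lower=-60.0, upper=40.0)
rapid_change = flag_rapid_change(temps, max_rate_per_hour=1.0)
suspicious = out_of_range | rapid_change
```

Both functions return boolean frames with the same shape as the input, so flags from different checks can be combined and then reviewed or masked. A depth-aware version might tighten the rate limit for deeper sensors, where temperatures change more slowly.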

The tsp library helps to standardize QA/QC in three ways. First, it provides a common starting point for data from various sensors and databases. This means that QA/QC techniques that are developed by one research group using tsp can be more easily adopted by others. Second, the QA/QC functions described above could be added as a module of the tsp package, creating more of a ‘one-stop-shop’ for temperature data handling. Finally, the TSP class itself provides a way to streamline running those functions by doing the data manipulation behind the scenes. By providing an easy QA/QC solution, more people are likely to adopt the same techniques, creating a de facto standard.

Elchin commented 2 years ago

I suggest including the gap-filling and data QA/QC discussion in the paper. These are very important issues when it comes to data processing. I am leaving that up to you. My review is complete. Thank you for your work!
