Hi @Elchin. Responses to your comments are below. Where applicable I’ve made notes of any immediate actions taken in each subsection. If you feel that additional, specific changes should be made to the library or paper at this point, please let me know. I’d also be happy to continue the discussions on gap-filling, QA/QC and how these could become more standardized in the community.
It is certainly important to recognize the variety of data that are relevant to permafrost science. The tsp package has a narrow scope: it is currently designed only for ground temperature time series data, although it could be adapted fairly easily to any time series observations at multiple depths (such as soil moisture).
I think there’s merit in developing similar shared software tools to handle other kinds of permafrost data such as those you’ve mentioned. But it makes sense to me for these to be separate libraries.
An acknowledgement of the importance of other kinds of permafrost data has been added to the paper in commit 2950e4de, using the ESSD reference provided.
Within `pandas`, I’m only aware of the `fillna` and `resample` methods for gap-filling, although there are surely other libraries for gap-filling time series data.
I suppose there are two components to gap filling: Filling in any missing time periods with NA values and then imputing data. Certainly, it would be easy to provide a method for the former if the target sampling frequency were provided (although this may change over the course of data collection). Handling the latter requires more consideration of how the data will be used. Simple back- or forward-filling provided in pandas is suitable near or below the depth of zero annual amplitude but could be wildly inappropriate in the active layer. At shallower depths, linear interpolation may be suitable but only for short temporal gaps.
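To make the two components concrete, here is a minimal sketch using only pandas. It is not part of tsp: the function names `insert_missing_timestamps` and `interpolate_short_gaps`, the hourly target frequency, the depths used as column labels and the `max_gap` value are all assumptions chosen for illustration. The first function inserts NA rows for missing time periods, and the second interpolates linearly in time only across gaps that are short enough.

```python
import numpy as np
import pandas as pd


def insert_missing_timestamps(df: pd.DataFrame, freq: str) -> pd.DataFrame:
    """Reindex onto a regular time axis so missing periods appear as NaN rows."""
    full_index = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    return df.reindex(full_index)


def interpolate_short_gaps(df: pd.DataFrame, max_gap: int) -> pd.DataFrame:
    """Interpolate linearly in time, but only across gaps of at most `max_gap` rows."""
    filled = df.interpolate(method="time", limit_area="inside")
    for col in df.columns:
        is_na = df[col].isna()
        # Label each run of consecutive equal values (NaN or not NaN).
        run_id = is_na.ne(is_na.shift(fill_value=False)).cumsum()
        # Length of each NaN run (0 for runs of valid observations).
        run_len = is_na.groupby(run_id).transform("sum")
        # Restore NaN wherever the original gap was longer than allowed.
        filled.loc[is_na & (run_len > max_gap), col] = np.nan
    return filled


# Hypothetical example: columns are depths in metres, values are temperatures.
times = pd.to_datetime(
    ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 03:00", "2020-01-01 08:00"]
)
raw = pd.DataFrame(
    {0.5: [-1.0, -2.0, -4.0, -1.0], 5.0: [-3.0, -3.0, -3.1, -3.1]},
    index=times,
)

regular = insert_missing_timestamps(raw, freq="60min")  # 9 hourly rows, 5 of them NaN
filled = interpolate_short_gaps(regular, max_gap=2)     # fills 02:00, leaves 04:00-07:00
```

In this sketch the one-hour gap at 02:00 is filled while the four-hour gap stays as NA. A single `max_gap` for all depths is a simplification. As noted above, an appropriate limit would likely differ between the active layer and depths near zero annual amplitude.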
My intuition is that robust automated methods for imputing ground temperature data would need to use all available information from the dataset: depths, times, and other values. This will require careful consideration and testing. However, as discussed below, the tsp package provides a starting point that can help standardize methods across the community regardless of source data format.
One of the motivations for the tsp package was indeed the need for rapid, effective QA/QC of both sensor and model data. Too often, researchers re-write code for either reading their data into data frames or for making generic plots for visual inspection (often the first and only check performed on a dataset). This version of tsp already streamlines these two phases of the QA/QC process.
The next step of standardizing QA/QC should be the development of a set of generic and open-source functions for identifying suspicious or implausible data. These should be as formatting-agnostic as possible, following the example of the plotting functions in tsp. Some initial thought has been put into this in a separate data cleaning library, which may be integrated into tsp in the future. Some of these checks are simple and already widely used (e.g. upper and lower cut-off thresholds). More sophisticated QA/QC checks that consider depth and rates of change may also be worth investigating (some of which are described here).
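As a rough illustration of what such generic checks might look like, the sketch below operates on a plain pandas data frame with a datetime index and one column per depth. It does not reproduce the data cleaning library mentioned above or any tsp function. The names `flag_out_of_range` and `flag_rapid_change`, and the threshold values, are hypothetical and chosen only for the example.

```python
import pandas as pd


def flag_out_of_range(df: pd.DataFrame, lower: float, upper: float) -> pd.DataFrame:
    """Return True where a value falls outside the interval [lower, upper]."""
    return (df < lower) | (df > upper)


def flag_rapid_change(df: pd.DataFrame, max_rate_per_hour: float) -> pd.DataFrame:
    """Return True where the change since the previous row exceeds the given rate."""
    # Elapsed time between consecutive observations, in hours.
    hours = df.index.to_series().diff().dt.total_seconds() / 3600.0
    rate = df.diff().div(hours, axis=0).abs()
    # The first row has no predecessor, so its rate is NaN and it is not flagged.
    return rate > max_rate_per_hour


# Hypothetical example: columns are depths in metres, values are temperatures in °C.
index = pd.date_range("2020-01-01", periods=4, freq="60min")
temps = pd.DataFrame(
    {0.5: [-1.0, -1.2, 45.0, -1.1], 5.0: [-3.0, -3.0, -3.0, -3.0]},
    index=index,
)

range_flags = flag_out_of_range(temps, lower=-60.0, upper=40.0)  # assumed cut-offs
rate_flags = flag_rapid_change(temps, max_rate_per_hour=5.0)     # assumed limit
```

Both functions return boolean frames with the same shape as the input, so flags can be combined with `|` or used to mask values. A depth-aware version could pass a per-column limit rather than a single number, since plausible rates of change are much smaller at depth than near the surface.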
The tsp library helps to standardize QA/QC in three ways. First, it provides a common starting point for data from various sensors and databases. This means that QA/QC techniques that are developed by one research group using tsp can be more easily adopted by others. Second, the QA/QC functions described above could be added as a module of the tsp package, creating more of a ‘one-stop-shop’ for temperature data handling. Finally, the TSP class itself provides a way to streamline running those functions by doing the data manipulation behind the scenes. By providing an easy QA/QC solution, more people are likely to adopt the same techniques, creating a de facto standard.
I suggest including the gap-filling and data QA/QC discussion in the paper. These are very important issues when it comes to data processing. I am leaving that up to you. My review is complete. Thank you for your work!
Overall, this is a timely and important piece of software that still requires a lot of work due to the heterogeneity of permafrost datasets. However, it is the first substantial attempt in the right direction. Standardization of data collection and processing procedures is required to make this software universal. I have some comments about permafrost data:
References
Clayton, L. K., et al.: Environ. Res. Lett., 16, 055028, 2021.
Wang, K., Jafarov, E., Overeem, I., Romanovsky, V., Schaefer, K., Clow, G., Urban, F., Cable, W., Piper, M., Schwalm, C., Zhang, T., Kholodov, A., Sousanes, P., Loso, M., and Hill, K.: A synthesis dataset of permafrost-affected soil thermal conditions for Alaska, USA, Earth Syst. Sci. Data, 10, 2311–2328, https://doi.org/10.5194/essd-10-2311-2018, 2018.