se-sic / coronet

coronet – the R library for configurable and reproducible construction of developer networks
GNU General Public License v2.0
7 stars 15 forks source link

Wrong data type of NA timestamps depending on order of data sources #269

Open bockthom opened 5 hours ago

bockthom commented 5 hours ago

Description

project.data$get.data.cut.to.same.date(data.sources = c("issues", "mails", "commits")) sometimes fails if some of the data sources are empty, ending up in the following error message:

ERROR::The bins parameter, if present, needs to be a vector whose elements represent dates

This error occurs if the issue data is the empty data source, but not if the mail data is the empty data source, which is an unexpected behavior.

After some debugging, I noticed that the error is caused by wrong data types when converting strings to POSIXct dates: Usually, if the date is not a POSIXct object, it is converted to POSIXct when splitting the data: https://github.com/se-sic/coronet/blob/24005e44aaf602789168f69dbcd5cc0a077c0194/util-split.R#L69-L70 However, this fails if the bins (i.e., the timestamps used for cutting) are not a string but a numeric (such as UNIX timestamps). Therefore, we need to look at where the timestamps for cutting come from: They are extracted in the function extract.timestamps: https://github.com/se-sic/coronet/blob/24005e44aaf602789168f69dbcd5cc0a077c0194/util-data.R#L724-L758

If no data is available for the specified data source, the corresponding timestamps are set to NA. For whatever reason, this converts the data frame used for cutting into a numeric data frame of UNIX timestamps if the first row contains NA values. Otherwise, if there are already POSIXct objects in there, the inserted NA values are interpreted to be POSIXct objects.

Suggested Fix

Instead of setting the timestamps to NA in case of missing data sources, set the timestamps to as.POSIXct(NA). This way, we ensure that the data type is POSIXct even if the NA values are the very first ones that are inserted into the timestamps data frame.

Versions

This affects, at least, coronet version 4.4 in combination with R version 4.4.1. Also earlier coronet versions or R versions may be affected by this bug.

bockthom commented 5 hours ago

I have already implemented a fix for that, see the linked patch: 0001-Ensure-correct-data-type-of-NA-timestamps.patch

However, we also should properly test that, to make sure that the patch properly works in various situations and does not break anything.