Open huichen-cs opened 4 years ago
From the FAQ page:
I've seen weird commit timestamps
Git records the commit timestamp on the developer's workstation. If the clock is misconfigured, timestamps will be weird.
We have seen timestamps such as 0000-01-01 00:00 or 2034-12-31 23:59. GitHub and GHTorrent do not process the timestamps in any way.
One of the projects you referenced has future dates in its authored-at timestamps, e.g. 2046-09-18T00:00:00Z.
This problem affects all data on GH itself - here's a test commit I just made in 2021.
Unfortunately, I don't think there's anything we can do about it. At least instances of this should be pretty rare.
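Although the recorded timestamps cannot be corrected, anyone consuming the data could at least flag the implausible ones. Below is a minimal sketch of such a check, not something GHTorrent provides. The function name `is_plausible` and the lower bound (the Unix epoch) are assumptions chosen for illustration; pick bounds that suit your own analysis.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed lower bound: the Unix epoch. This is an illustrative choice,
# not a rule used by Git, GitHub, or GHTorrent.
EARLIEST_PLAUSIBLE = datetime(1970, 1, 1, tzinfo=timezone.utc)


def is_plausible(timestamp: str, now: Optional[datetime] = None) -> bool:
    """Return True if an ISO 8601 timestamp lies between the assumed
    lower bound and `now` (defaults to the current UTC time)."""
    if now is None:
        now = datetime.now(timezone.utc)
    try:
        # Older Python versions do not accept a trailing "Z", so map it to an offset.
        parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    except ValueError:
        # Unparseable values (for example year 0000) are treated as implausible.
        return False
    if parsed.tzinfo is None:
        # Assume UTC when the timestamp carries no offset.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return EARLIEST_PLAUSIBLE <= parsed <= now
```

With a fixed `now` in 2021, a timestamp such as 2046-09-18T00:00:00Z is flagged as implausible, and so is 0000-01-01 00:00.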
It looks to me like the most recent dump (2019-06-01) has wrong timestamps. Here is how to reproduce the error.
In http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2019-06-01.tar.gz, I found the following,
I loaded the CSV files into a PostgreSQL database and then ran a query. These projects are,
I compared the dates in the CSV file with the commit logs of these projects, and the dates are indeed wrong. Similarly, I also found these,
In fact, if we define the duration of a project as the number of years between its oldest and newest commits, group the projects by duration, and count the projects in each group, we get the following,
where `count` is the number of projects whose `duration` is given in the left column.
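The grouping described above can be sketched roughly as follows. This is not the query that produced the numbers in this issue, only an illustration of the idea. It assumes each commit row has been reduced to a `(project_id, created_at)` pair and that duration is measured as the difference in calendar years; the function name `duration_histogram` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def duration_histogram(commits):
    """Map duration in years to the number of projects with that duration.

    `commits` is an iterable of (project_id, created_at) pairs, where
    created_at is a datetime. Duration is taken here as the calendar-year
    difference between a project's newest and oldest commit (an assumption).
    """
    oldest, newest = {}, {}
    for project_id, created_at in commits:
        if project_id not in oldest or created_at < oldest[project_id]:
            oldest[project_id] = created_at
        if project_id not in newest or created_at > newest[project_id]:
            newest[project_id] = created_at
    counts = Counter(newest[p].year - oldest[p].year for p in oldest)
    # Sort by duration so the result reads like the two-column table.
    return dict(sorted(counts.items()))
```

Projects with an implausibly long duration, for example several decades, are likely candidates for wrong timestamps and worth checking against the actual commit log.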