gousiosg / github-mirror

Scripts to mirror Github in a cloudy fashion
BSD 2-Clause "Simplified" License
559 stars 106 forks source link

Wrong created_at values in commits.csv #97

Open huichen-cs opened 4 years ago

huichen-cs commented 4 years ago

Looks to me that the most recent dump (2019-06-01) has wrong time stamp. Below is how we may reproduce the error.

In http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2019-06-01.tar.gz, I found the following,

$ grep "2038-01-18" commits.csv
239581704,"1e954ef83c79ccccb8f570d350ea0ff934650fbb",4498938,4498938,18281400,"2038-01-18 23:00:00"
242299182,"0ebc3ea059dfcc8124135b94dfaa40ba7d49e5b2",1340592,1340592,18533076,"2038-01-18 23:00:00"
606676461,"d4cbfbc31940a514f2974405b579ff8989f9a0b6",33691743,33691743,55773517,"2038-01-18 00:00:00"

I loaded the CSV files to a PostgreSQL database, and then do a query. These projects are,

ghtorrent=# SELECT id,url FROM projects WHERE id=18281400 OR id=18533076 OR id=55773517;
    id    |                              url
----------+---------------------------------------------------------------
 18281400 | https://api.github.com/repos/cirosantilli/test-streak-puppet2
 18533076 | https://api.github.com/repos/phy1729/illacceptanything
 55773517 | https://api.github.com/repos/nhm150030/jfox
(3 rows)

Having compared the dates in the CSV file with the commit log in these projects. The dates are indeed wrong. Similarly, I also found these,

$ grep "2028-12-31 23:00:00" commits.csv
239657882,"182ef308602eabf227265eb2a5034ede567dccfa",4498938,4498938,18281400,"2028-12-31 23:00:00"
242561682,"cd952389edd54aa00cd3a98382da1193a1fc6d36",1340592,1340592,18507496,"2028-12-31 23:00:00"

In fact, if we group by the duration of commits of projects where the duration is the years between the oldest and the newest commits of a project and count the number of projects in a group, I got the following,

 duration | count
----------+-------
        0 |   492
        1 |   632
        2 |   414
        3 |   363
        4 |   325
        5 |   321
        6 |   277
        7 |   214
        8 |   202
        9 |   150
       10 |   102
       11 |   106
       12 |    89
       13 |    82
       14 |    93
       15 |    60
       16 |    63
       17 |    51
       18 |    49
       19 |    38
       20 |    33
       21 |    24
       22 |     8
       23 |    11
       24 |    14
       25 |    13
       26 |     8
       27 |     4
       28 |     1
       29 |     2
       32 |     1
       33 |     1
       34 |     3
       36 |     1
       38 |     2
       40 |     3
       42 |     1
       43 |     1
       44 |     9
       45 |     7
       46 |    15
       47 |    10
       48 |     2
       49 |     7
       58 |     1
       68 |     2
(46 rows)

where count is the number of projects whose duration is given on the left column.

stuartbates commented 3 years ago

From the FAQ page:

I've seen weird commit timestamps

Git records the commit timestamp on the developer's workstation. If the clock is misconfigured, timestamps will be weird.

We have seen timestamps such as 0000-01-01 00:00 or 2034-12-31 23:59. GitHub and GHTorrent do not process the timestamps in any way.

One of the projects you referenced has future dates set against the authored at timestamps e.g. 2046-09-18T00:00:00Z

This problem affects all data on GH itself - here's a test commit I just made in 2021.

Unfortunately, I don't think there's anything we can do about it. At least instances of this should be pretty rare.