glennon / GeyserTimes-Science

Resources to support the scientific use of the GeyserTimes platform

Some Questions About the Data #32

Open idontgetoutmuch opened 6 years ago

idontgetoutmuch commented 6 years ago

I have downloaded the historical dataset for Old Faithful: http://www.geysertimes.org/archive/geysers/Old_Faithful_eruptions.tsv.gz. I am struggling to understand what some of the columns mean.

In particular, looking at the first row,

eruptionID  geyser  eruption_time_epoch has_seconds exact   ns  ie  E   A   wc  ini maj min q   duration    entrant observer    eruption_comment    time_updated    time_entered    associated_primaryID    other_comments  Old_Faithful_Preplay_Time_VEC   Old_Faithful_Height_VEC
23132   Old Faithful    10506540    0   1   0   0   0   0   0   0   1   0   0   4min    BoekelUpload    OFVCL-EV        1335129843  1335129843  23132   NULL    NULL    NULL

Did the dataset really begin at this time (eruption_time_epoch)?

*Main> epochToUTC 10506540
1970-05-02 14:29:00 UTC

And if so, what do time_updated and time_entered mean?

*Main> epochToUTC 1335129843
2012-04-22 21:24:03 UTC

Perhaps the data was collected in 1970 but only added to your excellent site in 2012?
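For reference, the definition of epochToUTC used in the GHCi sessions is not shown in this post. A minimal sketch, assuming it is just a thin wrapper around posixSecondsToUTCTime from the time package and that the epoch columns are whole seconds since 1970-01-01 00:00:00 UTC:

```haskell
import Data.Time.Clock (UTCTime)
import Data.Time.Clock.POSIX (posixSecondsToUTCTime)

-- Assumed definition: the post calls epochToUTC without showing it.
-- Interprets an integer as seconds since 1970-01-01 00:00:00 UTC.
epochToUTC :: Integer -> UTCTime
epochToUTC = posixSecondsToUTCTime . fromInteger
```

With this definition, `epochToUTC 10506540` shows as `1970-05-02 14:29:00 UTC`, matching the session output.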

By row 86360 the consistency(?) seems to have improved:

86360   Old Faithful    1310155500  0   1   0   0   0   0   0   0   0   1   0   1m46s   BoekelUpload    OFVCL-EV    (160+ft)    1352243080  1352243080  86360   NULL    NULL    NULL

So the eruption_time_epoch is

*Main> epochToUTC 1310155500
2011-07-08 20:05:00 UTC

and the time_updated and time_entered are

*Main> epochToUTC 1352243080
2012-11-06 23:04:40 UTC
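The duration column is also spelled differently in the two rows quoted above (4min versus 1m46s), so it needs normalising before any comparison. A rough sketch of a parser that returns seconds, assuming only the unit suffixes m, min and s occur; the real column may well contain other spellings, which this returns Nothing for. The name parseDuration is made up for illustration:

```haskell
import Data.Char (isDigit, isSpace)

-- Convert a duration string such as "4min" or "1m46s" to seconds.
-- Only the suffixes "m", "min" and "s" are recognised (an assumption
-- based on the two rows quoted in the post); anything else is Nothing.
parseDuration :: String -> Maybe Int
parseDuration raw = go (filter (not . isSpace) raw) 0 False
  where
    -- End of input: succeed only if at least one number/unit pair was read.
    go [] acc seen = if seen then Just acc else Nothing
    go s acc _ =
      case span isDigit s of
        ([], _) -> Nothing            -- expected a number here
        (ds, rest) ->
          let n = read ds
          in case span (not . isDigit) rest of
               (unit, rest')
                 | unit `elem` ["m", "min"] -> go rest' (acc + 60 * n) True
                 | unit == "s"              -> go rest' (acc + n) True
                 | otherwise                -> Nothing
```

For the two rows above, `parseDuration "4min"` gives `Just 240` and `parseDuration "1m46s"` gives `Just 106`.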

Also, the number of missing entries for duration seems to have increased over time. Is there any reason for this?
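One way to make "increased over time" concrete is to count missing durations per calendar year. A sketch under stated assumptions: the input is a list of (eruption_time_epoch, duration) pairs already extracted from the TSV, and a duration counts as missing when the field is empty or the literal NULL (I have not confirmed how the archive actually encodes missing values). The names yearOf, isMissing and missingByYear are hypothetical:

```haskell
import qualified Data.Map.Strict as Map
import Data.Time.Calendar (toGregorian)
import Data.Time.Clock (utctDay)
import Data.Time.Clock.POSIX (posixSecondsToUTCTime)

-- Calendar year (UTC) of an epoch timestamp given in seconds.
yearOf :: Integer -> Integer
yearOf t =
  let (y, _, _) = toGregorian (utctDay (posixSecondsToUTCTime (fromInteger t)))
  in y

-- Assumption: "missing" means an empty field or the literal string NULL.
isMissing :: String -> Bool
isMissing d = d `elem` ["", "NULL"]

-- For each year: (rows with a missing duration, total rows).
missingByYear :: [(Integer, String)] -> Map.Map Integer (Int, Int)
missingByYear rows =
  Map.fromListWith add
    [ (yearOf t, (fromEnum (isMissing d), 1)) | (t, d) <- rows ]
  where
    add (a, b) (c, e) = (a + c, b + e)
```

Dividing the first component by the second for each year would give the missing fraction, which is easier to compare across years than raw counts because the number of logged eruptions per year varies.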

I am trying to validate the dataset that is available in the R programming language (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/faithful.html), which I am beginning to suspect is not representative of Old Faithful's actual behaviour.

idontgetoutmuch commented 6 years ago

In case you are interested, this is what the R dataset looks like: [image: tests]

The x-axis is duration and the y-axis is gap between eruptions.