onnela-lab / forest

Forest is a library for analyzing smartphone-based high-throughput digital phenotyping data
https://forest.beiwe.org
BSD 3-Clause "New" or "Revised" License

Output type is float for `steps` column in Oak CSVs #226

Open biblicabeebli opened 1 year ago

biblicabeebli commented 1 year ago

I believe the issue where steps is output as a float is the following line:

steps_daily = np.full((len(days), 1), np.nan)

I'm not fluent enough in numpy to be sure, but I do know that nan implies a floating-point type (it's a float-only concept), and a numpy array holds a single type by construction.
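To illustrate the point about `np.nan` forcing a float array, here is a minimal sketch. The value `n_days` is a hypothetical stand-in for `len(days)`, and the step count is made up for the example.

```python
import numpy as np

n_days = 3  # hypothetical stand-in for len(days)

# Filling with np.nan makes numpy infer a floating-point dtype.
steps_daily = np.full((n_days, 1), np.nan)
print(steps_daily.dtype)

# Any integer written into the array is stored as a float.
steps_daily[0, -1] = 1200  # made-up step count
stored = steps_daily[0, -1]
print(type(stored).__name__, stored)
```

Days that are never assigned stay as nan, which is presumably how missing data is marked here.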

I tried the following change:

summary_stats = pd.DataFrame({
    'date': days.strftime('%Y-%m-%d'),
    'walking_time': walkingtime_daily[:, -1],
    'steps': steps_daily[:, -1].astype(int),
    'cadence': cadence_daily[:, -1]})

but this emits a runtime warning, `RuntimeWarning: invalid value encountered in cast`, pointing at the line `'steps': steps_daily[:, -1].astype(int),`.
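The warning appears because nan has no integer representation, so the result of the cast for those entries is undefined. One possible alternative, offered only as a sketch and not as the approach Forest uses, is pandas' nullable `Int64` dtype, which keeps missing days as missing while writing whole numbers to the CSV. The dates and step counts below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values; the middle day has no data.
steps_daily = np.array([[1200.0], [np.nan], [834.0]])

# "Int64" (capital I) is pandas' nullable integer dtype: nan becomes <NA>.
steps = pd.Series(steps_daily[:, -1]).astype("Int64")

summary_stats = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],  # made-up dates
    "steps": steps,
})

csv_text = summary_stats.to_csv(index=False)
print(csv_text)
```

This conversion only succeeds when every non-missing value is already a whole number, so it would need a rounding step first if the step counts can be fractional.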

Changing the declaration like this makes that warning go away:

steps_daily = np.full((len(days), 1), 0) 

But this officially puts me in "I don't know if this has side effects" territory.
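Two side effects of filling with `0` can be checked in isolation. The sketch below uses made-up values, and it assumes (unconfirmed in this thread) that some days may never be assigned a step count and that computed step counts may be fractional.

```python
import numpy as np

n_days = 3  # hypothetical stand-in for len(days)

# Filling with the integer 0 yields an integer array.
steps_daily = np.full((n_days, 1), 0)
is_integer = np.issubdtype(steps_daily.dtype, np.integer)

# Side effect 1: a fractional value is silently truncated on assignment.
steps_daily[0, -1] = 1200.7  # made-up value

# Side effect 2: a day that is never assigned reads as 0 steps,
# which looks the same as a day with data but no walking.
unassigned_day = steps_daily[1, -1]

print(is_integer, steps_daily[0, -1], unassigned_day)
```

If downstream code relies on nan to tell "no data" apart from "zero steps", that distinction is lost with this change.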

(I'm currently ignoring the separate code path in here for hourly steps, but I assume the same change applies there.)

This is all in oak.base.run

biblicabeebli commented 1 year ago

This issue isn't critical, but it does require a database migration over in beiwe if we ever want to change it.

hackdna commented 1 year ago

Could you provide some additional context as to why the float output type is an issue?

biblicabeebli commented 1 year ago

@hackdna this was a question that came from https://github.com/onnela-lab/beiwe-discussions/issues/109

The determination can be "float is fine", but the value does come out the other end of a serializer with an API in float form, despite the concept (step count) being an integer. (It is a float in the CSV, but the value is always a whole number.)

biblicabeebli commented 1 year ago

I actually forgot to look at the output file from the proposed fix; I was just iterating, trying to work out the numpy/pandas way to do it.