Where does the time interval in the recordings come from?

kbjarkefur commented 6 years ago

Thank you for this great tool! We are using the Science Journal to collect meta data during data collection on tablets. The meta data will answer questions about the environment the data was collected in: dark or light, loud or quiet, and so on.

The data collection sessions during which we record meta data are sometimes hours long, so we are looking into ways to generate summary stats of the sensor data over a period of one or a few seconds and save those to file. But in order to decide how best to do this, we first wanted to understand the nature of the sensor data better, so we ran a pilot.

We piloted this using the Science Journal app, and it looks promising. When exploring our pilot data, however, we saw both a pattern and randomness in the duration between recordings, and we would like to know if anyone in this forum can shed some light on why. See the two graphs below for one very clear example. The left is a histogram of the duration between consecutive recordings of pitch, and the right is a scatter plot with the duration since the previous recording on the y-axis and time into the recording on the x-axis. The histogram makes clear that durations at multiples of 40 ms are important. But there is also a non-negligible number of recordings that do not follow that pattern. We do not want to disregard them, because they might be correlated with something we want to measure in our environment meta data.

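[Figure: datanum10_pitch. Left: histogram of the duration between consecutive pitch recordings. Right: scatter plot of duration since the previous recording (y-axis) against time into the recording (x-axis).]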
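To make the pattern described above concrete, here is a minimal sketch of how the gaps between recordings could be split into those that sit on a 40 ms grid and those that do not. It assumes the timestamps have been exported as integer milliseconds. The function name `interval_summary`, the 2 ms tolerance, and the sample timestamps are all made up for illustration and are not taken from the Science Journal code or from our pilot data.

```python
from collections import Counter


def interval_summary(timestamps_ms, base_ms=40, tolerance_ms=2):
    """Split gaps between consecutive samples into on-grid and off-grid.

    A gap is "on grid" when it lies within tolerance_ms of a positive
    whole multiple of base_ms. Both defaults are illustrative assumptions.
    """
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    on_grid = Counter()   # multiple of base_ms -> number of gaps
    off_grid = []         # gaps that do not match any multiple
    for gap in gaps:
        multiple = round(gap / base_ms)
        if multiple >= 1 and abs(gap - multiple * base_ms) <= tolerance_ms:
            on_grid[multiple] += 1
        else:
            off_grid.append(gap)
    return on_grid, off_grid


# Hypothetical timestamps, not real pilot data.
timestamps_ms = [0, 40, 80, 160, 173, 200, 240, 321]
on_grid, off_grid = interval_summary(timestamps_ms)
print("on-grid counts by multiple:", dict(on_grid))
print("off-grid gaps (ms):", off_grid)
```

Keeping the off-grid gaps as a separate list, rather than discarding them, makes it possible to check later whether they line up with anything in the environment.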

Is this something that depends on the device? We have tried a couple of devices, and while some details differ, we get the same big-picture results. In the example I used for the graph above, the durations are quite uniform throughout the recording, as can be seen in the scatter plot. But I have other graphs where the scatter plot shows that different parts of the recording have different patterns, or more or less distinct patterns.

We have also been wondering whether this is due to a delivery rate setting for sensor capture in the Science Journal app's implementation. But that would not explain where the recordings with random durations come from, or where the value of 40 ms comes from. For the light sensor the corresponding value is about 100 ms, and for the accelerometer it is about 60 ms.

Any advice on this would be highly appreciated. I am happy to provide more information if needed, but this post is already very long. Sorry if it is too long.

Once we know what we want to do, we will incorporate the open source code into our own app, but until then we are running the Science Journal app separately.

Suggestions of completely different but better ways to reduce the size of the data are very welcome. We need the data to be smaller than the raw data in the Science Journal app, and we need the unit of observation to be uniform in time somehow.
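As a rough illustration of the kind of reduction we have in mind, the sketch below groups raw samples into fixed one-second windows and keeps only a few summary stats per window. It assumes each sample is a `(timestamp_ms, value)` pair. The function name `summarize_windows`, the choice of stats, and the sample values are placeholders for illustration, not a finished design.

```python
from statistics import mean


def summarize_windows(samples, window_ms=1000):
    """Reduce (timestamp_ms, value) samples to one summary row per window.

    Windows that contain no samples are omitted from the result, so the
    caller can tell "no data" apart from "a value of zero".
    """
    buckets = {}
    for t, v in samples:
        # Integer division assigns each sample to the window it falls in.
        buckets.setdefault(t // window_ms, []).append(v)
    return [
        {
            "window_start_ms": k * window_ms,
            "n": len(vs),
            "mean": mean(vs),
            "min": min(vs),
            "max": max(vs),
        }
        for k, vs in sorted(buckets.items())
    ]


# Hypothetical samples, not real pilot data.
samples = [(0, 1.0), (400, 3.0), (990, 2.0), (1500, 10.0), (3100, 4.0)]
rows = summarize_windows(samples)
for row in rows:
    print(row)
```

Storing the sample count `n` alongside each window keeps the information about how unevenly the recordings arrived, which is the part we do not yet understand.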

dsaff commented 6 years ago

Hi, Kristoffer!

Sorry for the delay, I'm the main person who watches this issue list, and have been on vacation.

The short answer is that we are not attempting to make real-time guarantees. For example, the time that pitch analysis takes can vary a fair bit depending on the form of the input.

Can you give a bit more information about how this variability affects you? For example, are you trying to compute an average, and want to make sure it's accurate?

Thanks!

David

chrislrobert commented 6 years ago

Thanks, David. I'll jump in for Kristoffer because I think that he's also traveling just now.

Our interest is in converting the non-uniform data stream into a more uniform data stream that can, for example, be averaged without some periods (with frequent observations) being weighted more heavily than others (with less frequent observations). Our first step was to try to understand the original source of the heterogeneity in period length, so that we could then take the right approach to sampling down to a more uniform stream of observations.
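To illustrate the weighting concern, here is a minimal sketch comparing a plain average with a time-weighted one. It assumes each reading stays valid until the next reading arrives (a "hold last value" assumption that may or may not suit a given sensor), so the final reading gets no weight. The function name `time_weighted_mean` and the sample values are hypothetical.

```python
def time_weighted_mean(samples):
    """Average (timestamp_ms, value) samples, weighting by elapsed time.

    Each value is weighted by the time until the next sample, assuming
    the reading holds until then. Samples must be sorted by timestamp.
    """
    total = 0.0
    duration = 0
    for (t0, v0), (t1, _) in zip(samples, samples[1:]):
        dt = t1 - t0
        total += v0 * dt
        duration += dt
    if duration == 0:
        raise ValueError("need at least two samples with distinct timestamps")
    return total / duration


# Hypothetical stream: a burst of frequent readings, then a long quiet gap.
samples = [(0, 10.0), (10, 10.0), (20, 10.0), (30, 10.0), (40, 0.0), (400, 0.0)]
naive = sum(v for _, v in samples) / len(samples)
weighted = time_weighted_mean(samples)
print("naive mean:", naive)
print("time-weighted mean:", weighted)
```

In this made-up stream the plain average is pulled toward the burst of frequent readings, while the time-weighted average reflects that the value was 0 for most of the elapsed time.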

Also, BTW, we don't actually see the pitch code here in the repo, so we've been collecting pitch but can't see how the underlying stream is constructed. Perhaps we just missed it, or it's in a version that hasn't been committed to this repo?

Thanks again,

Chris

dsaff commented 6 years ago

Chris,

Re the pitch code, we're behind on updating the repo, and unfortunately probably won't catch up for another month.

David

chrislrobert commented 6 years ago

Okay, thanks for letting me know. We were kind of tearing our hair out trying to figure out why what we were seeing in the app and what we were seeing in the code seemed to be so different.

If you have any interest in the further-afield ways your Science Journal work might be benefiting the world, you can see my recent blog post on our machine-learning roadmap. Sensor streams play a key role:

Enriching non-PII meta-data with machine-learning algorithms in mind. Some meta-data, like random audio recordings, might be very useful to human reviewers but not very useful to machine-learning algorithms. On the other hand, some other meta-data, like data from device sensors, might be very useful to machine-learning algorithms but less so to humans. We’re actively researching and experimenting with sensor data streams, in order to add new meta-data options to SurveyCTO. We’re particularly interested in ways that we can condense sensor data down into non-PII (non-personally-identifiable) statistics that might help machine-learning algorithms predict the quality of a submission without posing any risk of revealing sensitive data.

This is obviously pretty far from science education, but hopefully you find socially-valuable spin-off efforts a positive thing.

Thanks again for letting us know,

Chris
