orcasound / orcanode

Software for live-streaming and recording lossy (AAC) or lossless compressed audio (HLS, DASH, FLAC) via AWS S3 buckets.
GNU Affero General Public License v3.0

Move towards human-readable timestamps in audio filenames and/or directory names #7

Open scottveirs opened 6 years ago

scottveirs commented 6 years ago

In the long run, it would be valuable to stream and archive the Orcasound acoustic data with a NIST-synchronized timebase encoded in both the FLAC files and possibly also the HLS/DASH stream manifest and/or segments. If adjacent hydrophones (within earshot of each other) are synchronized with millisecond to microsecond precision, then we will be able to localize sounds with an accuracy that will help us learn more about biology: e.g. direction a soniferous animal is moving, location of a sound source, or identity of a signaler.

To this end, the shell script might be adapted (along with changes to how the player stays current) from its current syntax --

timestamp=$(date +%s)

-- to syntax such as:

timestamp=$(date +\%Y-\%m-\%d)

code source and snippet:

$ rsync -avz --delete --backup --backup-dir="backup_$(date +\%Y-\%m-\%d)" /source/path/ /dest/path

> By using $(date +\%Y-\%m-\%d) I’m telling it to use today’s date in the folder name.
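For anyone who needs to read the existing epoch-named directories in the meantime, here is a minimal sketch of the conversion, assuming the directory name is exactly what `date +%s` printed (seconds since 1970-01-01 UTC). The function name `epoch_to_readable` and the output pattern are illustrative choices, not something orcanode defines.

```python
from datetime import datetime, timezone


def epoch_to_readable(epoch_seconds: int) -> str:
    """Render a Unix epoch value (what `date +%s` prints) as a UTC date-time string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    # Hyphens and an underscore only: no colons, so the name is safe on Windows too.
    return moment.strftime("%Y-%m-%d_%H-%M-%S")


print(epoch_to_readable(1607556136))  # 2020-12-09_23-22-16
```

Passing `tz=timezone.utc` explicitly keeps the result independent of the time zone configured on the machine doing the conversion.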

scottveirs commented 3 years ago

It would be even better, as Paul pointed out on Slack recently, to get rid of the datetime-stamped S3 objects (akin to directories) and just store all data under a nodename with each data filename incorporating a NIST-synchronized timestamp.

We could get HLS segments to match the filename format of the FLAC files, which in the archive-orcasound-net bucket currently look something like: 2020-12-09_23-22-16_rpi_orcasound_lab--2.flac
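As a rough illustration of how machine-readable the current FLAC names already are, the sketch below splits one into a timestamp and the remaining label. It assumes every name follows the single example above; the regular expression and the function name `split_flac_name` are hypothetical. The name carries no time zone, so the result is deliberately left as a naive datetime.

```python
import re
from datetime import datetime

# Leading "YYYY-MM-DD_HH-MM-SS", then an underscore, a free-form label, and ".flac".
NAME_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})_(.+)\.flac$")


def split_flac_name(filename: str):
    """Return (naive datetime, label) parsed from an archive-style FLAC filename."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename!r}")
    stamp, label = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d_%H-%M-%S"), label


when, label = split_flac_name("2020-12-09_23-22-16_rpi_orcasound_lab--2.flac")
print(when.isoformat(), label)  # 2020-12-09T23:22:16 rpi_orcasound_lab--2
```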

Or we could align with ONC or OOI filename formats:

OOI: OO-HYVM2--YDH-2017-08-21T00_02_42.437000.mseed
ONC: ICLISTENHF1293_20171226T145827.651Z.wav

mcshicks commented 3 years ago

I have this working now based on Paul's suggestion of using "-strftime 1" and modifying stream.sh (for research) to use this filename: "/tmp/$NODENAME/hls/$timestamp/%Y-%m-%d_%H-%M-%S.ts".

valentina-s commented 3 years ago

I think the more standard format is %Y-%m-%dT%H:%M:%S.ts, i.e. colons for the hours, and T instead of the _ (space is also used but bad for filenames). Also, what about milliseconds? I agree timezone indication will be good since I am never sure whether it is Greenwich time or local time.

ISO-8601 2021-08-11T18:01:50+00:00
UTC 2021-08-11T18:01:50Z

@Molkree you want to add your comments on the format?

mcshicks commented 3 years ago

I tried %Y-%m-%dTH:%M:%S.ts instead of %Y-%m-%d_%H-%M-%S.ts and I could not get the player to work. Not sure if it's unhappy with the : or the TH (probably the :), but ffmpeg does write the files fine. I can look into milliseconds, but I think the rpi's time is probably only accurate to maybe 10 ms? It uses NTP to sync time.

paulcretu commented 3 years ago

ISO 8601 is a good idea, the full thing with timezone is %Y-%m-%dT%H:%M:%S%z. The problem is colons won't work on some filesystems (Windows), not sure if that has anything to do with it not working for you @mcshicks. I would propose something like %Y-%m-%d_%H-%M-%S_%Z (2021-08-12_20-52-09_UTC). It's readable, portable, and easy-ish to translate into ISO 8601.

The timezone could be easier to translate with %z (e.g. 2021-08-12_20-52-09+0000) since you wouldn't have to look up the abbreviation (like PDT in 2021-08-12_20-52-09_PDT). But there might be some cases where the + is a problem, and with negative offsets, it's a bit confusing to have the - (2021-08-12_20-52-09-0700). It would be nicest to get 2021-08-12_20-52-09Z for UTC and +0000 offset notation for other timezones but that doesn't seem to be an option with strftime.
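The "Z for UTC, numeric offset otherwise" behaviour is easy to get once the name is built in code rather than by a single strftime pattern. This is a minimal Python sketch of that idea, not a change to stream.sh; the function name `portable_stamp` is hypothetical, and it assumes the caller passes a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def portable_stamp(moment: datetime) -> str:
    """Format as YYYY-MM-DD_HH-MM-SS plus 'Z' for UTC or a +hhmm/-hhmm offset."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    base = moment.strftime("%Y-%m-%d_%H-%M-%S")
    if moment.utcoffset() == timedelta(0):
        return base + "Z"
    return base + moment.strftime("%z")


utc_time = datetime(2021, 8, 12, 20, 52, 9, tzinfo=timezone.utc)
pdt_time = datetime(2021, 8, 12, 20, 52, 9, tzinfo=timezone(timedelta(hours=-7)))
print(portable_stamp(utc_time))  # 2021-08-12_20-52-09Z
print(portable_stamp(pdt_time))  # 2021-08-12_20-52-09-0700
```

The second output shows the readability concern raised above: a negative offset is visually hard to tell apart from the hyphens that separate the time fields.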

Molkree commented 3 years ago

> The problem is colons won't work on some filesystems (Windows)

Haha, actually you can't even upload such files using actions/upload-artifact#35 in GitHub workflows. I used colons at first but then changed it to this %Y-%m-%dT%H-%M-%S-%f

Haven't thought about timezone, I just used UTC everywhere I believe. If you do add it to the filename I'd also prefer +0000. If extra - at the end looks confusing, can always add delimiter like TZ or something (2021-08-12_20-52-09TZ-0100).

valentina-s commented 3 years ago

I did not think about the colons. The OOI Archive has them but I guess this causes issues for some users. The format without dashes and colons, %Y%m%dT%H%M%S%z, is also supported by ISO 8601. I wonder if the player can handle that? It may be less human-readable but is still machine-readable. I am more biased toward using something standard. The fractions are expected to be delimited with dots (or commas), to distinguish 01.05 (1 h 3 min) from 01:05 (1 h 5 min). If there are no dashes before, maybe then the -/+ of the timezone will be more obvious. Is the local timezone preferred? It is only one but it may not be obvious to a non-local person.

Molkree commented 3 years ago

> Is the local timezone preferred? It is only one but it may not be obvious to a non-local person.

Right now we use Unix time so I'd prefer to stay with UTC. Not specifying time zone implies local time so fully compliant ISO 8601 UTC time without colons would look like 20210812T205209+0000, 20210812T205209+00 or 20210812T205209Z.
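A quick way to check that such colon-free names stay machine-readable is to parse them back. The sketch below assumes Python 3.7 or later, where the `%z` directive of `strptime` accepts both a `+hhmm` offset and a literal `Z`; the two-digit `+00` form is not covered here. The function name `parse_basic_utc` is hypothetical.

```python
from datetime import datetime, timezone


def parse_basic_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 basic-format timestamp with an offset and normalize it to UTC."""
    moment = datetime.strptime(stamp, "%Y%m%dT%H%M%S%z")
    return moment.astimezone(timezone.utc)


for text in ("20210812T205209+0000", "20210812T205209Z", "20210812T135209-0700"):
    print(text, "->", parse_basic_utc(text).isoformat())
```

All three inputs describe the same instant, so each line prints the same UTC value.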

I personally don't care that much about strict standard adherence in this case and would prefer something more readable but still in UTC.

scottveirs commented 1 year ago

@tsuize @veirs this is the HLS timestamp issue I was seeking on today's call. I think we should tackle this formatting decision this winter, adjust the orcanode code accordingly, and then fix everything that we're going to break, including at least:

scottveirs commented 1 year ago

After looking at MBARI's Pacific Sound open data registry a bit, they seem to be using something like this:

2017-06-13T16:00:00

and John Ryan confirms via Slack that this is relying on the convention of scientific timestamps being assumed to be in the UTC time zone.

Personally, I find the ambiguity unnerving enough that I think it's worth resolving with the extra 3 characters +00...

So, I'd propose one of the following options:

  1. 20170613T160000+00
  2. 20170613-160000+00 which I find just barely human-readable enough
  3. 2017-06-13T16-00-00+00
  4. 2017-06-13_16-00-00+00 which I feel is the most human-readable while avoiding colons (:)

Or just use Modified Julian Date (MJD) for the filenames and utilize existing packages to decode into human-readable formats if/when necessary.

Opinions?
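For the Modified Julian Date option, the conversion is short enough to sketch without an extra package. MJD counts days since 1858-11-17T00:00 UTC, which puts the Unix epoch at MJD 40587. The function names are hypothetical, and the sketch inherits Unix time's handling of leap seconds (it ignores them).

```python
from datetime import datetime, timezone

MJD_OF_UNIX_EPOCH = 40587  # 1970-01-01T00:00:00Z expressed as an MJD
SECONDS_PER_DAY = 86400


def to_mjd(moment: datetime) -> float:
    """Convert a timezone-aware datetime to a fractional Modified Julian Date."""
    return moment.timestamp() / SECONDS_PER_DAY + MJD_OF_UNIX_EPOCH


def from_mjd(mjd: float) -> datetime:
    """Convert a fractional Modified Julian Date back to a UTC datetime."""
    seconds = (mjd - MJD_OF_UNIX_EPOCH) * SECONDS_PER_DAY
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


example = datetime(2017, 6, 13, 16, 0, 0, tzinfo=timezone.utc)
print(f"{to_mjd(example):.5f}")  # 57917.66667
```

A double-precision MJD near 58000 resolves to roughly a microsecond, so a filename would need about eleven decimal places to keep that precision, which costs most of the readability this issue is trying to gain.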

scottveirs commented 1 year ago

Also, we should test whether we can ensure ffmpeg can write a file with data starting at YYMMDD-HHMMSS precisely (to the nearest 10 or 100 microseconds). Otherwise we may need or want to add precision within the filename, i.e. precision high enough for any future localization efforts (e.g. 10 or 100 microseconds?).
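To put those precision figures in context, a timing error between two hydrophones translates into a path-length error of roughly sound speed times timing error. The sketch below assumes a nominal 1500 m/s for seawater (the real value varies with temperature, salinity, and depth); the function name is hypothetical.

```python
SOUND_SPEED_M_PER_S = 1500.0  # assumed nominal sound speed in seawater


def range_error_m(timing_error_s: float) -> float:
    """Path-length error caused by a given clock or file-start timing error."""
    return SOUND_SPEED_M_PER_S * timing_error_s


for label, seconds in [("10 ms", 10e-3), ("1 ms", 1e-3), ("100 us", 100e-6), ("10 us", 10e-6)]:
    print(f"{label:>7}: {range_error_m(seconds):.3f} m")
```

Under that assumption, a 10 ms uncertainty corresponds to about 15 m of path-length error, while 10 microseconds corresponds to about 1.5 cm.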

scottveirs commented 1 year ago

@ben-hendricks shared on a call today that the BC Hydrophone Network uses a custom driver to generate timestamps from their icListen hydrophones in this format:

ICLISTENHF1281_20190704T085500.000Z_20190704T090000.000Z.flac

Where 1281 is the instrument ID (serial number?) and the .000 suffix is precision in seconds.

The archived format for processed calibrated noise level files assumes the user knows the timestamp is in UTC time zone, so ends up as (or close to?):

1281_20190704T085500.wav

scottveirs commented 1 year ago

> Also, we should test whether we can ensure ffmpeg can write a file with data starting at YYMMDD-HHMMSS precisely (to the nearest 10 or 100 microseconds). Otherwise we may need or want to add precision within the filename, i.e. precision high enough for any future localization efforts (e.g. 10 or 100 microseconds?).

Related to this @ben-hendricks also made a good point that -- if possible -- it's ideal to have different nodes start their recordings on the minute (or they use a 5-minute interval) so that file names and time intervals end up being consistent across the network. This allows a direct request for a matching file, rather than a search through ~20k files for the desired matching time period from another location (e.g. for localization).
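The "direct request" idea can be sketched as follows: if every node starts files on shared boundaries, the name of the matching file at another node can be computed from any timestamp instead of searched for. The node label, the 5-minute default, and the `<node>_<start>Z.flac` pattern are all hypothetical, and the input is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def interval_start(moment: datetime, minutes: int = 5) -> datetime:
    """Round a datetime down to the start of its recording interval."""
    step = timedelta(minutes=minutes)
    midnight = moment.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((moment - midnight) // step) * step


def expected_name(node: str, moment: datetime, minutes: int = 5) -> str:
    """Build the filename that should contain `moment` at the given node."""
    start = interval_start(moment.astimezone(timezone.utc), minutes)
    return f"{node}_{start.strftime('%Y%m%dT%H%M%S')}Z.flac"


heard_at = datetime(2019, 7, 4, 8, 57, 31, tzinfo=timezone.utc)
print(expected_name("node_a", heard_at))  # node_a_20190704T085500Z.flac
print(expected_name("node_b", heard_at))  # node_b_20190704T085500Z.flac
```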

ben-hendricks commented 1 year ago

As a comment on @scottveirs's suggestion regarding filename convention and time synchronization: a change in filename convention is usually a small step from a coding perspective. Synchronizing recording periods gave our coding team some headaches, because we also wanted to be sure that all files have a predictable length (those with a different length were renamed so that a search algorithm could filter them). However, in our experience the benefits outweigh the costs: a) it is a virtual requirement for cross-correlating and localizing transient signals; b) any match between a timestamp and a corresponding audio file can be made instantaneously.
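To illustrate why aligned file starts matter for cross-correlation, here is a toy, pure-Python sketch that finds the sample lag at which a transient in one recording best matches the same transient in another. The signals are synthetic and the function name `best_lag` is hypothetical; real data would call for an FFT-based correlation on much longer buffers.

```python
def best_lag(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `reference`."""

    def score(lag):
        # Sum of products over the overlapping samples for this trial lag.
        return sum(
            reference[i] * delayed[i + lag]
            for i in range(len(reference))
            if 0 <= i + lag < len(delayed)
        )

    return max(range(-max_lag, max_lag + 1), key=score)


click = [0.0, 1.0, -1.0, 0.5, 0.0]
near = [0.0] * 3 + click + [0.0] * 12  # transient begins at sample 3
far = [0.0] * 10 + click + [0.0] * 5   # same transient, 7 samples later
print(best_lag(near, far, max_lag=10))  # 7
```

The lag only converts into an arrival-time difference if both buffers begin at the same known instant, which is what synchronized recording periods provide.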

scottveirs commented 1 year ago

Great advice @ben-hendricks . Thanks for sharing insights from the BC Hydrophone Network!

I've created two orcanode issues based on your input:

scottveirs commented 1 year ago

> @ben-hendricks shared on a call today that the BC Hydrophone Network uses a custom driver to generate timestamps from their icListen hydrophones in this format:
>
> ICLISTENHF1281_20190704T085500.000Z_20190704T090000.000Z.flac
>
> Where 1281 is the instrument ID (serial number?) and the .000 suffix is precision in seconds.
>
> The archived format for processed calibrated noise level files assumes the user knows the timestamp is in UTC time zone, so ends up as (or close to?):
>
> 1281_20190704T085500.wav

These details ^^^ from Ben may be of interest @valentina-s @savageGrant @CaseCal @mitchhaldeman

scottveirs commented 1 year ago

@ben-hendricks Can you confirm/deny that the .000 part of the ICLISTEN file name is precision in seconds (rather than an indication of zero hours offset from UTC (Z) time)?

CaseCal commented 1 year ago

Thanks @scottveirs and @ben-hendricks, this is helpful and timely as we're just developing our file naming and access tool.

I notice in that example that the .flac file contains a start and end time, while the wav file has just a start time. Is there any standard or preference to including only start time, start time and end time, or start time and duration? Especially as we gear towards efficient storage in our own project, we may not have conveniently sized archive file durations.

My thought is that having start time and end time makes it the easiest to scan files for a specific timestamp or period, but it also starts to become somewhat verbose.
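The scanning benefit of having both times in the name can be sketched like this: the interval a file covers is recoverable from the name alone, without opening the audio. The pattern is inferred from the single BCHN example quoted earlier in the thread and may not cover every real filename; the function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# "_<start>Z_<end>Z." where each stamp looks like 20190704T085500.000
SPAN_PATTERN = re.compile(r"_(\d{8}T\d{6}\.\d{3})Z_(\d{8}T\d{6}\.\d{3})Z\.")


def file_span(filename: str):
    """Return (start, end) as UTC datetimes parsed from a start/end style filename."""
    match = SPAN_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no start/end timestamps found in {filename!r}")
    start, end = (
        datetime.strptime(text, "%Y%m%dT%H%M%S.%f").replace(tzinfo=timezone.utc)
        for text in match.groups()
    )
    return start, end


def covers(filename: str, moment: datetime) -> bool:
    """True if `moment` falls inside the half-open interval [start, end) of the file."""
    start, end = file_span(filename)
    return start <= moment < end


name = "ICLISTENHF1281_20190704T085500.000Z_20190704T090000.000Z.flac"
print(covers(name, datetime(2019, 7, 4, 8, 57, tzinfo=timezone.utc)))  # True
```

With only a start time in the name, the same check needs either a fixed, known file duration or a look at the next file's start.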

ben-hendricks commented 1 year ago

Hi all,

I would not worry about verbosity … as in the end most files are typically handled by an algorithm. The caveat (or one of them) for including end-time is that the filename is created when the file/recording is created at which point the end time is not known. So your logging algorithm could either

In any case your algorithm would have to read/write to the filename twice, not only once, as far as I see it.

Cheers, Ben


scottveirs commented 1 year ago

> @ben-hendricks Can you confirm/deny that the .000 part of the ICLISTEN file name is precision in seconds (rather than an indication of zero hours offset from UTC (Z) time)?

Thanks to facilitation by @ben-hendricks , Tom Dakin confirms via email:

Yes the .000 are milliseconds.

scottveirs commented 1 year ago

Noting that MANTA (Matlab-based noise analysis software) says this about datetime formats:

The preferred time/date format in the filename is yyyymmdd_HHMMSS (HHMMSS.FFF is also acceptable).

The date/time information can be located at any position within the filename. To aid users in renaming their acoustic data files to be compatible with MANTA software, a file renaming tool (Sox-o-matic) is available from The Cornell Lab of Ornithology Center for Conservation Bioacoustics:

Sox-o-matic Wiki: https://bitbucket.org/CLO-BRP/sox-o-matic/wiki/Home

Sox-o-matic Software download: https://www.birds.cornell.edu/ccb/sox-o-matic/
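To check whether a candidate naming scheme would fit that description, something like the following sketch could be used. It approximates the quoted rule (yyyymmdd_HHMMSS, optionally followed by .FFF, anywhere in the name) with a regular expression; it is not MANTA's own parser, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Eight digits, underscore, six digits, optional three-digit fraction,
# not embedded inside a longer run of digits.
MANTA_STAMP = re.compile(r"(?<!\d)(\d{8}_\d{6}(?:\.\d{3})?)(?!\d)")


def find_manta_stamp(filename: str):
    """Return the first yyyymmdd_HHMMSS[.FFF] timestamp in the name, or None."""
    match = MANTA_STAMP.search(filename)
    if match is None:
        return None
    text = match.group(1)
    pattern = "%Y%m%d_%H%M%S.%f" if "." in text else "%Y%m%d_%H%M%S"
    return datetime.strptime(text, pattern)


print(find_manta_stamp("rpi_orcasound_lab_20190704_092314.flac"))
print(find_manta_stamp("2020-12-09_23-22-16_rpi_orcasound_lab--2.flac"))  # None
```

The second call shows that the current hyphenated FLAC names would not match this pattern without renaming.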

scottveirs commented 1 year ago

> Also, we should test whether we can ensure ffmpeg can write a file with data starting at YYMMDD-HHMMSS precisely (to the nearest 10 or 100 microseconds). Otherwise we may need or want to add precision within the filename, i.e. precision high enough for any future localization efforts (e.g. 10 or 100 microseconds?).

See Steve's thoughts in this other orcanode issue for more info about achieving high precision with ffmpeg...

scottveirs commented 1 year ago

Comparing readability of these two options, for fun:

20190704T085500.000Z (BCHN format)
20190704_092314.000Z (Proposed Orcasound format)

And noting that OOI added a lot of precision beyond MBARI, but neither added a Z or +00...

2017-06-13T16:00:00 (MBARI format, relying on convention of scientific timestamps defaulting to UTC time zone)
2021-08-04T00:20:00.000015 (OOI)
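Any code that reads such unmarked timestamps has to make the UTC convention explicit somewhere. A minimal sketch of that step, assuming the "no zone means UTC" convention described above holds for the data source; the function name `as_utc` is hypothetical.

```python
from datetime import datetime, timezone


def as_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 extended timestamp, treating a missing zone as UTC."""
    moment = datetime.fromisoformat(stamp)
    if moment.tzinfo is None:
        # Assumption: unmarked scientific timestamps are UTC by convention.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc)


print(as_utc("2017-06-13T16:00:00").isoformat())         # 2017-06-13T16:00:00+00:00
print(as_utc("2021-08-04T00:20:00.000015").isoformat())  # 2021-08-04T00:20:00.000015+00:00
```

If the convention is ever wrong for a given source, this function silently produces the wrong instant, which is the ambiguity an explicit Z or +00 in the filename would remove.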