tbird20d / grabserial

Grabserial - python-based serial dump and timing program - good for embedded Linux development
GNU General Public License v2.0

Split files on start of hour #41

Closed: janipewter closed this issue 4 years ago

janipewter commented 4 years ago

Is it possible to automatically split on the start of the hour? I'm looking at logging some high frequency serial data from a scientific instrument and I'd like to split the files per hour, but it needs to be accurate.

So far the only way I can think of is to use -e 3600 and invoke grabserial from a cron job at the start of the hour or something, but I need the splitting to be really accurate as I will potentially be dealing with serial data coming in at 50Hz.

If anyone has any suggestions I would love to hear them! Thanks

AlphaSierraHotel commented 4 years ago

Do you need to split them at the time of capture? If not, I would suggest capturing data by date and then using a short script, run from cron once per day, to split the previous day's log into hourly chunks based on the timestamp in the data.
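
Roughly what I have in mind, as a sketch: it assumes each line starts with grabserial's bracketed timestamp prefix ("[YYYY-MM-DD HH:MM:SS.ffffff] "), and the output file names are just placeholders to adjust for your setup.

```python
#!/usr/bin/env python3
# Sketch only: split one day's grabserial log into hourly files based on
# the leading bracketed timestamp. Output naming here is made up.
import sys

def split_day_log_by_hour(day_log_path):
    handles = {}  # "YYYY-MM-DD_HH" -> open file handle
    with open(day_log_path) as day_log:
        for line in day_log:
            # "[2020-04-30 20:48:12.238364] ..." -> "2020-04-30_20"
            hour_key = line[1:14].replace(" ", "_")
            if hour_key not in handles:
                handles[hour_key] = open(hour_key + "_ttyUSB0.txt", "a")
            handles[hour_key].write(line)
    for handle in handles.values():
        handle.close()

if __name__ == "__main__":
    split_day_log_by_hour(sys.argv[1])
```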

janipewter commented 4 years ago

Ideally yes I need to split them at time of capture.

AlphaSierraHotel commented 4 years ago

Then I'd suggest you write a script to run once per hour to extract the previous hour's data from the log file or use your idea above to keep the capture timeframes to one hour each. Use -a to restart the capture.

tbird20d commented 4 years ago

Are you familiar with the -a option? This will automatically restart grabserial after the period specified by '-e', and if the log filename format includes the time, grabserial will write new data to a new log whose name includes the new date-and-time stamp. You never leave the program (so no crontab is needed), but this does close and re-open the serial port, so there's the possibility of data loss.

Grabserial also includes a feature where it will start a new log every day at midnight, when the log filename includes a date. I think we could use logic similar to this to have grabserial roll over to a new log after -e intervals (or maybe just on an hour change, but that would be very specific to your needs). The rollover would happen without closing the serial port, so the risk of losing data would be much smaller. Let me know if you'd like to work on this, or alternatively if you'd like me to take a look at it, and I'll see what I can do.
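
The idea, very roughly (this is just a sketch of the logic, not the actual grabserial code; 'sd' stands for the already-open pyserial port and 'out_pattern' for the -o filename pattern):

```python
import datetime

def capture_with_rollover(sd, interval, out_pattern):
    # the serial port 'sd' stays open for the whole run; only the output
    # file is swapped when the rotation interval elapses
    next_rotation = datetime.datetime.now() + datetime.timedelta(seconds=interval)
    logfile = open(datetime.datetime.now().strftime(out_pattern), "a")
    while True:
        data = sd.read(1)
        if data:
            logfile.write(data.decode("utf8", "replace"))
        if datetime.datetime.now() >= next_rotation:
            # roll over to a new log without touching the serial port
            logfile.close()
            logfile = open(datetime.datetime.now().strftime(out_pattern), "a")
            next_rotation += datetime.timedelta(seconds=interval)
```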

janipewter commented 4 years ago

I did experiment with the -a option, but on my test system it was not accurate enough, and after a while it would start to drift. I made a test script to start grabserial exactly at the start of a minute, with the -e 60 and -a options. After a while, I noticed it was gradually drifting later and later, and new files were being started at 4 seconds past the minute and so on. For my use case this is unacceptable; I need absolute accuracy in the splitting. I am now looking at alternative methods, for example splitting once every few hours and then running external scripts to build the split files by reading the timestamp on each line.

Edit:

If grabserial could split by timestamp, that would be the ideal solution. I'm using the built-in timestamp with a custom format, like so:

[2020-04-30 20:48:12.238364] 64 bytes from 1.1.1.1: icmp_seq=15490 ttl=58 time=12.658 ms

The ping is just for test purposes, it was the only thing I had to hand which could generate a reliable constant stream of serial output, albeit at only 5Hz (and I'll potentially be playing with 50Hz data when this is put in production).

tbird20d commented 4 years ago

OK - I added a -R feature, to specifically do rotation. I just pushed it to the master branch. Please take a look at the new usage. This should avoid data loss, and be more accurate, because it avoids doing a close and reopen, but in my testing it still drifted a bit.

Let me think about the drift and what might be causing it. It might be latencies in the read operation. Can you try it out and let me know if it works better for you?

tbird20d commented 4 years ago

One idea I have is to calculate the next rotation time as a base time plus a multiple of the specified rotation interval, instead of basing it on the current time plus the interval. That should keep things as sync'ed with wall time as possible. Let me know what you think.
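
In other words (timestamps as plain epoch seconds; the names here are just for illustration):

```python
import math

def next_rotation_drifting(now, interval):
    # what happens today: each rotation is scheduled relative to "now",
    # so any latency in the loop accumulates over time
    return now + interval

def next_rotation_anchored(base, now, interval):
    # proposed: always land on base + N * interval, so latency in one
    # rotation can't push later rotations off the wall-clock grid
    periods_elapsed = math.floor((now - base) / interval)
    return base + (periods_elapsed + 1) * interval
```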

janipewter commented 4 years ago

OK thanks. I did not realise it closed and re-opened the serial port at each split, and I did notice that even at 5Hz there were one or two lines missed during the split.

I still think some way of splitting based on the timestamp is the best approach (at least for my use case). Even a single line of data that spills over from or into the adjacent hour's file is potentially an issue. I need each hourly file to contain only lines from that hour.

I will give the updated version a test and report back. In my tests I am splitting every 60 seconds just to save time (no pun intended).

tbird20d commented 4 years ago

One more question... Is your data coming in at a steady pace? In my testing I was using a 10-second rotation interval on a Linux boot operation. The data was coming at odd intervals, and there were gaps, which may have accounted for the drift I was seeing.

janipewter commented 4 years ago

Yes, both my test data and my production data are at a steady pace.

My test data is simply a ping at 200ms intervals (the fastest you can do in FreeBSD), which works out to 5Hz, evenly spaced.

My production data is a mix of scientific instruments which all output a single line of data at evenly spaced intervals; some are faster than others, with output rates ranging from 0.1Hz to 50Hz.

janipewter commented 4 years ago

Running the new version with -R 60 and -o "/opt/output/%F_%T_ttyUSB0.txt" works until the -R duration is reached for the first time, at which point there is an exception:

  File "/opt/grabserial/grabserial", line 870, in <module>
    restart_requested = grab(sys.argv[1:])
  File "/opt/grabserial/grabserial", line 630, in grab
    out_filename = datetime.datetime.now().strftime(out_pattern)

Edit:

My full execution line:

/opt/grabserial/grabserial -S -d /dev/$device -T -F "%F %T.%f" -A -n -R 60 -o "/opt/output/%F_%T_$device.txt"

This is being called from a script, where $device is a shell variable containing the serial device name as a string.

tbird20d commented 4 years ago

Crud. Let me take a look at this. Sorry about that.

janipewter commented 4 years ago

Updated my previous reply for clarity.

tbird20d commented 4 years ago

Ok - I just pushed a new commit to fix the bug (I hope). Can you try it again? It's still drifting a bit in my testing, so I may try my 'multiples of the rotation interval' technique, to see if I can get more precision on the rotation. But I'm out of time to work on this today. Ping me tomorrow or next week if you don't hear any update from me.

I wasn't familiar with %F or %T. You learn something new every day. :-)

By the way, I checked https://docs.python.org/2/library/datetime.html and I don't see %F or %T documented, but they're clearly accepted by datetime's strftime (and they're documented in 'man strftime', the C version of the routine). Hmmm, looks like a doc bug for Python.

tbird20d commented 4 years ago

Well, I needed to read more carefully. It says on that page that you need to check the C strftime documentation for your OS.

janipewter commented 4 years ago

I've never really touched python2, only python3, and I'm a fairly heavy user of %F and %T with strftime.

Anyway, I'm testing the updated code now, and it's working but it is still drifting. I started running the grab at 20:05 (BST) with -R 60.

The 20:05 file starts as so:

[2020-05-01 20:05:00.552577] 64 bytes from 1.1.1.1: icmp_seq=20717 ttl=58 time=21.447 ms
[2020-05-01 20:05:00.750263] 64 bytes from 1.1.1.1: icmp_seq=20718 ttl=58 time=13.421 ms
[2020-05-01 20:05:00.971514] 64 bytes from 1.1.1.1: icmp_seq=20719 ttl=58 time=22.065 ms
[2020-05-01 20:05:01.169586] 64 bytes from 1.1.1.1: icmp_seq=20720 ttl=58 time=13.271 ms
[2020-05-01 20:05:01.382016] 64 bytes from 1.1.1.1: icmp_seq=20721 ttl=58 time=13.356 ms
[2020-05-01 20:05:01.595942] 64 bytes from 1.1.1.1: icmp_seq=20722 ttl=58 time=14.449 ms
[2020-05-01 20:05:01.807461] 64 bytes from 1.1.1.1: icmp_seq=20723 ttl=58 time=14.469 ms
[2020-05-01 20:05:02.020410] 64 bytes from 1.1.1.1: icmp_seq=20724 ttl=58 time=14.698 ms
[2020-05-01 20:05:02.226849] 64 bytes from 1.1.1.1: icmp_seq=20725 ttl=58 time=13.171 ms
[2020-05-01 20:05:02.439413] 64 bytes from 1.1.1.1: icmp_seq=20726 ttl=58 time=13.311 ms
[2020-05-01 20:05:02.642525] 64 bytes from 1.1.1.1: icmp_seq=20727 ttl=58 time=15.127 ms
[2020-05-01 20:05:02.851541] 64 bytes from 1.1.1.1: icmp_seq=20728 ttl=58 time=12.568 ms
[2020-05-01 20:05:03.067426] 64 bytes from 1.1.1.1: icmp_seq=20729 ttl=58 time=15.727 ms

and ends as so:

[2020-05-01 20:05:57.996164] 64 bytes from 1.1.1.1: icmp_seq=20990 ttl=58 time=23.144 ms
[2020-05-01 20:05:58.197657] 64 bytes from 1.1.1.1: icmp_seq=20991 ttl=58 time=12.623 ms
[2020-05-01 20:05:58.419273] 64 bytes from 1.1.1.1: icmp_seq=20992 ttl=58 time=22.048 ms
[2020-05-01 20:05:58.625283] 64 bytes from 1.1.1.1: icmp_seq=20993 ttl=58 time=16.157 ms
[2020-05-01 20:05:58.843222] 64 bytes from 1.1.1.1: icmp_seq=20994 ttl=58 time=21.225 ms
[2020-05-01 20:05:59.043542] 64 bytes from 1.1.1.1: icmp_seq=20995 ttl=58 time=14.411 ms
[2020-05-01 20:05:59.254963] 64 bytes from 1.1.1.1: icmp_seq=20996 ttl=58 time=13.710 ms
[2020-05-01 20:05:59.467303] 64 bytes from 1.1.1.1: icmp_seq=20997 ttl=58 time=14.238 ms
[2020-05-01 20:05:59.679354] 64 bytes from 1.1.1.1: icmp_seq=20998 ttl=58 time=14.391 ms
[2020-05-01 20:05:59.892819] 64 bytes from 1.1.1.1: icmp_seq=20999 ttl=58 time=14.665 ms
[2020-05-01 20:06:00.099941] 64 bytes from 1.1.1.1: icmp_seq=21000 ttl=58 time=14.659 ms

The 20:20 file starts as so:

[2020-05-01 20:20:02.670324] 64 bytes from 1.1.1.1: icmp_seq=25006 ttl=58 time=23.686 ms
[2020-05-01 20:20:02.872241] 64 bytes from 1.1.1.1: icmp_seq=25007 ttl=58 time=13.663 ms
[2020-05-01 20:20:03.084444] 64 bytes from 1.1.1.1: icmp_seq=25008 ttl=58 time=13.675 ms
[2020-05-01 20:20:03.287983] 64 bytes from 1.1.1.1: icmp_seq=25009 ttl=58 time=14.050 ms
[2020-05-01 20:20:03.491762] 64 bytes from 1.1.1.1: icmp_seq=25010 ttl=58 time=12.484 ms
[2020-05-01 20:20:03.714893] 64 bytes from 1.1.1.1: icmp_seq=25011 ttl=58 time=22.678 ms
[2020-05-01 20:20:03.917647] 64 bytes from 1.1.1.1: icmp_seq=25012 ttl=58 time=13.102 ms
[2020-05-01 20:20:04.139022] 64 bytes from 1.1.1.1: icmp_seq=25013 ttl=58 time=21.416 ms
[2020-05-01 20:20:04.352471] 64 bytes from 1.1.1.1: icmp_seq=25014 ttl=58 time=22.696 ms
[2020-05-01 20:20:04.550637] 64 bytes from 1.1.1.1: icmp_seq=25015 ttl=58 time=13.787 ms
[2020-05-01 20:20:04.764537] 64 bytes from 1.1.1.1: icmp_seq=25016 ttl=58 time=14.938 ms

and ends as so:

[2020-05-01 20:20:58.997206] 64 bytes from 1.1.1.1: icmp_seq=25274 ttl=58 time=24.585 ms
[2020-05-01 20:20:59.197129] 64 bytes from 1.1.1.1: icmp_seq=25275 ttl=58 time=16.486 ms
[2020-05-01 20:20:59.407113] 64 bytes from 1.1.1.1: icmp_seq=25276 ttl=58 time=13.716 ms
[2020-05-01 20:20:59.606378] 64 bytes from 1.1.1.1: icmp_seq=25277 ttl=58 time=12.004 ms
[2020-05-01 20:20:59.819949] 64 bytes from 1.1.1.1: icmp_seq=25278 ttl=58 time=15.008 ms
[2020-05-01 20:21:00.041142] 64 bytes from 1.1.1.1: icmp_seq=25279 ttl=58 time=23.277 ms
[2020-05-01 20:21:00.237453] 64 bytes from 1.1.1.1: icmp_seq=25280 ttl=58 time=13.027 ms
[2020-05-01 20:21:00.449523] 64 bytes from 1.1.1.1: icmp_seq=25281 ttl=58 time=12.516 ms
[2020-05-01 20:21:00.671200] 64 bytes from 1.1.1.1: icmp_seq=25282 ttl=58 time=21.551 ms
[2020-05-01 20:21:00.874522] 64 bytes from 1.1.1.1: icmp_seq=25283 ttl=58 time=13.614 ms
[2020-05-01 20:21:01.086029] 64 bytes from 1.1.1.1: icmp_seq=25284 ttl=58 time=13.271 ms
[2020-05-01 20:21:01.293286] 64 bytes from 1.1.1.1: icmp_seq=25285 ttl=58 time=13.487 ms
[2020-05-01 20:21:01.505726] 64 bytes from 1.1.1.1: icmp_seq=25286 ttl=58 time=13.929 ms
[2020-05-01 20:21:01.717664] 64 bytes from 1.1.1.1: icmp_seq=25287 ttl=58 time=14.158 ms
[2020-05-01 20:21:01.929438] 64 bytes from 1.1.1.1: icmp_seq=25288 ttl=58 time=13.817 ms
[2020-05-01 20:21:02.143920] 64 bytes from 1.1.1.1: icmp_seq=25289 ttl=58 time=16.105 ms
[2020-05-01 20:21:02.366423] 64 bytes from 1.1.1.1: icmp_seq=25290 ttl=58 time=30.498 ms

As you can see, the drift appears to be increasing linearly. Also, this is only with 5Hz data input; I'm sure it would be far worse at 50Hz.

tbird20d commented 4 years ago

Wow, a 2.6 second gap in the log files! I changed the code, and the time of the rotations should now be much more consistent. However, the log rotation itself seems to be taking about 2 seconds. Can you try again, and let me know the numbers? Also, can you try a run where you use '-t' instead of '-T -F "%F %T.%f"'? I'm a bit worried about the overhead of the strftime format when recording the line timestamp.

janipewter commented 4 years ago

Look again closely; these are not adjacent log files :-) I'm just giving an example of how much the drift has increased after 15 minutes of logging. I would expect to see HH:MM:00 at the beginning of each file and HH:MM:59 at the end of each file. But as you can see, some files are ending with the first few lines from the following minute, rather than ending with the final line of the current minute.

I don't think there are any gaps in the logging, the lines are just spilling over between files and the amount of lines spilling over is increasing linearly the longer the logging continues.

Edit:

I am now running a more controlled test. Rather than the ping, I have a simple while loop which just echoes a line of sample data (similar to what one of the instruments gives out, although all values are static) followed by a 200ms sleep. I'll leave this running overnight with -t and -R 60 and see if there has been any drift in the morning.
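
For reference, the generator is roughly equivalent to this (mine is actually a shell while-loop writing out over the serial link; the sample line below matches one of our instruments with static values):

```python
#!/usr/bin/env python3
# Emit one static sample line every 200 ms (5Hz), roughly what my shell
# while-loop does; in the real test this goes out over the serial link.
import sys
import time

SAMPLE = "Q,-000.478,-006.135,-000.927,M,+015.06,00,02"

while True:
    sys.stdout.write(SAMPLE + "\n")
    sys.stdout.flush()
    time.sleep(0.2)
```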

tbird20d commented 4 years ago

Great. Thanks! I did my own testing with something running at about 160Hz, and it did OK in short runs. Note that the rotation check is per-character, so lines can be split between logs.

tbird20d commented 4 years ago

I can adjust that if needed though.

janipewter commented 4 years ago

OK, here we are: 17 or so hours of continuous logging of 5Hz data, with -t -R 60 and the filename as YYYY-MM-DD_HH:MM:SS_ttyUSB1.txt. Annoyingly, I forgot that Windows doesn't like colons in filenames, but you can still make them out.

Notice there is considerable drift in the file rotation; by 07:45 there is a full minute of drift and the log files jump from 07:44:59 to 07:46:00. When I stopped the logging at 17:09 you can see it had almost hit another full minute of drift.

I was also not aware that the rotation was per-character. This is a problem for me; I need it done by line. If the first character of a line comes in at 12:05:59.999999, I need the entire line to be in the 12:05 file.

serial.zip

tbird20d commented 4 years ago

OK - I have modified grabserial so that by default it only rotates the logs on a newline boundary. There is a flag to override this behavior (-z or --split-lines), but I think it will be little-used, as I suspect most people wanting to rotate logs will want them rotated with full lines.
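
Conceptually the check now works like this (illustrative sketch only, not the literal source):

```python
def should_rotate(ch, now, next_rotation, split_lines=False):
    # the rotation time is still checked per character, but unless
    # -z/--split-lines is given the actual switch to a new log file
    # waits for the newline that ends the current line
    if now < next_rotation:
        return False
    return split_lines or ch == "\n"
```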

With regard to the accuracy - are you sure you are using the latest code? I have been doing some testing here, and even with the latencies caused by splitting the logs only on newline boundaries, I've been running a test and seen no more than 10 milliseconds of variation in log rotation start times (which you can see down to microsecond resolution if you use -v) over a 6-hour period. I updated the version number to 2.1.1. Please make sure you are using that one (use 'grabserial --version' to check).

Let me know if you are using 2.1.1 and still seeing drift.

janipewter commented 4 years ago

Updated to 2.1.1 and running a test now. Will report back later with results.

Thanks

janipewter commented 4 years ago

OK, this seems to be a big improvement. I ran it for a while with ~10Hz serial data using -t -R 60 and the rotation seemed very accurate. I did notice that very rarely it was rotating a split second too late, and catching the very first line of the next minute at the end of the file, but I can live with that.

Now I'm testing it using -T -F "%F %T.%f" -R 60 and it seems very accurate too - the only problem I have is actually starting it accurately. At the moment I have a simple bash script which waits until the start of the next minute, but it evaluates the time until the next minute at whole-second accuracy, so it's pretty hit-and-miss whether it starts at the beginning of the second or not. If it starts towards the end of the second, I will end up with quite a few lines spilling over between files when it rotates, especially when it's logging at 50Hz.

Do you have any suggestion how to get around this? I think I am almost there!

tbird20d commented 4 years ago

I have some ideas. Note that since the rotation is precise, you should get exactly one time period's worth of data in each file, even if the rotation times don't align with wall-clock times. However, if you really want the data to align with wall times, there are a few things you can do. One is to use a script with sub-second sleep precision to start grabserial. GNU sleep actually takes sub-second values for time. Try this: 'for i in $(seq 10) ; do echo $i ; sleep 0.1 ; done | timeit' (timeit is another tool from the grabserial repository that does timing of standard input). You would have to measure grabserial's startup time and compensate for that (e.g. starting grabserial 43 milliseconds before the hour, or something like that). This might get you to within a few milliseconds of the correct time, but it would still be subject to things like machine load when the programs are started.
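
For example, a small launcher along these lines (the grabserial arguments are only an example, and this doesn't compensate for grabserial's own startup time):

```python
#!/usr/bin/env python3
# Sleep with sub-second precision until the next whole-minute boundary,
# then exec grabserial; adjust INTERVAL and the arguments for your setup.
import os
import time

INTERVAL = 60  # seconds; use 3600 to align on the hour

now = time.time()
start = (int(now) // INTERVAL + 1) * INTERVAL
time.sleep(start - now)
os.execvp("grabserial", ["grabserial", "-d", "/dev/ttyUSB0", "-t",
                         "-R", "60", "-o", "/opt/output/%F_%T_ttyUSB0.txt"])
```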

Another option would be to round the rotate_time in grabserial internally to some round number of seconds, minutes or hours, and spin internally until that first rotate time before starting the main loop. I don't think this would be very hard, but it's a very niche feature so I have some hesitation to add it. Instead of adding yet another command line option, I might do something like add a character to the rotation time to indicate rounding: something like '-R 10s' instead of just '-R 10', to indicate rounding to the nearest second. Let me know what you think.
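
Conceptually, something like this (purely hypothetical, not committed code):

```python
import math

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

def first_rotation(now, spec):
    # spec examples: "10" (no rounding), "10s", "1m", "1h"
    if spec[-1] in UNIT_SECONDS:
        interval = int(spec[:-1]) * UNIT_SECONDS[spec[-1]]
        # round the first rotation up to the next whole boundary of the
        # unit, so later rotations stay aligned with wall-clock time
        first = (math.floor(now / interval) + 1) * interval
        return first, interval
    interval = int(spec)
    return now + interval, interval
```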

janipewter commented 4 years ago

This sounds like it would work perfectly. In production, I will be rotating the log files hourly, and I'd have the file naming as %Y-%m-%d_%H and use the append switch in grabserial, so it wouldn't matter if the logging didn't start perfectly at the beginning of the hour as it'd just append to the current hour file, and then split automatically at the start of the next hour. That's exactly what I need.

tbird20d commented 4 years ago

OK - please try version 2.1.2 of grabserial, and let me know if it does what you need.

janipewter commented 4 years ago

Thanks for the update. I've been testing it all day and this is awesome; it's exactly what I need and it seems to work very well. Sorry to keep requesting features, but when using for example -R 1m, is it possible to have grabserial start logging immediately when started, rather than waiting for the start of the next minute? At the moment it waits until the start of the next minute before it will begin logging, even when using -A.

I also noticed that each 1-minute file ends with a single entry from the start of the next minute. Example:

Last 5 lines of 2020-05-15_15:42:00_ttyUSB0.txt

[2020-05-15 15:42:59.590032] Q,-000.478,-006.135,-000.927,M,+015.06,00,02
[2020-05-15 15:42:59.701032] Q,-000.478,-006.135,-000.927,M,+015.06,00,02
[2020-05-15 15:42:59.812033] Q,-000.478,-006.135,-000.927,M,+015.06,00,02
[2020-05-15 15:42:59.923034] Q,-000.478,-006.135,-000.927,M,+015.06,00,02
[2020-05-15 15:43:00.033996] Q,-000.478,-006.135,-000.927,M,+015.06,00,02

Is this a system timing issue? Either way it's not a major issue; our current logging solution also does this, so we have built most of our processing code around it (mostly just dropping the last line of each file).

Thanks again for all your work.

tbird20d commented 4 years ago

OK - I took out the delayed starting. Please try version 2.1.3 and let me know if there are any remaining issues.