Feature: optionally include first record in data payload

I am looking into the possibility that dbdreader might be skipping the first data record. I am putting together a test package of 9 pairs of science and glider data segments, which I will share soon along with the code I am using to demonstrate the issue.

In a nutshell, converting an sbd and a tbd file with the slocum linux binaries and then merging the results reveals that the first record appears to be missing from the data decoded by dbdreader.

Here are the first couple of records as decoded using the slocum linux binaries:

===> unit_507-2022-078-0-0
       m_present_time m_altitude
3    1647787992.10013    52.7949
18   1647788315.34152    52.7949
206   1647789281.2258    94.1087
238  1647789440.33847    71.7375

Here are the first couple of records from dbdreader:

1647788315.3415222 52.79487228393555
1647789281.2257996 94.1086654663086
1647789440.3384705 71.73748779296875
1647789559.9006653 56.38461685180664
1647789649.5844421 46.008548736572266

Everything matches, except that the 1st record from dbdreader corresponds to the 2nd record from the slocum binary decoder.
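
A quick way to confirm which records differ is to match timestamps between the two decoders with a small tolerance. The sketch below uses only the values quoted above; the helper name unmatched and the 1 ms tolerance are assumptions made for illustration, not part of dbdreader or the slocum tools.

```python
# m_present_time values copied from the two listings above.
merge_times = [1647787992.10013, 1647788315.34152,
               1647789281.2258, 1647789440.33847]
dbdreader_times = [1647788315.3415222, 1647789281.2257996,
                   1647789440.3384705, 1647789559.9006653,
                   1647789649.5844421]


def unmatched(reference, candidate, tol=1e-3):
    """Return values in `reference` with no partner in `candidate` within `tol` seconds."""
    return [t for t in reference
            if not any(abs(t - c) <= tol for c in candidate)]


# Records decoded by the slocum binaries that dbdreader did not return.
missing = unmatched(merge_times, dbdreader_times)
print(missing)
```

With these values only the first timestamp from the slocum binaries is left without a partner.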

I will post a link soon to the dataset, cache files and code that I am working with. The general pattern seems to persist with any pair of science and glider files, so you should be able to reproduce with existing sample data.

toasc="/home/cermak/src/GUTILS/gutils/slocum/bin/dbd2asc"
merge="/home/cermak/src/GUTILS/gutils/slocum/bin/dba_merge"
cnfig="config"
bndir="binary"

${toasc} -o -c ${cnfig} ${bndir}/unit_507-2022-078-0-0.sbd > ascii/0-0-sbd.dba
${toasc} -o -c ${cnfig} ${bndir}/unit_507-2022-078-0-0.tbd > ascii/0-0-tbd.dba
${merge} ascii/0-0-sbd.dba ascii/0-0-tbd.dba > ascii/unit_507_2022_078_0_0_sbd.dat

The ultimate goal is to allow us to replace the slocum binary tools with this library in its entirety. I only just noticed this behaviour and want to know whether it is a bug or whether there is a reason the first record is skipped.

Sample datasets: https://nasfish.fish.washington.edu/echotools/datasets/dbdreader/dbdreader_20230430.zip

I believe the glider in question is a Slocum G3. Will continue to investigate as well. This is a useful tool for climbing around the binary data looking at the data records: https://hexed.it/

Results from the supplied Python script:

$ python ./cmpData.py 

===> ascii/unit_507_2022_078_0_0_sbd.dat
===> unit_507-2022-078-0-0
Result from merge_dba:
       m_present_time m_altitude
3    1647787992.10013    52.7949
18   1647788315.34152    52.7949
206   1647789281.2258    94.1087
238  1647789440.33847    71.7375
263  1647789559.90067    56.3846
282  1647789649.58444    46.0085
300  1647789722.63565    37.4689
312  1647789778.15817    31.6313
322  1647789821.28613    27.6911
328  1647789855.35016    24.5446
334  1647789885.24399    21.6618
338  1647789906.44281    19.8181
693  1647793734.86829    94.3285
706  1647793897.18198    72.9194
715  1647794024.84744    58.9133
724  1647794123.18723    48.5177
733  1647794200.28311    40.5177
739  1647794260.27444    34.4115
745  1647794307.54941    28.9792
750  1647794346.17276    24.9158
754  1647794376.31052    21.8657
758    1647794398.061    20.1404

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude
1647788315.3415222   52.7948723
1647789281.2257996   94.1086655
1647789440.3384705   71.7374878
1647789559.9006653   56.3846169
1647789649.5844421   46.0085487
1647789722.6356506   37.4688644
1647789778.1581726   31.6312580
1647789821.2861328   27.6910858
1647789855.3501587   24.5445671
1647789885.2439880   21.6617832
1647789906.4428101   19.8180714
1647793734.8682861   94.3284531
1647793897.1819763   72.9194107
1647794024.8474426   58.9133072
1647794123.1872253   48.5177040
1647794200.2831116   40.5177040
1647794260.2744446   34.4114761
1647794307.5494080   28.9792423
1647794346.1727600   24.9157505
1647794376.3105164   21.8656902
1647794398.0610046   20.1404152

===> ascii/unit_507_2022_078_0_1_sbd.dat
===> unit_507-2022-078-0-1
Result from merge_dba:
     m_present_time m_altitude
3  1647797262.02112    20.1404

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude

===> ascii/unit_507_2022_078_0_2_sbd.dat
===> unit_507-2022-078-0-2
Result from merge_dba:
       m_present_time m_altitude
3    1647797523.63519    20.1404
269   1647799166.7489    93.7827
305  1647799337.60144    74.7497
332  1647799470.16937    61.5226
355  1647799577.30692    50.7631
375  1647799662.93506    43.1514
391  1647799732.08231    37.5751
405  1647799788.09076    32.1587
417  1647799835.50098    28.0134
426  1647799873.98682    24.8877
433  1647799904.10349    22.5629
439  1647799925.58719    21.2222
447   1647799947.7655    19.8462
802  1647803534.71719    13.6276

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude
1647799166.7489014   93.7826614
1647799337.6014404   74.7496948
1647799470.1693726   61.5225868
1647799577.3069153   50.7631264
1647799662.9350586   43.1514053
1647799732.0823059   37.5750923
1647799788.0907593   32.1587296
1647799835.5009766   28.0134315
1647799873.9868164   24.8876686
1647799904.1034851   22.5628815
1647799925.5871887   21.2222214
1647799947.7655029   19.8461533
1647803534.7171936   13.6275949

===> ascii/unit_507_2022_078_0_3_sbd.dat
===> unit_507-2022-078-0-3
Result from merge_dba:
     m_present_time m_altitude
3  1647805656.99963    13.6276

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude

===> ascii/unit_507_2022_078_0_4_sbd.dat
===> unit_507-2022-078-0-4
Result from merge_dba:
       m_present_time m_altitude
3     1647805913.8418    13.6276
310  1647807770.36847    94.5678
347  1647807945.22488    77.7473
378  1647808090.59744    65.5983
405  1647808210.89642    55.3773
426   1647808308.9747    47.5531
444  1647808390.28259    42.4994
459  1647808458.91168    38.1319
474   1647808519.4924    33.5287
483  1647808566.51901    28.5751
491  1647808605.02762    24.6654
499  1647808635.14502    21.4286
504  1647808656.79953    19.3919
812  1647811881.11328    94.1331
827  1647812051.53488    73.4225
838  1647812179.75671      59.16
846  1647812277.96017    48.0134
855   1647812355.0513    39.6093
862  1647812415.10129    33.8181
868  1647812462.21277    30.1575
874   1647812505.2868    26.9585
878  1647812539.61273    24.2271
883  1647812569.69446    21.7875
886  1647812591.33939    20.3626
889  1647812613.59735    19.6349

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude
1647807770.3684692   94.5677643
1647807945.2248840   77.7472534
1647808090.5974426   65.5982895
1647808210.8964233   55.3772888
1647808308.9747009   47.5531120
1647808390.2825928   42.4993896
1647808458.9116821   38.1318665
1647808519.4924011   33.5286942
1647808566.5190125   28.5750923
1647808605.0276184   24.6654453
1647808635.1450195   21.4285717
1647808656.7995300   19.3919411
1647811881.1132812   94.1330872
1647812051.5348816   73.4224701
1647812179.7567139   59.1599503
1647812277.9601746   48.0134315
1647812355.0513000   39.6092796
1647812415.1012878   33.8180695
1647812462.2127686   30.1575089
1647812505.2868042   26.9584866
1647812539.6127319   24.2271061
1647812569.6944580   21.7875462
1647812591.3393860   20.3626366
1647812613.5973511   19.6349201

===> ascii/unit_507_2022_078_0_5_sbd.dat
===> unit_507-2022-078-0-5
Result from merge_dba:
     m_present_time m_altitude
3  1647815396.73434    19.6349

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude

===> ascii/unit_507_2022_078_0_6_sbd.dat
===> unit_507-2022-078-0-6
Result from merge_dba:
       m_present_time m_altitude
3    1647815654.83646    19.6349
286  1647817406.83234    94.8938
321  1647817581.85333    79.1966
350  1647817726.98529    65.4371
372  1647817846.62964     55.663
392  1647817940.35114    47.6337
409  1647818021.61417    41.8694
426  1647818098.89392     36.707
437  1647818154.73715     31.779
446  1647818197.41885    27.6923
454  1647818232.05643    24.5665
460  1647818261.78415     21.293
464  1647818283.13773     19.873
783  1647821757.74164    94.6728
799   1647821928.4733    74.3297
811  1647822061.01889    60.4823
824  1647822168.23782     51.022
832  1647822253.58789    43.9597
840  1647822326.47488    38.0354
846  1647822386.16583    33.2357
852   1647822433.4245    29.5238
856  1647822471.92181     26.453
860  1647822506.22208    23.9096
864  1647822531.99493    21.9512
866  1647822553.21835    20.9048
869  1647822575.05814    20.3529

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude
1647817406.8323364   94.8937759
1647817581.8533325   79.1965790
1647817726.9852905   65.4371185
1647817846.6296387   55.6630020
1647817940.3511353   47.6337013
1647818021.6141663   41.8693542
1647818098.8939209   36.7069588
1647818154.7371521   31.7789993
1647818197.4188538   27.6923084
1647818232.0564270   24.5665455
1647818261.7841492   21.2930412
1647818283.1377258   19.8730164
1647821757.7416382   94.6727753
1647821928.4732971   74.3296738
1647822061.0188904   60.4822960
1647822168.2378235   51.0219765
1647822253.5878906   43.9597054
1647822326.4748840   38.0354080
1647822386.1658325   33.2356529
1647822433.4244995   29.5238094
1647822471.9218140   26.4529915
1647822506.2220764   23.9096451
1647822531.9949341   21.9511604
1647822553.2183533   20.9047623
1647822575.0581360   20.3528690

===> ascii/unit_507_2022_078_0_7_sbd.dat
===> unit_507-2022-078-0-7
Result from merge_dba:
     m_present_time m_altitude
3  1647825477.68988    20.3529

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude

===> ascii/unit_507_2022_078_0_8_sbd.dat
===> unit_507-2022-078-0-8
Result from merge_dba:
       m_present_time m_altitude
3    1647825745.16934    20.3529
286  1647827447.84875    94.2759
319  1647827618.26187    75.7827
347  1647827755.00137    62.0159
371    1647827866.345    52.6691
391  1647827956.66812    45.6886
408  1647828033.78882    39.8816
424  1647828098.60324    35.4396
437  1647828154.23065    31.1795
449  1647828197.57965    28.1831
459  1647828236.18674    25.4481
467  1647828266.46698    23.3175
474  1647828292.55502     21.409
480  1647828314.03809    20.0989
860   1647832184.2858     94.221
876  1647832359.24014    76.7314
890  1647832500.42477    64.0843
900   1647832615.9827    54.8486
909  1647832714.33966    48.5299
917  1647832795.50861    42.7045
925  1647832864.11362     37.917
931  1647832923.93713    33.7045
938  1647832975.67633    30.1197
945  1647833018.73175    26.3724
949  1647833053.03363     23.851
952  1647833078.86322    22.1734
956  1647833100.62231    21.1795
959  1647833122.40887    19.6838

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude
1647827447.8487549   94.2759476
1647827618.2618713   75.7826614
1647827755.0013733   62.0158730
1647827866.3450012   52.6691093
1647827956.6681213   45.6886444
1647828033.7888184   39.8815613
1647828098.6032410   35.4395599
1647828154.2306519   31.1794872
1647828197.5796509   28.1831493
1647828236.1867371   25.4481068
1647828266.4669800   23.3174610
1647828292.5550232   21.4090347
1647828314.0380859   20.0989017
1647832184.2857971   94.2210007
1647832359.2401428   76.7313766
1647832500.4247742   64.0842514
1647832615.9826965   54.8485947
1647832714.3396606   48.5299149
1647832795.5086060   42.7045174
1647832864.1136169   37.9169731
1647832923.9371338   33.7045174
1647832975.6763306   30.1196575
1647833018.7317505   26.3724060
1647833053.0336304   23.8510380
1647833078.8632202   22.1733818
1647833100.6223145   21.1794872
1647833122.4088745   19.6837616

===> ascii/unit_507_2022_078_0_9_sbd.dat
===> unit_507-2022-078-0-9
Result from merge_dba:
     m_present_time m_altitude
3  1647835760.89847    19.6838

Result from dbdreader, get_list m_altitude:
m_present_time       m_altitude

I did some tracing into the C code and am wondering if there is an initial mismatch:

static unsigned char read_known_cycle(FILE *fd)
{
  // the first 2 bytes are:
  // s                  Cycle Tag (this is an ASCII s char).
  // a                  One byte integer.
  // but just skip over them
  int pos = ftell(fd);
  fseek(fd, pos + 2, 0);

  // followed by, the value we want to check for:
  // 0x1234             Two byte integer.
  // which is 4660
  unsigned short two_byte_int;
  fread((void*)(&two_byte_int), sizeof(two_byte_int), 1, fd);
  //printf("two_byte_int : %d\n", two_byte_int);

  // the next 12 bytes are:
  //     123.456            Four byte float.
  //     123456789.12345    Eight byte double.
  // but by this point we already know the byte order, so just skip the bytes
  pos = ftell(fd);
  fseek(fd, pos + 13, 0);

The first steps look correct: skip over the first two bytes, which mark the start of the cycle header (sa), then read the next two bytes, which help determine the byte order. The comment on the last part says to skip 12 bytes, but the code skips to pos + 13, not the pos + 14 that I expected (12 bytes to skip + the two-byte integer)?

That is as far as I have traced at the moment. Using the binary snooper I can clearly see the first record of data in the sbd that is currently skipped.

Hi @jr3cermak,

Thank you for doing such a thorough investigation and providing the test data and scripts. Yes, dbdreader skips the first line of data. On purpose.

What I noticed when I started coding dbdreader a long time ago is that all parameters are flagged as "UPDATED" in the first state bytes section of each file. I also noticed that the second entry usually comes some considerable time later. My assessment is that it is very unlikely that all parameters are measured at the time of file creation. Nevertheless, they all get published in the dbd file, and I suspect they take whatever value is in memory. Either these values are nonsense, or they were measured some (long) time before. In either case these data points have no scientific value, so the first data line is skipped.

As for the number of bytes to skip being 12, 13 or 14: 13 is the correct number. In earlier versions of dbdreader I would skip 17 bytes, a number I found by trial and error, as only then would the decoding of the state bytes make sense. These bytes were also always the same. The glider manual that describes the data format did not mention skipping these bytes, though. Later, with the arrival of G3 gliders, which use little endian byte order as opposed to the big endian of persistor based gliders, @erinaceous used these bytes to test the byte order: after the leading 's' and 'a' they contain 12 34 as two bytes, 4 bytes representing the float 123.456 and 8 bytes representing the double 123456789.12345. The next byte is always 'd', or 64 in hex. Together these make up the 17 bytes that I initially needed to skip; the 13 in the current code covers the float, the double and the 'd' that follow the two-byte integer.
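
The byte count can be checked with Python's struct module. This is only a sketch: it packs a synthetic block using the field layout from the C comments quoted earlier ('s', 'a', 0x1234, 123.456, 123456789.12345) plus the trailing 'd'. The function name build_known_cycle is hypothetical, and no real glider file is read.

```python
import struct


def build_known_cycle(endian):
    """Pack a synthetic known-bytes cycle followed by the 'd' tag.

    `endian` is '<' (little endian) or '>' (big endian). Standard struct
    sizes are used, so no padding is inserted between the fields.
    """
    return struct.pack(endian + "cchfd", b"s", b"a", 0x1234,
                       123.456, 123456789.12345) + b"d"


header = build_known_cycle("<")
total = len(header)          # whole block, including the 'd' tag
after_tag = total - 2        # left after skipping 's' and 'a'
after_int = after_tag - 2    # left after reading the two-byte integer
print(total, after_int)      # 17 13
```

The 13 bytes left after the two-byte integer are the 4-byte float, the 8-byte double and the one-byte 'd'.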

So in summary, I don't think that this is an issue, unless it is your opinion that you require the first data line as well, even though it does not contain any useful information.

Thank you @smerckel for the confirmation that the first record is intentionally skipped at the moment.

Let me do some checking with my upstream processing to see what is typically done with the first record.

There might be a desire to have two additional options: (1) allow passing all values to reproduce the original slocum binary behavior; (2) at least pass along the timestamp with the remaining requested columns filled with nan or missing values.

@smerckel I disagree that the first line never has value. In my opinion it should be optional to keep or drop the first line, with dropping as the default. The initialization line is useful for diagnostic purposes more than for science purposes. I think it makes sense to almost always drop the first line of the science files, as pretty much all initialization values for the instruments are zero (although an exception may be something like the card data space used). However, reading the initialization values of the flight files has some use, especially if someone ever needs to review data from only 1 segment (and therefore only 1 dbd/sbd file) for diagnostics. Many sensors/variables only have a value in that first initialization line (e.g. m_science_on, m_why_started, u_alpha_system_clock_lags_gps, etc.) and not again for the rest of the file. It may therefore be useful to see what values certain variables initialize with before starting the segment. I would strongly encourage you to make it an option to keep the initialization line.

As for the first 17 bytes: according to Dave Pingal's (of TWR) original binary reading python package (which he called pyslocum, but I don't think it was ever published anywhere), the first 17 bytes (or 16 plus a tag) are used to determine the endianness of the binary file. I believe this was specifically helpful for G3 vs G2 data across different platforms. Here is his check_binary method, which checks the endianness from a dbdfile class:

    def check_binary(self):
        test_pat = self.ifile.read(16)
        (tag, byte1, byte2, byte4, byte8) = struct.unpack('>cchfd', test_pat)
        (_, _, byte2, _, _) = struct.unpack('<cchfd', test_pat)
        if byte2 == 4660:
            endian = '<'
        else:
            endian = '>'
        return endian

And yes, I am aware that he unpacks the pattern twice, with the second call overwriting byte2.
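
For comparison, here is a standalone variant of the same idea that unpacks the pattern once per byte order and also checks the float and double. It is a sketch, not the pyslocum or dbdreader implementation; the name detect_endianness and the tolerances are assumptions.

```python
import math
import struct


def detect_endianness(test_pat):
    """Return '<' or '>' for a 16-byte known-bytes pattern.

    Raises ValueError if the length is wrong or neither byte order
    reproduces the expected values 0x1234, 123.456 and 123456789.12345.
    """
    if len(test_pat) != 16:
        raise ValueError("expected exactly 16 bytes")
    for endian in ("<", ">"):
        _tag, _one, two, four, eight = struct.unpack(endian + "cchfd", test_pat)
        if (two == 0x1234
                and math.isclose(four, 123.456, rel_tol=1e-6)
                and math.isclose(eight, 123456789.12345, rel_tol=1e-12)):
            return endian
    raise ValueError("known-bytes pattern not recognised")


# Synthetic big-endian pattern, packed here only to exercise the function.
sample = struct.pack(">cchfd", b"s", b"a", 0x1234, 123.456, 123456789.12345)
print(detect_endianness(sample))   # >
```

Checking all three values rather than only the two-byte integer makes a corrupted header fail loudly instead of silently defaulting to big endian.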

@s-pearce, thanks for the explanation. I can see the point of having the ability to extract the first line too. I have coded that now, and you can set the behaviour through the class variable SKIP_INITIAL_LINE of the DBD class. In this way, you set the behaviour of DBD and MultiDBD by setting one variable once, and it affects all future calls to the get*() methods.

The change is applied to the master branch for now. I made some other changes to bring uniformity in the return values of the get* methods. After updating the manual, I think the code can then be released as version 0.5.0.

Thank you @jr3cermak, @s-pearce for the feedback.

Ooh, could I recommend you make that an instance variable rather than a class-level variable?

Example: using dbdreader in multi-threaded code that reads slocum data as it comes off multiple different gliders, so receiving .tbd and .sbd files out of order and/or instantiating MultiDBDs for different gliders concurrently. I could see wanting to enable getting the initial line for the extra engineering info on one glider but not on another, in parallel.

I think I would agree with @erinaceous. I just tried out the new MultiDBD with two instances of the same set of sbd files. Changing the SKIP_INITIAL_LINE variable between the creation of the 2 instances changes both instances to use the new value of SKIP_INITIAL_LINE, unless I have already read out the variables I want from the first instance. Presumably this is because the instance exists as a buffered object. I think an instance variable might prevent this potentially unexpected behavior.

set_a = MultiDBD("ce_382-2022-024-4-[0-7].sbd", cacheDir=LCACHEDIR)
DBD.SKIP_INITIAL_LINE = False
set_b = MultiDBD("ce_382-2022-024-4-[0-7].sbd", cacheDir=LCACHEDIR)
ta, deptha = set_a.get("m_depth")
tb, depthb = set_b.get("m_depth")
len(ta) == len(tb)
Out[93]: True

but if I assign the variables before changing SKIP_INITIAL_LINE:

set_a = MultiDBD("ce_382-2022-024-4-[0-7].sbd", cacheDir=LCACHEDIR)
ta, deptha = set_a.get("m_depth")
DBD.SKIP_INITIAL_LINE = False
set_b = MultiDBD("ce_382-2022-024-4-[0-7].sbd", cacheDir=LCACHEDIR)
tb, depthb = set_b.get("m_depth")
len(ta) == len(tb)
Out[107]: False
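
One plausible reading of the two sessions above is that the class-level flag is consulted when get() runs rather than when the object is created. The toy model below reproduces that pattern; ToyReader and its data are hypothetical and this is not dbdreader code.

```python
class ToyReader:
    """Hypothetical stand-in for a reader with a class-level policy flag."""
    SKIP_INITIAL_LINE = True

    def __init__(self, records):
        self._records = list(records)

    def get(self):
        # The flag is looked up on the class at call time, so every
        # instance sees whatever value is current when get() runs.
        start = 1 if ToyReader.SKIP_INITIAL_LINE else 0
        return self._records[start:]


set_a = ToyReader([10.0, 11.0, 12.0])
ToyReader.SKIP_INITIAL_LINE = False      # changed after set_a was created
set_b = ToyReader([10.0, 11.0, 12.0])
same_length = len(set_a.get()) == len(set_b.get())
print(same_length)                       # True: both include the first record
```

Calling set_a.get() before the flag is changed would instead return the shorter list, which matches the second session.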

Also, thanks for adding this feature. It is much appreciated.

@s-pearce : the behaviour you describe was actually intentional. Setting DBD.SKIP_INITIAL_LINE sets how the DBD instance will treat the initial lines of the binary files to be read (until its value is changed again). I considered this more as a policy: set once, preferably at the top, and then process the dbd files.

I can see the point of @erinaceous too, although it is a bit hypothetical. Making SKIP_INITIAL_LINE an instance variable gives more fine-grained control over how the dbd files are read, at the expense of more coding on the user's side.

I could add a keyword skip_initial_line to both the DBD and MultiDBD class constructors, which then sets the behaviour for all get() methods invoked on this instance. You could then also set the attribute directly during the lifetime of the instance. I would prefer this over making skip_initial_line a keyword of all get() methods.

The example above would then be something like:

set_a = MultiDBD("ce_382-2022-024-4-[0-7].sbd", cacheDir=LCACHEDIR)

set_b = MultiDBD("ce_382-2022-024-4-[0-7].sbd", cacheDir=LCACHEDIR, skip_initial_line=False)

ta, deptha = set_a.get("m_depth")
tb, depthb = set_b.get("m_depth")

len(ta) == len(tb) # => False

Perhaps one of you guys may have a better alternative for the keyword name?

drop_first_data_entry
skip_initial_data_point
...

Because the individual DBD instances are created in the constructor of MultiDBD, changing the behaviour of each DBD afterwards requires a new method on MultiDBD, something like set_skip_initial_line(boolean value).
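
The proposal can be sketched with two toy classes: a constructor keyword sets the per-instance default, and a setter on the container forwards a later change to each child. ToyDBD and ToyMultiDBD are hypothetical stand-ins that hold plain lists; the actual dbdreader implementation may differ.

```python
class ToyDBD:
    """Hypothetical single-file reader holding a list of records."""

    def __init__(self, records, skip_initial_line=True):
        self._records = list(records)
        self.skip_initial_line = skip_initial_line

    def get(self):
        # Per-instance attribute: other readers are unaffected.
        return self._records[1:] if self.skip_initial_line else self._records[:]


class ToyMultiDBD:
    """Hypothetical container that creates one ToyDBD per file."""

    def __init__(self, files, skip_initial_line=True):
        self.dbds = [ToyDBD(records, skip_initial_line) for records in files]

    def set_skip_initial_line(self, value):
        # Forward the new policy to every child reader.
        for dbd in self.dbds:
            dbd.skip_initial_line = bool(value)

    def get(self):
        out = []
        for dbd in self.dbds:
            out.extend(dbd.get())
        return out


files = [[1.0, 2.0, 3.0], [4.0, 5.0]]
set_a = ToyMultiDBD(files)
set_b = ToyMultiDBD(files, skip_initial_line=False)
print(len(set_a.get()), len(set_b.get()))   # 3 5
set_a.set_skip_initial_line(False)
print(len(set_a.get()))                     # 5
```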

The commit 9c0c949 is working perfectly for our needs.

smerckel / dbdreader

Feature: optionally include first record in data payload #18