tomeichlersmith / hps


Occasionally Corrupted mille data files #12

Closed tomeichlersmith closed 1 year ago

tomeichlersmith commented 1 year ago

Noticed while trying to read a large batch (200) of mille data files. Two of them were corrupted and produced an "unexpected EoF" error.

  Record      2200000
 PEREAD: file          121 read the first time, found       19491  records
readC: problem with stream or EOF skipping doubles
 Read error for binary Cfile         122 record       16443 :         -32
STOP PEREAD: stopping due to read errors

Going to use this issue to keep notes on why this is happening.

tomeichlersmith commented 1 year ago

PF has only seen this when there are low-quality tracks - make sure we are selecting high-quality tracks.

Look into dumping contents of binary. Isolate it and study the error in more detail.

tomeichlersmith commented 1 year ago

Out of the 200 partitions in an ideal FEE MC run, there are 4 that I had to delete due to this error.

ls /sdf/group/hps/mc/4pt55GeV/fee/idealCond/fee_recon_20um120nA*slcio > 2019-fee-mc-ideal-conditions.list

The four are below.

Partition | hps-mc job index
--- | ---
101 | 2
128 | 31
156 | 62
164 | 71
tomeichlersmith commented 1 year ago

I copied down the readMilleBinary.py script and uncommented the printout when an End-of-File exception is thrown. None of the four data files produce an EoF exception, but their text outputs are all smaller than the original binary files. This makes me think readMilleBinary.py is not doing the same thing as the reading fortran code and is instead quietly exiting early when a corrupted event is reached.

Never mind - I see the same behavior with a nominally uncorrupted data file as well.

tomeichlersmith commented 1 year ago

The seemingly newer readMilleBinary.py inside the MillepedeII source is more robust and actually displays a difference between the corrupted and uncorrupted data files.

An uncorrupted data file is expanded into an O(4)-times-larger txt file. It ends with a simple status message showing that it closed the file after successfully reading a certain number of records.

==> fee_recon_20um120nA_1_mille.bin.txt <==
 local   [1, 24, 26, 28]
 local   [0.03629401698708534, 0.1352861225605011, -0.13969768583774567, 0.00441156467422843]
 -l- meas.  39 27 3 0 0.0 0.000150372943608
 local   [27, 29, 31]
 local   [0.004409731365740299, -0.13678894937038422, 0.1323792189359665]
 -l- meas.  40 1 4 0 0.0 0.000150435473188
 local   [1, 26, 28, 30]
 local   [0.03631870448589325, 0.00441156467422843, -0.13684581220149994, 0.13243424892425537]

 end of file after 19280 records

The corrupted data files are expanded into larger txt files but also display errors similar to the ones seen by readC within pede:

101

==> fee_recon_20um120nA_101_mille.bin.txt <==
 local   [1, 24, 26, 28]
 local   [0.03622275963425636, 0.13570138812065125, -0.14012131094932556, 0.004419930279254913]
 -l- meas.  39 27 3 0 0.0 0.000145516503835
 local   [27, 29, 31]
 local   [0.004418243188410997, -0.13633759319782257, 0.13191935420036316]
 -l- meas.  40 1 4 0 0.0 0.000145572077599
 local   [1, 26, 28, 30]
 local   [0.03625506907701492, 0.004419930279254913, -0.13638968765735626, 0.13196974992752075]

 >>> error: end of file before end of record 16443

128

==> fee_recon_20um120nA_128_mille.bin.txt <==
 local   [1, 24, 26, 28]
 local   [0.036268964409828186, 0.1360529065132141, -0.14046941697597504, 0.004416505806148052]
 -l- meas.  37 27 3 0 0.0 0.000148618026287
 local   [27, 29, 31]
 local   [0.004412148147821426, -0.135076105594635, 0.1306639462709427]
 -l- meas.  38 1 4 0 0.0 0.000148764796904
 local   [1, 26, 28, 30]
 local   [0.036314819008111954, 0.004416505806148052, -0.13520939648151398, 0.13079288601875305]

 >>> error: end of file before end of record 14477

156 and 164 don't have a helpful error message, but they do have records with no measurements right before the script errors out. Not sure if that's related.

156

==> fee_recon_20um120nA_156_mille.bin.txt <==
 local   [0.010802353732287884, -0.14640052616596222, 0.13559816777706146]
 -l- meas.  34 1 4 0 0.0 0.000138482966577
 local   [1, 14, 16, 18]
 local   [0.015491077676415443, 0.010803863406181335, -0.14642095565795898, 0.13561709225177765]
 === NR  14733 0
 === NR  14734 8
 -l- meas.  1 19 3 0 2.94272677508e-44 2.24207754292e-44
 local   [19, 21, 23]
 local   [0.0, 0.0, 1.401298464324817e-45]
 -l- meas.  2 1 1 0 2.52233723578e-44
Traceback (most recent call last):
  File "MillepedeII/tools/readMilleBinary.py", line 152, in <module>
    print ' -l- meas. ', nh, inder[ja + 1], jb - ja - 1, i - jb, glder[ja], glder[jb]
IndexError: array index out of range

164

==> fee_recon_20um120nA_164_mille.bin.txt <==
 local   [0.004621860571205616, -0.1373072862625122, 0.13268543779850006]
 -l- meas.  40 1 4 0 0.0 0.000135057009174
 local   [1, 26, 28, 30]
 local   [0.03470281884074211, 0.004624072927981615, -0.13737303018569946, 0.13274896144866943]
 === NR  16400 0
 === NR  16401 8
 -l- meas.  1 19 3 0 2.94272677508e-44 2.24207754292e-44
 local   [19, 21, 23]
 local   [0.0, 0.0, 1.401298464324817e-45]
 -l- meas.  2 1 1 0 2.52233723578e-44
Traceback (most recent call last):
  File "MillepedeII/tools/readMilleBinary.py", line 152, in <module>
    print ' -l- meas. ', nh, inder[ja + 1], jb - ja - 1, i - jb, glder[ja], glder[jb]
IndexError: array index out of range
tomeichlersmith commented 1 year ago

I dug up the original files produced by the hps-java run during batch processing for partitions 101 and 128 and those files are not corrupted. This is similar to what I've seen at UMN with HDFS - sometimes a copy from one filesystem to another fails or ends early due to a network hiccup. We can prevent this issue by making the cp extra safe and checking that the destination file is identical to the original after the copy supposedly completes.

A safe-cp implemented in bash that I wrote a while ago: https://gitlab.com/tbeichlersmith/config/-/blob/main/.bash_aliases.d/safe.sh
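For illustration, the same idea as a minimal Python sketch: copy, then verify a checksum of the destination against the source. The function names are mine, not from the linked script.

```python
# Sketch of a checksum-verified copy. Assumption: a mismatch after the
# copy means the transfer was interrupted or corrupted, so we raise.
import hashlib
import shutil

def sha256sum(path, chunk=1 << 20):
    """Hash a file in chunks so large mille binaries don't fill memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def safe_cp(src, dst):
    """Copy src to dst, then verify the destination matches the source."""
    shutil.copyfile(src, dst)
    if sha256sum(src) != sha256sum(dst):
        raise IOError('copy %s -> %s is corrupted' % (src, dst))
```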

tomeichlersmith commented 1 year ago

The error logs from the hps-java jobs that produced partitions 156 and 164 have a warning that is not in most of the other error logs from the jobs without corrupted bin files.

org.hps.recon.tracking.kalman.StateVector:WARNING StateVector:smooth, inversion of the covariance matrix failed

I can still find it in non-corrupted job error logs, so this isn't a silver bullet.

tomeichlersmith commented 1 year ago

Just to double check, I made sure that partitions 156 and 164 did not suffer from the same scratch copying error that partitions 101 and 128 did. The scratch and output directory data files for partitions 156 and 164 are byte-wise identical and both produce the same error when parsed by the readMilleBinary.py script.
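A byte-wise comparison like this can be done from Python with filecmp (a minimal sketch; the helper name is illustrative, not what I actually ran):

```python
# shallow=False forces filecmp to compare file contents byte by byte
# instead of shortcutting on matching os.stat signatures.
import filecmp

def files_identical(a, b):
    return filecmp.cmp(a, b, shallow=False)
```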

tomeichlersmith commented 1 year ago

I've updated the readMilleBinary.py script to give a little more information about the error we are seeing.

156

 === NR  14733 0
 === NR  14734 8
 -l- meas.  1 19 3 0 2.94272677508e-44 2.24207754292e-44
 local   [19, 21, 23]
 local   [0.0, 0.0, 1.401298464324817e-45]

 >>> error: badly formatted record
Traceback (most recent call last):
  File "MillepedeII/tools/readMilleBinary.py", line 157, in <module>
    glder_vb = glder[jb]
IndexError: array index out of range

164

 === NR  16400 0
 === NR  16401 8
 -l- meas.  1 19 3 0 2.94272677508e-44 2.24207754292e-44
 local   [19, 21, 23]
 local   [0.0, 0.0, 1.401298464324817e-45]

 >>> error: badly formatted record
Traceback (most recent call last):
  File "MillepedeII/tools/readMilleBinary.py", line 157, in <module>
    glder_vb = glder[jb]
IndexError: array index out of range

Both end up with the glder array being shorter than the format expects. I'm guessing glder is short for "global derivatives".

tomeichlersmith commented 1 year ago

Mille CFile Record

A single track record as deduced from the mille binary python reading script.

The first 32-bit word is "length": its highest bit is 0 to signal that float (32-bit) measurements were used and 1 to signal double (64-bit) measurements. The lower 31 bits encode nr, which I think is the number of measurements.

Then we read an array of length nr of float (or double) measurements into glder.

Then an array of length nr of 32-bit integers into inder.
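A minimal sketch of reading one such record under these assumptions (the debug dumps elsewhere in this thread show length = 17 with nr = 8, so I take nr = |length| // 2; little-endian words are assumed, and the names are illustrative rather than the actual readMilleBinary.py code):

```python
# Illustrative reader for one Mille C-file record, following the format
# described above. Assumptions: little-endian words, nr = |length| // 2.
import array
import struct

def read_record(f):
    """Return (glder, inder) for the next record, or None at end of file."""
    word = f.read(4)
    if len(word) < 4:
        return None  # clean end of file
    (length,) = struct.unpack('<i', word)
    use_double = length < 0  # highest bit set signals 64-bit measurements
    nr = abs(length) // 2
    glder = array.array('d' if use_double else 'f')
    glder.fromfile(f, nr)  # raises EOFError if the record is truncated
    inder = array.array('i')
    inder.fromfile(f, nr)
    return glder, inder
```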

tomeichlersmith commented 1 year ago

Adding even more detail to the printouts makes me even more confident that the 156 and 164 files are generated by the same bug.

Every single value in the record after the zero-measurement record is the same between the two partitions. I'm guessing this is because they are the same bits being mis-interpreted, since there shouldn't be zero-measurement records.

156

 === NR  14733
f loc: 51694624
 nr = 0
len(glder) = 0
len(inder) = 0
f loc: 51694628
 === NR  14734
f loc: 51694628
 nr = 8
len(glder) = 8
len(inder) = 8
f loc: 51694696
nh = 1 i = 5 jb = 5 ja = 1
 -l- meas.  1 19 3 0 2.94272677508e-44 2.24207754292e-44
 local   [19, 21, 23]
 local   [0.0, 0.0, 1.401298464324817e-45]
nh = 2 i = 8 jb = 8 ja = 6

 >>> error: badly formatted record
Traceback (most recent call last):
  File "MillepedeII/tools/readMilleBinary.py", line 173, in <module>
    glder_vb = glder[jb]
IndexError: array index out of range
length = array('i', [17])
glder = array('f', [2.6624670822171524e-44, 2.942726775082116e-44, 0.0, 0.0, 1.401298464324817e-45, 2.2420775429197073e-44, 2.5223372357846707e-44, 2.802596928649634e-44])
inder = array('i', [0, 0, 19, 21, 23, 0, 0, 1])

164

 === NR  16400
f loc: 57543132
 nr = 0
len(glder) = 0
len(inder) = 0
f loc: 57543136
 === NR  16401
f loc: 57543136
 nr = 8
len(glder) = 8
len(inder) = 8
f loc: 57543204
nh = 1 i = 5 jb = 5 ja = 1
 -l- meas.  1 19 3 0 2.94272677508e-44 2.24207754292e-44
 local   [19, 21, 23]
 local   [0.0, 0.0, 1.401298464324817e-45]
nh = 2 i = 8 jb = 8 ja = 6

 >>> error: badly formatted record
Traceback (most recent call last):
  File "MillepedeII/tools/readMilleBinary.py", line 173, in <module>
    glder_vb = glder[jb]
IndexError: array index out of range
length = array('i', [17])
glder = array('f', [2.6624670822171524e-44, 2.942726775082116e-44, 0.0, 0.0, 1.401298464324817e-45, 2.2420775429197073e-44, 2.5223372357846707e-44, 2.802596928649634e-44])
inder = array('i', [0, 0, 19, 21, 23, 0, 0, 1])
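As a side check of the mis-interpretation guess (my own sketch, not part of readMilleBinary.py): the odd glder values like 2.94272677508e-44 are subnormal floats whose 32-bit patterns are small integers, i.e. integer words being read as floats.

```python
# Reinterpret the bits of the suspicious float32 values from the dumps
# above as signed integers.
import struct

def float_bits_as_int(x):
    """Round x to float32 and reinterpret its 4 bytes as a signed int32."""
    return struct.unpack('<i', struct.pack('<f', x))[0]

# each of these subnormal values is just a small integer bit pattern
print(float_bits_as_int(2.942726775082116e-44))   # 21
print(float_bits_as_int(2.2420775429197073e-44))  # 16
print(float_bits_as_int(1.401298464324817e-45))   # 1
```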
tomeichlersmith commented 1 year ago

Rerunning the java step manually for partitions 156 and 164 did not reproduce the corrupted binary file.

I am guessing this means that the corruption of the binary file happened outside the control of the java program and/or the GBL C library. Possible explanations include:

cbravo135 commented 1 year ago

Thanks for looking into this more. I think your solution of just rerunning these partitions is reasonable, but we need a nice way of finding the corrupted files and deleting them. hps-mc has a way to only submit jobs with missing output files, so after you remove the corrupted files this could be used to easily submit the jobs that are needed.
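One way to find the corrupted files could be a small scanner that tries to walk every record and flags files that end mid-record. A sketch assuming the C-file record layout discussed above (little-endian, nr = |length| // 2); the glob pattern is illustrative:

```python
# Sketch of a corruption scanner for mille C binary files.
import array
import glob
import struct

def is_corrupted(path):
    """True if the file ends in the middle of a record."""
    try:
        with open(path, 'rb') as f:
            while True:
                word = f.read(4)
                if not word:
                    return False  # clean end of file
                if len(word) < 4:
                    return True   # truncated length word
                (length,) = struct.unpack('<i', word)
                nr = abs(length) // 2
                glder = array.array('d' if length < 0 else 'f')
                glder.fromfile(f, nr)  # EOFError if truncated mid-record
                inder = array.array('i')
                inder.fromfile(f, nr)
    except EOFError:
        return True

for path in sorted(glob.glob('fee_recon_20um120nA_*_mille.bin')):
    if is_corrupted(path):
        print('corrupted:', path)
```

After deleting the flagged files, the hps-mc resubmit-missing-output mechanism could pick up the rest.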