Open jodavies opened 8 years ago
Hi Josh,
For now it does not ring a bell. Problem with retrying is that compressing/decompressing is something with memory. That means you cannot jump into the middle, and you can also not just retry. You would have to start from the ‘beginning’ again. This could be possible only if the compression would be arranged as separate entities and not as one complete object for a whole patch. Unless of course the error occurs at the start of a patch.
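For illustration, here is a minimal sketch of why a mid-stream retry is not possible with zlib's streaming API (an illustration only, not FORM's actual code): inflate() keeps its history inside the z_stream, so every call depends on all bytes fed in before it, and after an error one can only inflateReset() and decompress the patch again from its first byte.

/* Minimal sketch (not FORM's code): zlib inflation is stateful. */
#include <string.h>
#include <zlib.h>

int inflate_whole_patch(const unsigned char *src, size_t srclen,
                        unsigned char *dst, size_t dstlen)
{
    z_stream zsp;
    memset(&zsp, 0, sizeof(zsp));
    if (inflateInit(&zsp) != Z_OK) return -1;
    zsp.next_in   = (unsigned char *)src;
    zsp.avail_in  = (uInt)srclen;
    zsp.next_out  = dst;
    zsp.avail_out = (uInt)dstlen;
    int zerror = inflate(&zsp, Z_FINISH);
    if (zerror != Z_STREAM_END) {
        /* Re-issuing inflate() here cannot help: the internal
         * dictionary is already out of sync with the input.  The
         * only recovery is to seek back to the start of the patch
         * and decompress it again from the 'beginning'. */
        inflateEnd(&zsp);
        return -1;
    }
    return inflateEnd(&zsp) == Z_OK ? 0 : -1;
}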
How often has this occurred to you?
There is a still undocumented feature (I have not yet tested it. It was built by Ali Mirsoleimani) to use BZIP2 instead of GZIP. This could give shorter output and depending of the level it could also be a little bit faster (definitely compared to GZIP n with n >6). If you try this, let me know the results.
Jos
On 20 May 2016, at 12:23, jodavies notifications@github.com wrote:
Several times, I have seen errors like
1FillInputGZIP: Error in gzip handling of input.
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 124F80 bytes
Program terminating in thread 1 at replace Line 18 -->
They are not reproducible; generally I can re-run the calculation and everything goes through without problems. It is hard to know what is to blame here, perhaps they are genuine read errors from the disk (although I have seen this issue on more than one machine).
I don't know if there would be much point, upon receiving an error from zlib, retrying the inflate?
Have you seen this before? Andreas says he does not recall ever seeing a crash like this. Not a very useful report I suppose, but I thought I would mention it in case you have any ideas.
I have seen this maybe 5 times over the last few months. They were all in cases with rather large (O(TB)) files.
It looks like the branch with the bzip code is too old to run forcer so I guess that would need to be merged, or, I could set up some tests with mincer.
Out of interest, are the patches in the large buffer compressed, or just after sorting them and writing to disk? (Another thing one could try is to disable FORM's zlib compression entirely and leave it all to the btrfs filesystem).
Hi Josh,
gzip compression is only used for the sort files. In the large buffer only the standard FORM compression is used, which is something rather simple.
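To sketch the general idea of that simple compression (a hypothetical illustration of prefix compression against the previous term, not FORM's actual ComPress code): consecutive sorted terms share a long common prefix, so a term can be stored as the count of words it shares with its predecessor, followed by only the differing tail.

#include <stddef.h>

typedef int WORD;

/* Writes [shared-count][tail...] into out; returns words written.
 * 'prev' must hold the previous (uncompressed) term. */
size_t compress_against_prev(const WORD *prev, size_t prevlen,
                             const WORD *term, size_t termlen,
                             WORD *out)
{
    size_t shared = 0;
    while (shared < prevlen && shared < termlen
           && prev[shared] == term[shared])
        shared++;
    out[0] = (WORD)shared;                 /* prefix length marker */
    for (size_t i = shared; i < termlen; i++)
        out[1 + i - shared] = term[i];     /* only the tail        */
    return 1 + termlen - shared;
}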
Jos
I have also seen, twice, errors like
Time = 108303.46 sec Generated terms = 71246936
d5d90522070 Terms left = 70733861
dotrewrite-d89-0-25 Bytes used =15614237544
Read error in SetScratch
Program terminating in thread 3 at replace Line 18 -->
It is possible that these are real hardware errors. I will try to investigate whether this is the case.
Would you anticipate any issues with the following setup parameters? I run a very large TermsInSmall because otherwise the serial sections of the reduction become very slow, with much master sorting into a .sor file: a smaller TermsInSmall limit is hit very often and produces a lot of patches.
I also allow very large numbers of LargePatches and FilePatches. For disk capacity reasons, it is important not to go into stage4 sorting.
Perhaps at some point some of these parameters are multiplied together and could exceed a 32-bit int? I don't know how such errors would manifest. (See the sketch after the settings list below.)
* 192GB
#: MaxTermSize 300K
#: WorkSpace 400M
#: LargeSize 44G
#: SmallSize 6G
#: ScratchSize 16G
#: HideSize 8G
#: TermsInSmall 50M
#: LargePatches 16384
#: FilePatches 16384
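To illustrate the class of problem asked about above (a hypothetical toy, not a location in FORM's source): a product such as FilePatches times a patch-sized byte count already exceeds 2^31 for these settings, and silently wraps if computed in 32-bit arithmetic.

/* Hypothetical toy of the suspected failure class: a product of
 * setup parameters computed in 32-bit arithmetic wraps silently,
 * while the 64-bit product is what was actually intended. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t filepatches = 16384;
    uint32_t termsize    = 300 * 1024;         /* MaxTermSize 300K      */
    uint32_t wrapped = filepatches * termsize; /* mod 2^32: wrong value */
    uint64_t full    = (uint64_t)filepatches * termsize;
    printf("32-bit product: %u\n", (unsigned)wrapped);
    printf("64-bit product: %llu\n", (unsigned long long)full);
    return 0;
}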
Hi Josh,
A good setting for TermsInSmall is SmallSize/(typical term size). In your crash that is, in FORM-compressed form, about 200 bytes; in reality the terms in the small buffer are not compressed, hence about 400 bytes is more realistic. 6 Gbytes for 50 Mterms is 120 bytes per term, so the byte limit of the small buffer is reached before the term count is, and there should be no problem there.
It is possible that the above is a very rare error in FORM, of course. I have had some of that in the past: only in some very special cases something went wrong, and only when things were very big. And then, when this happens with tform, it may occur only once out of many runs. This is very hard to debug, as you can imagine.
The setup parameters look rather fine to me. If the ratio between LargeSize and SmallSize is not very large, you may not need so many LargePatches, unless sorting the small buffer consistently gives a very large collapse in the number of terms. This is different with the file patches. But indeed, it is important to avoid stage4, because that is never fast (I have been in stage5 only once, but it worked).
If you can find out more about these crashes, maybe we can sooner or later make them deterministic and have a chance to find the cause.
Jos
I have exactly the same issue (with default form settings), and since it is reproducible for me I did a bisection, from a good 4.1 tag and a bad HEAD, to track it down to the commit "Fixed the compress/gzip/hide bug." (43a5b1ea18006d851d12d321a408d6516d2f0da6). Hope this helps!
Cheers, Tobias
@burp , are you able to provide a script for your reproducible example, if it is small and crashes quickly?
I have a calculation which has crashed the last 4 times I have run it, but it produces 2TB of scratch files and takes more than a day to crash.
It crashes within 1-2 minutes, that's why I was able to do a quick bisection. Unfortunately it is part of a bigger setup with tons of includes etc. I will try to boil it down to one simple script/file.
@burp , were you able to create a simple example which crashes like this in the end?
I have had this crash on a second machine now, making genuine disk errors less likely.
Should we also have && fi->zsp != 0 in the if statement at line 1266 of sort.c, in the function Sflush?
@jodavies: Have not had time for it yet; maybe I can just send you the setup in private. You will have to adjust a few include paths etc., but it's probably simpler for me than trying to minimize the reproducible example.
That works, if it is OK with you!
Probably it is best if you can send Jos a copy also, he is much more likely to be able to find possible bugs.
I had a quick look at this example. With GZIPDEBUG enabled at various places, during normal operation, one sees output like the following,
Preparing z-stream 0 with compression 1
...
Preparing z-stream 4 with compression 1
-+Reading 160000 bytes in stream 0 at position 0; stop at 5320619
Want to read in stream 0 at position 160000
--Reading 160000 bytes in stream 0 at position 160000
...
Closing stream 0
--Reading 160000 bytes in stream 1 at position 5480619
...
Closing stream 1
--Reading 160000 bytes in stream 2 at position 10310142
...
...
Closing stream 4
etc
In the example which crashes, we find
Preparing z-stream 0 with compression 1
Preparing z-stream 1 with compression 1
-+Reading 160000 bytes in stream 0 at position 0; stop at 5486806
--Reading 160000 bytes in stream 0 at position 160000
...
Closing stream 0
-Last words: 1 1 2 -3 0
zerror = -2 in stream 0. At position 5486806
It seems like FORM is trying to read from stream 0, which was already "closed"?
The offending call of PutIn() is the one below the label NextTerm: in sort.c. It seems ki has the value 0 when it should be 1? This crash does not occur if one changes MaxTermSize (and thus the buffer size passed to FillInputGZIP).
The crash scenario seems to be:
- "Closing stream 0"
- PutIn (from sort.c:4068) and then FillInputGZIP with the same stream number
- the if branch in compress.c:490
- the else branch in compress.c:497; obtain value toread = 0
- compress.c:545; get zerror = -2
EDIT: there are no valgrind errors
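For reference, the zerror values quoted throughout this thread are zlib's standard status codes (the definitions below are from zlib.h):

/* zlib status codes matching the zerror values in this thread: */
#define Z_OK            0
#define Z_STREAM_END    1   /* zerror = 1: clean end of a stream */
#define Z_STREAM_ERROR (-2) /* zerror = -2: inconsistent stream state,
                               e.g. a z_stream used after inflateEnd() */
#define Z_DATA_ERROR   (-3) /* zerror = -3: corrupted or invalid input */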
@jodavies Is there any public test code to reproduce the bug?
I am using the code from @burp , of which I think Jos also has a copy. I have not been able to reduce it to a particularly minimal crashing example.
I have made a nice example which demonstrates this crash. The script is
#-
* Try to find something which causes the gzip crash. It seems to be due to some
* tricky combination of number of patches or buffer sizes etc.
* We make an expression with an increasing number of terms until it crashes.
* Buffer settings, smaller than default. Hopefully we can hit the error with smaller expressions?
* Divide everything by 8
#: filepatches 32
#: largepatches 32
#: largesize 6250000
#: maxtermsize 1250
#: smallsize 1250000
#: smallextension 2500000
#: termsinsmall 12500
Off Statistics;
Symbol x,n;
#message terms = `NTERMS'
Local test = sum_(n,1,`NTERMS',x^n);
.sort
* Read and write terms
.sort
* Check all terms present
Identify x^n?pos_ = n;
.sort
Local test = test - `NTERMS'*(`NTERMS'+1)/2;
Print;
.end
I run this as an array task on a cluster, to scan over NTERMS. I grep the log file for "test = 0;" and keep a copy of failures.
I am using the current git version of form, with GZIPDEBUG enabled in sort.c and compress.c. These examples crash with 4.2.0 and 4.1 also. EDIT: an old FORM 4.0 binary on our network does not crash; I don't know whether FORM 4.0 supported GZIP or not.
I have found the following crashing examples in the range 250K to 500K:
266906
279406
291906
304406
316906
329406
341906
354406
366906
379406
391906
404406
416906
429406
441906
454406
466906
479406
491906
(generated by 254406 + 12500 n)
(same setup, maxtermsize = 1300: same as above)
(same setup, termsinsmall = 25000: crashes at 254246 + 12500 n)
with output like
FORM 4.2.0 (Sep 21 2017, v4.2.0-16-g480a787-dirty) 64-bits Run: Thu Sep 21 14:31:21 2017
#-
~~~terms = 266906
MergePatches created output file /formswap/1286300.266000.moon2/xformxxx.sor
Writing 100000 bytes at 0: -81 -31 -81 -29 111
Writing 100000 bytes at 100000: -105 -16 -105 -15 87
Writing 100000 bytes at 200000: -113 -65 -128 -1 15
Writing 100000 bytes at 300000: -4 105 -4 25 -4
Writing 100000 bytes at 400000
Writing 96024 bytes at 500000
Last bytes written: 63 1 11 -116 70
Perceived position in FlushOutputGZIP is 503976
Writing 66334 bytes at 503976
Last bytes written: 0 -45 -119 81 -114
Perceived position in FlushOutputGZIP is 537642
EndSort: fPatchN = 2, lPatch = 2, position = 537642
fPatchesStop[0] = 503976
fPatchesStop[1] = 537642
Preparing z-stream 0 with compression 1
Preparing z-stream 1 with compression 1
-+Reading 100000 bytes in stream 0 at position 0; stop at 503976
read: 100000 +Last bytes read: -81 -31 -81 -29 111 in /formswap/1286300.266000
.moon2/xformxxx.sor, newpos = 100000
-+Reading 33666 bytes in stream 1 at position 503976; stop at 537642
read: 33666 +Last bytes read: 0 -45 -119 81 -114 in /formswap/1286300.266000.m
oon2/xformxxx.sor, newpos = 537642
Want to read in stream 0 at position 100000
--Reading 100000 bytes in stream 0 at position 100000
Last bytes read: -105 -16 -105 -15 87
Want to read in stream 0 at position 200000
--Reading 100000 bytes in stream 0 at position 200000
Last bytes read: -113 -65 -128 -1 15
Want to read in stream 0 at position 300000
--Reading 100000 bytes in stream 0 at position 300000
Last bytes read: -4 105 -4 25 -4
Want to read in stream 0 at position 400000
--Reading 100000 bytes in stream 0 at position 400000
Last bytes read: -2 4 -2 36 -2
Want to read in stream 0 at position 500000
--Reading 3976 bytes in stream 0 at position 500000
Last bytes read: 63 1 11 -116 70
zerror = 1 in stream 0. At position 503976
Closing stream 0
-Last words: 250240 1 1 3 0
zerror = 1 in stream 1. At position 537642
Closing stream 1
-Last words: 266906 1 1 3 0
zerror = -2 in stream 1. At position 537642
FillInputGZIP: Error in gzip handling of input. zerror = -2
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 186A0 bytes
Program terminating at test.frm Line 21 -->
0.34 sec out of 0.43 sec
CleanUpSort removed file /formswap/1286300.266000.moon2/xformxxx.sor
I hope these "small-ish" and quickly running examples are useful.
Changing >= S->pStop[ki] to > S->pStop[ki] at https://github.com/vermaseren/form/blob/480a7879fb33b7a9e9233c8e1c2af209fa465332/sources/sort.c#L4065 appears to fix my example in the post above. It doesn't, however, fix @burp's example. Making the same change at https://github.com/vermaseren/form/blob/480a7879fb33b7a9e9233c8e1c2af209fa465332/sources/sort.c#L3996 also does not fix it.
Perhaps it fixes my example "by accident"...
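As a self-contained toy of why such a one-character boundary change can matter (a sketch, not FORM's data structures): whether the stop pointer marks the last valid element or one past it decides which comparison loses the final term.

#include <stdio.h>

int main(void)
{
    int patch[5] = {10, 20, 30, 40, 50};
    int *pStop = &patch[4];   /* here: points AT the last element */
    /* Stopping on "p > pStop" (loop while p <= pStop): all 5 read. */
    for (int *p = patch; p <= pStop; p++) printf("%d ", *p);
    printf("\n");
    /* Stopping on "p >= pStop" (loop while p < pStop): 50 is lost. */
    for (int *p = patch; p < pStop; p++) printf("%d ", *p);
    printf("\n");
    return 0;
}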
I have just tried running this script with a form binary compiled --without-zlib (inspired by today's post in the forum). Again, I ran for NTERMS values 250k-500k. Now, form crashes like
test.frm Line 18 --> Warning: gzip compression not supported on this platform
~~~terms = 262751
EndSort: lPatch = 20, MaxPatches = 32,lFill = 7F7E1A95BD28, sSpace = 75069d, M
axTer = 1250, lTop = 7F7E1A97EFA8
MergePatches created output file /formswap/davies/xformxxx.sor
EndSort: fPatchN = 1, lPatch = 0, position = 6005772
EndSort+: fPatchN = 2, lPatch = 0, position = 6306040
Ran into precompressed term
Called from MergePatches with k = 2 (stream 1)
Called from EndSort
EndSort: sortfile /formswap/davies/xformxxx.sor removed
Program terminating at test.frm Line 34 -->
0.07 sec out of 0.07 sec
for EVERY value of NTERMS in the range [255823,262752].
As before, there are no valgrind errors before the crash.
If I use a binary with or without zlib, but run with Off compress;, I find no crashing examples.
The commit 26793e439166fe71722f7a59e3c2260801596ee8 is expected to fix the error like
FillInputGZIP: Error in gzip handling of input. zerror = -2
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 186A0 bytes
which was caused by calling inflateEnd() twice.
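A hedged sketch of the class of fix (the shape one would expect, not the literal commit): make the stream teardown idempotent, since inflateEnd() on an already-ended z_stream returns Z_STREAM_ERROR, which is exactly the -2 seen in the logs above.

#include <stdlib.h>
#include <zlib.h>

/* Close a decompression stream at most once.  The double call was
 * the bug: inflateEnd() on an already-ended stream yields -2. */
static void close_input_stream(z_stream **zspp)
{
    if (*zspp != 0) {      /* skip if this stream is already closed */
        inflateEnd(*zspp);
        free(*zspp);
        *zspp = 0;         /* so a second close becomes a no-op */
    }
}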
Excellent! I have re-tested the script in https://github.com/vermaseren/form/issues/95#issuecomment-331143602 , and everything seems OK for NTERMS in the range 250K to 500K.
Also @burp 's example now runs without crashing.
The test of https://github.com/vermaseren/form/issues/95#issuecomment-332194966 still fails.
I tested #214 also: still fails.
Thanks for the testing. Nice to hear that the GZIP crash is gone.
The crash with --without-zlib may be another bug.
I just tried to reduce numbers in https://github.com/vermaseren/form/issues/95#issuecomment-332194966. The following example crashes at N=323 if I use a 64-bit executable configured with --without-zlib:
#:filepatches 4
#:largesize 25600
#:maxtermsize 200
#:smallsize 12800
#:termsinsmall 16
#do N=320,330
#message N=`N'
S x,k;
L F = sum_(k,1,`N',x^k);
* size: (8 + 6 * (N - 1) + 1) * 4 = 24 * N + 12 (bytes)
.sort
Drop;
L CheckZero = F - {`N'*(`N'+1)/2};
id x^k?pos_ = k;
P;
.sort
Drop;
.sort
#enddo
.end
Ran into precompressed term
Called from MergePatches with k = 2 (stream 1)
Called from EndSort
Somehow I had Valgrind errors:
==18866== Memcheck, a memory error detector
==18866== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18866== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==18866== Command: ./vorm test
==18866==
FORM 4.2.0 (Dec 15 2017, v4.2.0-29-g26793e4) 64-bits Run: Fri Dec 22 19:01:37 2017
...
Time = 0.29 sec Generated terms = 323
F 1 Terms left = 323
Bytes used = 8004
Time = 0.29 sec
F Terms active = 323
Bytes used = 7692
==18866== Conditional jump or move depends on uninitialised value(s)
==18866== at 0x4DE9FF: Compare1 (sort.c:2547)
==18866== by 0x4E110A: MergePatches (sort.c:3889)
==18866== by 0x4E28EF: EndSort (sort.c:1066)
==18866== by 0x4B9241: Processor (proces.c:431)
==18866== by 0x436B1B: DoExecute (execute.c:838)
==18866== by 0x44CD2D: ExecModule (module.c:274)
==18866== by 0x4AEFCF: PreProcessor (pre.c:962)
==18866== by 0x4E76F1: main (startup.c:1607)
==18866==
==18866== Conditional jump or move depends on uninitialised value(s)
==18866== at 0x4DF518: Compare1 (sort.c:2552)
==18866== by 0x4E110A: MergePatches (sort.c:3889)
==18866== by 0x4E28EF: EndSort (sort.c:1066)
==18866== by 0x4B9241: Processor (proces.c:431)
==18866== by 0x436B1B: DoExecute (execute.c:838)
==18866== by 0x44CD2D: ExecModule (module.c:274)
==18866== by 0x4AEFCF: PreProcessor (pre.c:962)
==18866== by 0x4E76F1: main (startup.c:1607)
==18866==
==18866== Conditional jump or move depends on uninitialised value(s)
==18866== at 0x4DF543: Compare1 (sort.c:2930)
==18866== by 0x4E110A: MergePatches (sort.c:3889)
==18866== by 0x4E28EF: EndSort (sort.c:1066)
==18866== by 0x4B9241: Processor (proces.c:431)
==18866== by 0x436B1B: DoExecute (execute.c:838)
==18866== by 0x44CD2D: ExecModule (module.c:274)
==18866== by 0x4AEFCF: PreProcessor (pre.c:962)
==18866== by 0x4E76F1: main (startup.c:1607)
==18866==
Ran into precompressed term
Hi all,
I'm new to FORM and I ran into a similar error:
7FillInputGZIP: Error in gzip handling of input. zerror = -3
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 1E125C bytes
Program terminating in thread 7 at tensorreduction Line 48 -->
And also:
1FillInputGZIP: Error in gzip handling of input. zerror = -3
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 1E125C bytes
Program terminating in thread 1 at expandmomenta Line 76 -->
I'm using tform. To be more specific: TFORM 5.0.0-beta.1 (Mar 15 2024, v5.0.0-beta.1-42-g2663e14) 8 workers
I've been wondering if you might have any idea why this happens and if there's perhaps some sort of way to fix this.
Side note: another interesting error I've seen is:
Error while reading scratch file in GetTerm
Program terminating in thread 0 at tensorreduction Line 9 -->
Any idea what might be the cause?
Interesting, this one hasn’t come up for a while. How often do you see it? If it happens all the time for you, are you able to share the code which reproduces it?
It happens all the time. See the attached files (formfiles.zip); the wcdimension.dat that appears in the .frm file is empty.
This example is too heavy for debugging; it has run for >2 days (without any crash) so far. It is doing a lot of stagesorting, so maybe I can set up some stagesort-heavy tests to search for remaining bugs in the gzip system.
Interesting. How many cores are you using?
I was using 8, since you pasted your version as TFORM 5.0.0-beta.1 (Mar 15 2024, v5.0.0-beta.1-42-g2663e14) 8 workers. How long does it take to crash for you?
I have since run some stagesort-heavy tests with some simple scripts which generate lots of terms, with artificially small buffer sizes, and did not see any issues.
Some of the workers ran into this error after a couple of hours but some ran for around a day or so.
Do you see the problem with the debug build (tvorm)? If so, could you try either running with gdb and getting a stack trace at the crash, or running with this branch (you will need eu-addr2line installed)? https://github.com/jodavies/form/tree/backtrace
Also: what OS and compiler versions are you using?
Could you also define GZIPDEBUG in sort.c and compress.c, for your tests?
Do you see the problem with the debug build (tvorm)? If so, could you try either running with gdb and getting a stack trace at the crash, or running with this branch (you will need eu-addr2line installed)? https://github.com/jodavies/form/tree/backtrace
I'll try.
Also: what OS and compiler versions are you using?
Could you also define GZIPDEBUG in sort.c and compress.c, for your tests?
I'm using the HPC cluster of my university. The OS used is Linux Rocky 8, not sure about the compiler. And yeah i can do that.
Novice FORM user here, also getting this bug, in this case when my expression reached 1 GB in size:
Time = 1309.76 sec
JJJJJ Terms active = 154494505
Bytes used = 910409679
Time = 1349.28 sec
JJJJJ Terms active = 160670804
Bytes used = 946051262
Time = 1387.71 sec
JJJJJ Terms active = 166180424
Bytes used = 977206982
Time = 1423.05 sec
JJJJJ Terms active = 172187549
Bytes used = 1015540137
Time = 1427.72 sec
JJJJJ Terms active = 172799021
Bytes used = 995310080
FillInputGZIP: Error in gzip handling of input. zerror = -3
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading F2BA8 bytes
Program terminating at p5a9.frm Line 62 -->
1427.75 sec out of 1431.18 sec
If you can reproduce this reliably, could you share your code and FORM settings, so that I can try to investigate?
Yes can do, would you be able to email me then I can send the data. I'm at jamie.vicary@cl.cam.ac.uk.
I took another look at https://github.com/vermaseren/form/issues/95#issuecomment-353648021 since I wanted to determine if it is something that can happen also in zlib mode (with different buffer sizes or expressions etc?).
This branch adds a bunch of debug prints which you can use to compare the running with and without zlib: https://github.com/jodavies/form/tree/issue-95
I trimmed the test to:
#:filepatches 4
#:largesize 25600
#:maxtermsize 200
#:smallsize 12800
#:termsinsmall 16
S x,k;
L F = sum_(k,1,323,x^k);
.end
This crashes with the Ran into precompressed term error as far back as form 3.3.
The test is fixed by changing the ncomp arg from 1 to 0 in this PutOut, but I don't know if this is a fix or a "workaround": https://github.com/jodavies/form/blob/f17e16b6f89ea0576c12db7e93e5edb43f4da8d9/sources/sort.c#L1073
The particular situation is that the large buffer has been filled and there is a patch on the disk. Then there are some terms in the small buffer, but no large patches, and we finish generating terms (powers 321, 322 and 323 are in the small buffer).
EndSort then writes the small buffer terms into a file patch, before calling MergePatches to finish up. At this point the terms in the small buffer have been compressed already (by EndSort calling ComPress) and go through this PutOut above.
PutOut doesn't care that the terms are already compressed, since the output is not going to AR.outfile or AR.hidefile.
So far this is the same for zlib and non-zlib modes.
The difference is that without zlib, PutOut writes the first term that came from the small buffer in compressed form, so that when it is loaded again things go wrong. This seems wrong: as I understand it, the first term of a patch should be complete, and only the following terms are compressed. Indeed, this is how the terms arrive from the compressed small buffer.
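A sketch of the reading side under that invariant (a generic illustration, not FORM's PutIn): expanding a record of the form [shared][tail...] needs the previous term, so the first term of a patch must arrive with a shared count of zero, i.e. complete.

#include <stddef.h>

typedef int WORD;

/* Expand "[shared][tail...]" against the previous term.  If the
 * first record of a patch carries a nonzero shared count, there is
 * no previous term to copy from and the reader misinterprets the
 * data -- the "precompressed term" failure described above. */
size_t expand_against_prev(const WORD *prev, const WORD *rec,
                           size_t taillen, WORD *out)
{
    size_t shared = (size_t)rec[0];
    for (size_t i = 0; i < shared; i++)
        out[i] = prev[i];              /* copy the shared prefix */
    for (size_t i = 0; i < taillen; i++)
        out[shared + i] = rec[1 + i];  /* append the new tail    */
    return shared + taillen;
}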
Now I think I have it: https://github.com/vermaseren/form/blob/92f315454515d414e38101178e58178b34162e69/sources/sort.c#L1059-L1060 This reset of AR.CompressPointer is on the wrong side of the #ifdef.
Without zlib, this part of the code compresses the first term written out against whatever happened to be in the compression buffer previously.
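In sketch form (hypothetical, simplified names; not the literal sort.c code), the diagnosis is that the reset must be unconditional:

#include <string.h>

typedef int WORD;

struct sortstate {
    WORD compressbuf[1024];  /* previous term, for prefix compression */
};

/* Starting a new file patch: if the reset of the compress buffer
 * sits only inside the zlib branch, the non-zlib build compresses
 * the first term of the new patch against stale buffer contents,
 * breaking the "first term of a patch is complete" invariant. */
static void start_new_file_patch(struct sortstate *s, int withzlib)
{
    if (withzlib) {
        /* ... zlib-specific flushing of the output stream ... */
    }
    memset(s->compressbuf, 0, sizeof(s->compressbuf));  /* both builds */
}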