vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0

gzip errors #95

Open jodavies opened 8 years ago

jodavies commented 8 years ago

Several times, I have seen errors like

1FillInputGZIP: Error in gzip handling of input.
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 124F80 bytes
Program terminating in thread 1 at replace Line 18 -->

They are not reproducible: generally I can re-run the calculation and everything goes through without problems. It is hard to know what is to blame here. Perhaps they are genuine read errors from the disk, although I have seen this issue on more than one machine.

I don't know if there would be much point in retrying the inflate upon receiving an error from zlib?

Have you seen this before? Andreas says he does not recall ever seeing a crash like this. Not a very useful report I suppose, but I thought I would mention it in case you have any ideas.
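For readers decoding the message above: the numeric fields are printed in hexadecimal. A minimal sketch, assuming the RetCode is a 64-bit two's-complement value (the helper name to_signed is hypothetical and not part of FORM):

```python
def to_signed(hex_str: str, bits: int = 64) -> int:
    """Interpret a hexadecimal string as a two's-complement integer."""
    value = int(hex_str, 16)
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


# RetCode from the PutIn message: all bits set, i.e. -1 as a signed value.
retcode = to_signed("FFFFFFFFFFFFFFFF")
# Number of bytes PutIn was trying to read.
nbytes = int("124F80", 16)
print(retcode, nbytes)  # -1 1200000
```

Under that assumption the message reports a return code of -1 while attempting to read 1200000 bytes.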

vermaseren commented 8 years ago

Hi Josh,

For now it does not ring a bell. The problem with retrying is that compressing/decompressing is a stateful operation. That means you cannot jump into the middle, and you also cannot just retry: you would have to start from the ‘beginning’ again. This would be possible only if the compressed data were arranged as separate entities and not as one complete object for a whole patch, unless of course the error occurs at the start of a patch.
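The point about state can be illustrated with Python's zlib module rather than FORM's C code. This is only a sketch: the "terms" are made-up stand-in data, and inflate_in_chunks and inflate_from are hypothetical helper names. It shows that a stream inflates correctly when fed chunk by chunk to one decompressor object, while a fresh decompressor started in the middle of the same stream does not recover the data.

```python
import zlib

# Stand-in for one sorted patch: many short, similar "terms".
data = " ".join(f"x^{n}" for n in range(1, 5001)).encode()
compressed = zlib.compress(data, 1)


def inflate_in_chunks(blob: bytes, chunk: int = 256) -> bytes:
    """Inflate a whole stream piecewise, keeping one decompressor alive."""
    d = zlib.decompressobj()
    out = bytearray()
    for i in range(0, len(blob), chunk):
        out += d.decompress(blob[i:i + chunk])
    out += d.flush()
    return bytes(out)


def inflate_from(blob: bytes, offset: int):
    """Try to inflate starting at `offset` with a brand-new decompressor.

    Returns the inflated bytes, or None if zlib rejects the input.
    """
    fresh = zlib.decompressobj()
    try:
        return fresh.decompress(blob[offset:])
    except zlib.error:
        return None


assert inflate_in_chunks(compressed) == data
restart = inflate_from(compressed, len(compressed) // 2)
print("restart from the middle recovered the data:", restart == data)
```

A retry would therefore have to go back to the start of the compressed object, which is the whole patch in the arrangement described above.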

How often has this occurred to you?

There is a still-undocumented feature (I have not yet tested it; it was built by Ali Mirsoleimani) to use BZIP2 instead of GZIP. This could give shorter output and, depending on the level, it could also be a little faster (definitely compared to GZIP n with n > 6). If you try this, let me know the results.

Jos

jodavies commented 8 years ago

I have seen this maybe 5 times over the last few months. They were all in cases with rather large (O(TB)) files.

It looks like the branch with the bzip code is too old to run forcer, so I guess that would need to be merged. Alternatively, I could set up some tests with mincer.

Out of interest, are the patches in the large buffer compressed, or are they compressed only after sorting them and writing to disk? (Another thing one could try is to disable FORM's zlib compression entirely and leave it all to the btrfs filesystem.)

vermaseren commented 8 years ago

Hi Josh,

gzip compression is only used for the sort files. In the large buffer only the standard FORM compression is used, which is rather simple.

Jos

jodavies commented 8 years ago

I have also seen, twice, errors like

Time =  108303.46 sec    Generated terms =   71246936
             d5d90522070 Terms left      =   70733861
     dotrewrite-d89-0-25 Bytes used      =15614237544
Read error in SetScratch
Program terminating in thread 3 at replace Line 18 -->

It is possible that these are real hardware errors. I will try to investigate whether this is the case.

Would you anticipate any issues with the following setup parameters? I run a very large TermsInSmall because otherwise the serial sections of the reduction become very slow, with much master sorting into a .sor file: a smaller TermsInSmall limit is hit very often and produces a lot of patches.

I also allow very large numbers of LargePatches and FilePatches. For disk capacity reasons, it is important not to go into stage4 sorting.

Perhaps at some point some of these parameters are multiplied together and could exceed a 32-bit int? I don't know how such errors would manifest.

* 192GB
#: MaxTermSize 300K
#: WorkSpace 400M
#: LargeSize 44G
#: SmallSize 6G
#: ScratchSize 16G
#: HideSize 8G
#: TermsInSmall 50M
#: LargePatches 16384
#: FilePatches 16384

vermaseren commented 8 years ago

Hi Josh,

A good setting for TermsInSmall is SmallSize/(typical term size). In your crash the typical term size, in FORM-compressed form, is about 200 bytes. In reality the terms in the small buffer are not compressed, hence about 400 bytes is more realistic. 6 Gbytes / 50M terms is 120 bytes per term, hence there should be no problem there.

It is possible, of course, that the above is a very rare error in FORM. I have had some of those in the past: something went wrong only in some very special cases and only when things were very big. And then, when this happens with tform, it may occur only once out of many runs. This is very hard to debug, as you can imagine.

The setup parameters look rather fine to me. If the ratio between LargeSize and SmallSize is not very large, you may not need so many LargePatches, unless sorting the small buffer consistently gives a very large collapse in the number of terms. This is different with the file patches. But indeed, it is important to avoid stage4, because that is never fast (I have been in stage5 only once, but it worked).
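The TermsInSmall rule of thumb above can be checked with a few lines of arithmetic. This sketch assumes the G and M suffixes in the setup are decimal multipliers, which is consistent with the figure of 120 quoted above; the function names are illustrative only.

```python
SMALL_SIZE = 6 * 10**9        # SmallSize 6G (assumed decimal)
TERMS_IN_SMALL = 50 * 10**6   # TermsInSmall 50M (assumed decimal)


def bytes_per_term(small_size: int, terms_in_small: int) -> float:
    """Buffer space available per term if the term-count limit were reached."""
    return small_size / terms_in_small


def suggested_terms_in_small(small_size: int, typical_term_bytes: int) -> int:
    """Rule of thumb: SmallSize / (typical term size)."""
    return small_size // typical_term_bytes


print(bytes_per_term(SMALL_SIZE, TERMS_IN_SMALL))   # 120.0
print(suggested_terms_in_small(SMALL_SIZE, 400))    # 15000000
```

With terms of about 400 bytes, the byte limit of the small buffer would be reached at roughly 15M terms, well before the 50M term-count limit, so on this reading the TermsInSmall value is simply not the binding constraint.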

If you can find out more about these crashes, maybe we can sooner or later make them deterministic and have a chance to find the cause.

Jos

burp commented 8 years ago

I have exactly the same issue (with default FORM settings). Since it is reproducible for me, I did a bisection between a good 4.1 tag and a bad head, which tracked it down to commit 43a5b1ea18006d851d12d321a408d6516d2f0da6, "Fixed the compress/gzip/hide bug." Hope this helps!

Cheers, Tobias

jodavies commented 8 years ago

@burp, are you able to provide a script for your reproducible example, if it is small and crashes quickly?

I have a calculation which has crashed the last 4 times I have run it, but it produces 2TB of scratch files and takes more than a day to crash.

burp commented 8 years ago

It crashes within 1-2 minutes, that's why I was able to do a quick bisection. Unfortunately it is part of a bigger setup with tons of includes etc. I will try to boil it down to one simple script/file.

jodavies commented 8 years ago

@burp, were you able, in the end, to create a simple example which crashes like this?

I have had this crash on a second machine now, making genuine disk errors less likely.

Should we also have && fi->zsp != 0 in the if statement at line 1266 of sort.c, in the function Sflush?

burp commented 8 years ago

@jodavies: I have not had time for it yet; maybe I can just send you the setup in private. You will have to adjust a few include paths etc., but it's probably simpler for me than trying to minimize the reproducible example.

jodavies commented 8 years ago

That works, if it is OK with you!

Probably it is best if you can send Jos a copy also, he is much more likely to be able to find possible bugs.

jodavies commented 7 years ago

I had a quick look at this example. With GZIPDEBUG enabled in various places, one sees output like the following during normal operation:

Preparing z-stream 0 with compression 1
...
Preparing z-stream 4 with compression 1

-+Reading 160000 bytes in stream 0 at position          0; stop at    5320619
Want to read in stream 0 at position     160000
--Reading 160000 bytes in stream 0 at position     160000
...
Closing stream 0
--Reading 160000 bytes in stream 1 at position    5480619
...
Closing stream 1
--Reading 160000 bytes in stream 2 at position   10310142
...
...
Closing stream 4
etc

In the example which crashes, we find

Preparing z-stream 0 with compression 1
Preparing z-stream 1 with compression 1
-+Reading 160000 bytes in stream 0 at position          0; stop at    5486806
--Reading 160000 bytes in stream 0 at position     160000
...
Closing stream 0
  -Last words: 1 1 2 -3 0
 zerror = -2 in stream 0. At position    5486806

It seems like FORM is trying to read from stream 0, which was already "closed"?

The offending call of PutIn() is the one below the label NextTerm: in sort.c. It seems ki has the value 0 when it should be 1?

jodavies commented 7 years ago

This crash does not occur if one changes MaxTermSize (and thus the buffersize passed to FillInputGZIP).

The crash scenario seems to be:

EDIT: there are no valgrind errors

tueda commented 7 years ago

@jodavies Is there any public test code to reproduce the bug?

jodavies commented 7 years ago

I am using the code from @burp , of which I think Jos also has a copy. I have not been able to reduce it to a particularly minimal crashing example.

jodavies commented 6 years ago

I have made a nice example which demonstrates this crash. The script is

#-
* Try to find something which causes the gzip crash. It seems to be due to some
* tricky combination of number of patches or buffer sizes etc.
* We make an expression with an increasing number of terms until it crashes.

* Buffer settings, smaller than default. Hopefully we can hit the error with smaller expressions?
* Divide everything by 8
#: filepatches 32
#: largepatches 32
#: largesize 6250000
#: maxtermsize 1250
#: smallsize 1250000
#: smallextension 2500000
#: termsinsmall 12500

Off Statistics;

Symbol x,n;

#message terms = `NTERMS'
Local test = sum_(n,1,`NTERMS',x^n);
.sort

* Read and write terms
.sort

* Check all terms present
Identify x^n?pos_ = n;
.sort
Local test = test - `NTERMS'*(`NTERMS'+1)/2;

Print;
.end

I run this as an array task on a cluster, to scan over NTERMS. I grep the log file for "test = 0;" and keep a copy of failures.
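A scan of this kind could be driven by a small script instead of a cluster array task. The following is only a sketch under stated assumptions: the binary is called form, the script is saved as test.frm, and the preprocessor variable is passed on the command line with -d NTERMS=value (check this flag against your FORM version). All function names are hypothetical.

```python
import subprocess


def form_command(nterms: int, script: str = "test.frm", binary: str = "form"):
    """Build the command line for one run (the -d flag is an assumption)."""
    return [binary, "-d", f"NTERMS={nterms}", script]


def run_passed(log_text: str) -> bool:
    """A run counts as good if the final expression printed as zero."""
    return "test = 0;" in log_text


def scan(values, **kwargs):
    """Run FORM once per NTERMS value and return the values that failed."""
    failures = []
    for n in values:
        result = subprocess.run(
            form_command(n, **kwargs), capture_output=True, text=True
        )
        if not run_passed(result.stdout):
            failures.append(n)
    return failures


# Example (requires a FORM binary on the PATH):
# failures = scan(range(250_000, 500_001))
```

Keeping the pass/fail check as a plain string match mirrors the grep for "test = 0;" described above.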

I am using the current git version of FORM, with GZIPDEBUG enabled in sort.c and compress.c. These examples crash with 4.2.0 and 4.1 also. EDIT: an old FORM 4.0 binary on our network does not crash; I don't know whether FORM 4.0 supported GZIP.

I have found the following crashing examples in the range 250K to 500K:

266906
279406
291906
304406
316906
329406
341906
354406
366906
379406
391906
404406
416906
429406
441906
454406
466906
479406
491906

(generated by 254406 + 12500 n, for n = 1, ..., 19)
(same setup, maxtermsize = 1300: same as above)
(same setup, termsinsmall = 25000: crashes at 254246 + 12500 n)

with output like

FORM 4.2.0 (Sep 21 2017, v4.2.0-16-g480a787-dirty) 64-bits  Run: Thu Sep 21 14:31:21 2017
    #-
~~~terms = 266906
 MergePatches created output file /formswap/1286300.266000.moon2/xformxxx.sor
Writing 100000 bytes at          0: -81 -31 -81 -29 111
Writing 100000 bytes at     100000: -105 -16 -105 -15 87
Writing 100000 bytes at     200000: -113 -65 -128 -1 15
Writing 100000 bytes at     300000: -4 105 -4 25 -4
Writing 100000 bytes at     400000
Writing 96024 bytes at     500000
   Last bytes written: 63 1 11 -116 70
     Perceived position in FlushOutputGZIP is     503976
Writing 66334 bytes at     503976
   Last bytes written: 0 -45 -119 81 -114
     Perceived position in FlushOutputGZIP is     537642
 EndSort: fPatchN = 2, lPatch = 2, position =       537642
 fPatchesStop[0] =     503976
 fPatchesStop[1] =     537642
Preparing z-stream 0 with compression 1
Preparing z-stream 1 with compression 1
-+Reading 100000 bytes in stream 0 at position          0; stop at     503976
 read: 100000 +Last bytes read: -81 -31 -81 -29 111 in /formswap/1286300.266000
.moon2/xformxxx.sor, newpos =     100000
-+Reading 33666 bytes in stream 1 at position     503976; stop at     537642
 read: 33666 +Last bytes read: 0 -45 -119 81 -114 in /formswap/1286300.266000.m
oon2/xformxxx.sor, newpos =     537642
Want to read in stream 0 at position     100000
--Reading 100000 bytes in stream 0 at position     100000
   Last bytes read: -105 -16 -105 -15 87
Want to read in stream 0 at position     200000
--Reading 100000 bytes in stream 0 at position     200000
   Last bytes read: -113 -65 -128 -1 15
Want to read in stream 0 at position     300000
--Reading 100000 bytes in stream 0 at position     300000
   Last bytes read: -4 105 -4 25 -4
Want to read in stream 0 at position     400000
--Reading 100000 bytes in stream 0 at position     400000
   Last bytes read: -2 4 -2 36 -2
Want to read in stream 0 at position     500000
--Reading 3976 bytes in stream 0 at position     500000
   Last bytes read: 63 1 11 -116 70
 zerror = 1 in stream 0. At position     503976
Closing stream 0
  -Last words: 250240 1 1 3 0
 zerror = 1 in stream 1. At position     537642
Closing stream 1
  -Last words: 266906 1 1 3 0
 zerror = -2 in stream 1. At position     537642
FillInputGZIP: Error in gzip handling of input. zerror = -2
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 186A0 bytes
Program terminating at test.frm Line 21 --> 
  0.34 sec out of 0.43 sec
 CleanUpSort removed file /formswap/1286300.266000.moon2/xformxxx.sor

I hope these "small-ish" and quickly running examples are useful.
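The read sizes in the log above follow directly from the patch boundaries and the 100000-byte read size. The sketch below reproduces them; it is not FORM's code, and read_plan is a hypothetical helper.

```python
def read_plan(start: int, stop: int, bufsize: int):
    """Return (position, nbytes) pairs covering the byte range [start, stop)."""
    plan = []
    pos = start
    while pos < stop:
        n = min(bufsize, stop - pos)
        plan.append((pos, n))
        pos += n
    return plan


# fPatchesStop[0] and fPatchesStop[1] from the log.
patch_stops = [503976, 537642]

stream0 = read_plan(0, patch_stops[0], 100000)
stream1 = read_plan(patch_stops[0], patch_stops[1], 100000)
print(stream0[-1])  # (500000, 3976)
print(stream1)      # [(503976, 33666)]
```

After the final read of each stream the file position equals the stop value exactly (503976 and 537642), so any comparison of a position against a patch boundary has to treat the case of equality deliberately.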

jodavies commented 6 years ago

Changing >= S->pStop[ki] to > S->pStop[ki] at https://github.com/vermaseren/form/blob/480a7879fb33b7a9e9233c8e1c2af209fa465332/sources/sort.c#L4065 appears to fix my example in the post above. It doesn't, however, fix @burp's example. Making the same change at https://github.com/vermaseren/form/blob/480a7879fb33b7a9e9233c8e1c2af209fa465332/sources/sort.c#L3996 also does not fix it.

Perhaps it fixes my example "by accident"...

jodavies commented 6 years ago

I have just tried running this script with a form binary compiled --without-zlib. (Inspired by today's post in the forum).

Again, I ran for NTERMS values 250k-500k. Now, form crashes like

test.frm Line 18 --> Warning: gzip compression not supported on this platform
~~~terms = 262751
 EndSort: lPatch = 20, MaxPatches = 32,lFill = 7F7E1A95BD28, sSpace = 75069d, M
axTer = 1250, lTop = 7F7E1A97EFA8
 MergePatches created output file /formswap/davies/xformxxx.sor
 EndSort: fPatchN = 1, lPatch = 0, position =      6005772
 EndSort+: fPatchN = 2, lPatch = 0, position =      6306040
Ran into precompressed term
Called from MergePatches with k = 2 (stream 1)
Called from EndSort
EndSort: sortfile /formswap/davies/xformxxx.sor removed
Program terminating at test.frm Line 34 --> 
  0.07 sec out of 0.07 sec

for EVERY value of NTERMS in the range [255823,262752].

As before, there are no valgrind errors before the crash.

If I use a binary with or without zlib, but run with Off compress;, I find no crashing examples.

tueda commented 6 years ago

The commit 26793e439166fe71722f7a59e3c2260801596ee8 is expected to fix the error like

FillInputGZIP: Error in gzip handling of input. zerror = -2
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 186A0 bytes

which was caused by calling inflateEnd() twice.
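For reference, the zerror values in these logs are zlib return codes. The table below is copied from the constants defined in zlib.h; decode_zerror is a hypothetical helper for reading log lines, not something FORM provides.

```python
import re

# Return codes as defined in zlib.h.
ZLIB_RETURN_CODES = {
    0: "Z_OK",
    1: "Z_STREAM_END",
    2: "Z_NEED_DICT",
    -1: "Z_ERRNO",
    -2: "Z_STREAM_ERROR",
    -3: "Z_DATA_ERROR",
    -4: "Z_MEM_ERROR",
    -5: "Z_BUF_ERROR",
    -6: "Z_VERSION_ERROR",
}


def decode_zerror(line: str):
    """Extract 'zerror = N' from a log line and return the zlib constant name."""
    match = re.search(r"zerror = (-?\d+)", line)
    if match is None:
        return None
    return ZLIB_RETURN_CODES.get(int(match.group(1)), "unknown")


print(decode_zerror(" zerror = 1 in stream 0. At position     503976"))
print(decode_zerror("FillInputGZIP: Error in gzip handling of input. zerror = -2"))
```

The first line decodes to Z_STREAM_END, the normal end of a compressed stream, and the second to Z_STREAM_ERROR, which zlib documents as the result of an inconsistent stream state. That fits the explanation of inflateEnd() being called twice on the same stream.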

jodavies commented 6 years ago

Excellent! I have re-tested the script in https://github.com/vermaseren/form/issues/95#issuecomment-331143602 , and everything seems OK for NTERMS in the range 250K to 500K.

Also, @burp's example now runs without crashing.

The test of https://github.com/vermaseren/form/issues/95#issuecomment-332194966 still fails.

I tested #214 also: still fails.

tueda commented 6 years ago

Thanks for the testing. Nice to hear that the GZIP crash is gone.

The crash with --without-zlib may be another bug.

tueda commented 6 years ago

I just tried to reduce numbers in https://github.com/vermaseren/form/issues/95#issuecomment-332194966. The following example crashes at N=323 if I use a 64-bit executable configured with --without-zlib:

#:filepatches        4
#:largesize      25600
#:maxtermsize      200
#:smallsize      12800
#:termsinsmall      16

#do N=320,330
  #message N=`N'
  S x,k;
  L F = sum_(k,1,`N',x^k);
* size: (8 + 6 * (N - 1) + 1) * 4 = 24 * N + 12 (bytes)
  .sort
  Drop;
  L CheckZero = F - {`N'*(`N'+1)/2};
  id x^k?pos_ = k;
  P;
  .sort
  Drop;
  .sort
#enddo

.end

Ran into precompressed term
Called from MergePatches with k = 2 (stream 1)
Called from EndSort

Somehow I had Valgrind errors:

==18866== Memcheck, a memory error detector
==18866== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18866== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==18866== Command: ./vorm test
==18866== 
FORM 4.2.0 (Dec 15 2017, v4.2.0-29-g26793e4) 64-bits  Run: Fri Dec 22 19:01:37 2017

...

Time =       0.29 sec    Generated terms =        323
               F       1 Terms left      =        323
                         Bytes used      =       8004

Time =       0.29 sec
               F         Terms active    =        323
                         Bytes used      =       7692
==18866== Conditional jump or move depends on uninitialised value(s)
==18866==    at 0x4DE9FF: Compare1 (sort.c:2547)
==18866==    by 0x4E110A: MergePatches (sort.c:3889)
==18866==    by 0x4E28EF: EndSort (sort.c:1066)
==18866==    by 0x4B9241: Processor (proces.c:431)
==18866==    by 0x436B1B: DoExecute (execute.c:838)
==18866==    by 0x44CD2D: ExecModule (module.c:274)
==18866==    by 0x4AEFCF: PreProcessor (pre.c:962)
==18866==    by 0x4E76F1: main (startup.c:1607)
==18866== 
==18866== Conditional jump or move depends on uninitialised value(s)
==18866==    at 0x4DF518: Compare1 (sort.c:2552)
==18866==    by 0x4E110A: MergePatches (sort.c:3889)
==18866==    by 0x4E28EF: EndSort (sort.c:1066)
==18866==    by 0x4B9241: Processor (proces.c:431)
==18866==    by 0x436B1B: DoExecute (execute.c:838)
==18866==    by 0x44CD2D: ExecModule (module.c:274)
==18866==    by 0x4AEFCF: PreProcessor (pre.c:962)
==18866==    by 0x4E76F1: main (startup.c:1607)
==18866== 
==18866== Conditional jump or move depends on uninitialised value(s)
==18866==    at 0x4DF543: Compare1 (sort.c:2930)
==18866==    by 0x4E110A: MergePatches (sort.c:3889)
==18866==    by 0x4E28EF: EndSort (sort.c:1066)
==18866==    by 0x4B9241: Processor (proces.c:431)
==18866==    by 0x436B1B: DoExecute (execute.c:838)
==18866==    by 0x44CD2D: ExecModule (module.c:274)
==18866==    by 0x4AEFCF: PreProcessor (pre.c:962)
==18866==    by 0x4E76F1: main (startup.c:1607)
==18866== 
Ran into precompressed term

alonli1 commented 1 month ago

Hi all,

I'm new to FORM and I ran into a similar error:

7FillInputGZIP: Error in gzip handling of input. zerror = -3
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 1E125C bytes
Program terminating in thread 7 at tensorreduction Line 48 -->

And also:

1FillInputGZIP: Error in gzip handling of input. zerror = -3
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 1E125C bytes
Program terminating in thread 1 at expandmomenta Line 76 -->

I'm using tform. To be more specific: TFORM 5.0.0-beta.1 (Mar 15 2024, v5.0.0-beta.1-42-g2663e14) 8 workers

Do you have any idea why this happens, and whether there is some way to fix it?

Side note: Another interesting error I've seen is:

Error while reading scratch file in GetTerm
Program terminating in thread 0 at tensorreduction Line 9 -->

Any idea what might be the cause?

jodavies commented 1 month ago

Interesting, this one hasn’t come up for a while. How often do you see it? If it happens all the time for you, are you able to share the code which reproduces it?

alonli1 commented 1 month ago

It happens all the time. See the attached files (formfiles.zip). The file wcdimension.dat that appears in the .frm file is empty.

jodavies commented 1 month ago

This example is too heavy for debugging: it has run for more than 2 days (without any crash) so far. It is doing a lot of stagesorting, so maybe I can set up some stagesort-heavy tests to search for remaining bugs in the gzip system.

alonli1 commented 1 month ago

Interesting. How many cores are you using?

jodavies commented 1 month ago

I was using 8, since you pasted your version as TFORM 5.0.0-beta.1 (Mar 15 2024, v5.0.0-beta.1-42-g2663e14) 8 workers. How long does it take to crash for you?

I have since run some stagesort-heavy tests with some simple scripts which generate lots of terms, with artificially small buffer sizes, and did not see any issues.

alonli1 commented 1 month ago

Some of the workers ran into this error after a couple of hours but some ran for around a day or so.

jodavies commented 1 month ago

Do you see the problem with the debug build (tvorm)? If so, could you try either running with gdb and getting a stack trace at the crash, or running with this branch (you will need eu-addr2line installed) https://github.com/jodavies/form/tree/backtrace

jodavies commented 1 month ago

Also: what OS and compiler versions are you using?

Could you also define GZIPDEBUG in sort.c and compress.c, for your tests?

alonli1 commented 1 month ago

> Do you see the problem with the debug build (tvorm)? If so, could you try either running with gdb and getting a stack trace at the crash, or running with this branch (you will need eu-addr2line installed) https://github.com/jodavies/form/tree/backtrace

I'll try.

alonli1 commented 1 month ago

> Also: what OS and compiler versions are you using?
>
> Could you also define GZIPDEBUG in sort.c and compress.c, for your tests?

I'm using the HPC cluster of my university. The OS is Rocky Linux 8; I'm not sure about the compiler. And yes, I can do that.