Closed by jodavies 5 years ago
This looks like a bug in boilerplate code that implements something like `std::vector<NAMESBLOCK>::reserve()`. `nnew` is the newly extended buffer, but it is never actually used: some of the writes to the buffer should go to `nnew` instead. The code is a bit chaotic about how the table should be updated in this function, though, so for now I'm not sure where `nnew` should replace the old vector buffer pointer...
Yes, it looks like this is the problem. Adding

```c
free(d->nblocks);
d->nblocks = nnew;
```

after the first `for` loop clears up all valgrind issues apart from the invalid write at `minos.c:1008`.
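For reference, a minimal sketch of the grow-and-swap pattern this fix restores (hypothetical names and layout, not the actual minos.c code):

```c
#include <stdlib.h>

/* Illustration of the std::vector-style reserve(): allocate a larger
   buffer, copy the old pointers over, then free the old buffer and
   install the new one. The bug was that the last two steps (free +
   install) were missing, so later writes went through the stale,
   too-small d->nblocks. */
typedef struct {
    void **nblocks;   /* array of block pointers */
    int numblocks;    /* entries in use */
    int maxblocks;    /* allocated capacity */
} tabledesc;

static int reserve_blocks(tabledesc *d, int newmax)
{
    void **nnew;
    if (newmax <= d->maxblocks) return 0;        /* already big enough */
    nnew = (void **)malloc(newmax * sizeof(void *));
    if (nnew == NULL) return -1;
    /* first loop: copy the existing block pointers into nnew */
    for (int i = 0; i < d->numblocks; i++) nnew[i] = d->nblocks[i];
    /* the two lines that were missing: */
    free(d->nblocks);
    d->nblocks = nnew;
    d->maxblocks = newmax;
    return 0;
}
```

Without the final swap, everything after the copy loop keeps reading and writing through the freed, undersized buffer, which matches the invalid reads/writes valgrind reports.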
I guess that the `fseek` at `minos.c:1033` was previously seeking to a ridiculous place in the file, causing these exabyte "sparse" files.
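That mechanism is easy to reproduce: on POSIX filesystems, seeking far past end-of-file and writing there creates a "sparse" file whose logical size is huge while the space actually allocated stays tiny. A small sketch (hypothetical helper, not FORM code):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Create a file whose logical size is `size` bytes but which occupies
   almost no disk space: seek past EOF and write a single byte. This is
   what a wild fseek offset (computed from stale data) can do to a
   tablebase file. */
static long long make_sparse(const char *path, long size)
{
    FILE *f = fopen(path, "wb");
    struct stat st;
    if (f == NULL) return -1;
    if (fseek(f, size - 1, SEEK_SET) != 0) { fclose(f); return -1; }
    fputc(0, f);                      /* one real byte at the far end */
    fclose(f);
    if (stat(path, &st) != 0) return -1;
    return (long long)st.st_size;     /* logical size, not blocks used */
}
```

On ext4 or nfs the reported size is `size` while `du` shows essentially nothing allocated, which would explain the 1.9TB/5.6EB files whose true content is only 20MB.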
> after the first `for` loop

I'm not sure that this is the right place, because the comment sounds like the pointer to the old buffer is still needed to do something, maybe for some checks and setup of the new blocks, until the end of the function.
Well, only `d->nblocks` is replaced with a larger array, and the `*nblocks` pointers are copied over; everything else in `d->` remains intact. Perhaps I misunderstand something, and certainly something is still wrong, given the invalid write later on. At least, the example above no longer crashes or creates enormous output, and I can load and use the table in other scripts.
OK, I'm also not familiar with this minos code, and perhaps you are right. Indeed, the fact that your example works is an improvement.
Do you want to make a commit and pull request for this? Then I will merge it.
It looks indeed like those two lines are missing. When it continues in the next loop, it assumes that the space is there, and hence `nnew` should have been installed by then.
Thanks for finding this.
Jos
My fork is a real mess it seems. Are you able to cherry-pick a single commit, or just re-implement the change in this repo?
OK. Then I will cherry-pick it from your fork.
Thanks! This is done. (And hopefully, the CI will finish without any errors...)
Perhaps this warrants a separate issue, but some more detail on the commentary of that commit: if you take the same test file from the first post here and put

```
#system cat /proc/`PID_'/io
```

before `.end`, you can inspect the difference between `wchar` and `write_bytes`. In this example, running from a local disk, `wchar` is about 18GB and `write_bytes` is 19.4MB (the size of test.tbl).
This means, I think, that FORM has written 18GB to disk in total but has over-written the same data many times, and only 19.4MB was actually committed to disk in the end by the OS.
If you run the same example on a network storage location, these numbers roughly coincide, and if you look at some sort of network throughput monitor while FORM is running you will see that your machine really is transferring all of that 18GB to the file server.
I am not sure why this is, but for this reason I always create tablebases in a tmpfs, and transfer to the proper place afterwards.
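The same comparison can be done programmatically. A small sketch of a parser for the `/proc/<pid>/io` format described in proc(5) (the path parameter and function name are hypothetical; pass `"/proc/self/io"` on Linux):

```c
#include <stdio.h>
#include <string.h>

/* Read one counter (e.g. "wchar" or "write_bytes") from a file in the
   /proc/<pid>/io format. Returns -1 if the field is not found.
   wchar counts bytes handed to write(), write_bytes counts bytes
   actually sent to the block layer; the gap is output that was
   overwritten in the page cache before ever reaching the disk. */
static long long io_counter(const char *path, const char *field)
{
    FILE *f = fopen(path, "r");
    char name[64];
    long long value, result = -1;
    if (f == NULL) return -1;
    while (fscanf(f, "%63s %lld", name, &value) == 2) {
        /* field names in the file carry a trailing ':' */
        if (strncmp(name, field, strlen(field)) == 0) {
            result = value;
            break;
        }
    }
    fclose(f);
    return result;
}
```

On network filesystems the page cache cannot absorb the rewrites the same way, which would explain why `wchar` and the actual network traffic roughly coincide there.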
The problem with table bases (and databases in general) is that you have to keep all information synchronized. Hence, if you write a new element of the table, you have to update the index. I know there must be better algorithms to do this, but I wanted to play it as safe as possible. This will indeed generate more traffic than absolutely needed.
Jos
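As a toy illustration of why keeping the index synchronized multiplies traffic (hypothetical file layout, not the actual tablebase format): if every appended element is followed by a seek back to rewrite the header, each element costs two writes, and `wchar` grows by both even though the header region on disk never changes size.

```c
#include <stdio.h>

/* Append n fixed-size elements to a file that starts with a one-long
   header holding the element count. After each element is written,
   seek back and rewrite the count, so a crash never leaves the index
   out of step with the data. Returns the count read back from the
   header. */
static long append_elements(FILE *f, long n)
{
    long count = 0;
    char element[64] = {0};
    fseek(f, 0, SEEK_SET);
    fwrite(&count, sizeof(count), 1, f);        /* initial header */
    for (long i = 0; i < n; i++) {
        fseek(f, 0, SEEK_END);
        fwrite(element, sizeof(element), 1, f); /* the data itself */
        count++;
        fseek(f, 0, SEEK_SET);
        fwrite(&count, sizeof(count), 1, f);    /* keep index in sync */
    }
    fflush(f);
    fseek(f, 0, SEEK_SET);
    fread(&count, sizeof(count), 1, f);
    return count;
}
```

A local filesystem coalesces the repeated header rewrites in the page cache, so `write_bytes` stays near the final file size; a network filesystem may push each one over the wire.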
The CI finished successfully, so I would like to close this issue for now. (But of course we can still discuss the heavy-traffic problem here, or open a new issue if we have ideas for performance improvements.)
The examples in this issue all relate to tablebases containing hundreds of tables.
Just for information, 0b3ab5d also fixes problems with tablebases containing a single table with millions of entries.
In this case the tablebase was created without apparent errors, but upon applying it to an expression in another FORM script, we received the errors `Could not read rhs` and `Cannot uncompress element %s in file`.
The following program produces some invalid reads and writes in `minos.c:PutTableNames`. I can't always get it to run successfully; it sometimes crashes. I can't remember why I made this test file (some months ago -- I think someone asked me for help with crashes related to tablebases containing many different tables), but it seems that it sometimes (not always) creates something like "sparse" files, which claim to be much bigger than they really are.
The size seems to depend on the filesystem, but when this program misbehaves it seems to produce a 1.9TB file on ext4, and a 5.6EB file (!!!) on nfs. Needless to say, this caused some problems when rsync attempted to back this file up to a remote location. The true size of the file is just 20MB.
Perhaps the errors reported by valgrind are responsible for the occasional crazy files?
I attach a file with the output: vg.log
Thanks, Josh.