vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0

valgrind errors when creating tablebases #318

Closed jodavies closed 5 years ago

jodavies commented 5 years ago

The following program produces some invalid reads and writes in minos.c: PutTableNames. I can't always get it to run successfully; it sometimes crashes.

Symbol x,y,z;
TableBase "test.tbl" create;
.sort

#do i = 1, 600
    #message `i'
    Table sparse, zerofill, inttab`i'(2);
    #do j = 1,100
        Fill inttab`i'(`j',1) = x;
    #enddo
    TableBase "test.tbl" addto inttab`i';
    .sort
#enddo
.end

I can't remember why I made this test file (some months ago -- I think someone asked me for help with crashes related to tablebases containing many different tables), but it seems that it sometimes (not always) creates something like "sparse" files, which claim to be much bigger than they really are.

The size seems to depend on the filesystem, but when this program misbehaves it seems to produce a 1.9TB file on ext4, and a 5.6EB file (!!!) on nfs. Needless to say, this caused some problems when rsync attempted to back this file up to a remote location. The true size of the file is just 20MB.

Perhaps the errors reported by valgrind are responsible for the occasional crazy files?

I attach a file with the output: vg.log

Thanks, Josh.

tueda commented 5 years ago

This looks like a bug in boilerplate code for something akin to std::vector<NAMESBLOCK>::reserve().

https://github.com/vermaseren/form/blob/558b01fb27251e1a3a8e4799e6bfdcdf68c1d36c/sources/minos.c#L965-L988

nnew is the new, extended buffer, but it is never actually used. Some of the writes to the buffer should go to nnew. But the code is a bit chaotic about how the table should be updated in this function, so for now I'm not sure where nnew should replace the old vector buffer pointer...
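
For reference, the usual shape of such a reserve()-like grow-and-swap is roughly the following. Apart from d->nblocks (which appears in PutTableNames), the struct and field names are simplified stand-ins, not the real minos.c layout:

/* Minimal sketch of the reserve()-like growth pattern under discussion.
   Fields other than nblocks are simplified stand-ins for the real bookkeeping. */
#include <stdlib.h>

typedef struct { char name[256]; } NAMESBLOCK;

typedef struct {
    NAMESBLOCK **nblocks;   /* array of pointers to name blocks */
    int numnblocks;         /* current capacity of that array   */
} DBASE;

static int grow_nblocks(DBASE *d, int newnum)
{
    NAMESBLOCK **nnew = (NAMESBLOCK **)malloc(newnum*sizeof(NAMESBLOCK *));
    if ( nnew == NULL ) return -1;
    /* copy the existing pointers into the larger array */
    for ( int i = 0; i < d->numnblocks; i++ ) nnew[i] = d->nblocks[i];
    for ( int i = d->numnblocks; i < newnum; i++ ) nnew[i] = NULL;
    /* the step the valgrind log suggests is missing: retire the old array
       and install nnew, so that later writes land inside the new buffer */
    free(d->nblocks);
    d->nblocks = nnew;
    d->numnblocks = newnum;
    return 0;
}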

jodavies commented 5 years ago

Yes, it looks like this is the problem. Adding

free(d->nblocks); 
d->nblocks = nnew;

after the first for loop clears up all valgrind issues apart from the invalid write at minos.c:1008.

I guess that the fseek at minos.c:1033 was previously seeking to a ridiculous place in the file, causing these exabyte "sparse" files.
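
For anyone unfamiliar with how that produces multi-terabyte files: seeking far past the end of a file and then writing creates a "sparse" file on filesystems that support it. A minimal standalone illustration, with a made-up file name and offset:

/* How a wild fseek offset yields a huge sparse file.
   The file name and offset are hypothetical; assumes a 64-bit long (e.g. Linux x86-64). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("sparse-demo.bin", "wb");
    if ( f == NULL ) return 1;
    /* seek ~1.9 TB past the start, as a corrupted index offset might */
    fseek(f, 1900000000000L, SEEK_SET);
    /* one byte written here makes the file *claim* to be 1.9 TB long,
       while only a few blocks are actually allocated on disk */
    fputc('x', f);
    fclose(f);
    return 0;
}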

tueda commented 5 years ago

after the first for loop

I'm not sure that this is the right place, because the comment

https://github.com/vermaseren/form/blob/558b01fb27251e1a3a8e4799e6bfdcdf68c1d36c/sources/minos.c#L989-L991

sounds like the pointer to the old buffer is still needed to do something, maybe for some checks and setup of the new blocks, until the end of the function.

jodavies commented 5 years ago

Well, only d->nblocks is replaced with a larger array, and the *nblocks pointers are copied. Everything else in d-> remains intact. Perhaps I am misunderstanding something, and for sure something is still wrong, given the invalid write later on.

At least, the example above no longer crashes or creates enormous output, and I can load and use the table in another script.

tueda commented 5 years ago

OK, I'm also not familiar with this minos code, and perhaps you are right. Indeed, the fact that your example works is an improvement.

Do you want to make a commit and pull request for this? Then I will merge it.

vermaseren commented 5 years ago

It indeed looks like those two lines are missing.

When it continues in the next loop it assumes that the space is there and hence nnew should have been installed.

Thanks for finding this.

Jos

jodavies commented 5 years ago

My fork is a real mess, it seems. Are you able to cherry-pick a single commit, or would you rather just re-implement the change in this repo?

tueda commented 5 years ago

OK. Then I will cherry-pick it from your fork.

tueda commented 5 years ago

Thanks! This is done. (And hopefully, the CI will finish without any errors...)

jodavies commented 5 years ago

Perhaps this warrants a separate issue, but here is some more detail on the commentary of that commit:

If you take the same test file from the first post here, and put

#system cat /proc/`PID_'/io

before .end, you can inspect the difference between wchar and write_bytes. In this example, running from a local disk, wchar is about 18GB and write_bytes is 19.4MB (the size of test.tbl).

This means, I think, that FORM has written 18GB to disk in total but has over-written the same data many times, and only 19.4MB was actually committed to disk in the end by the OS.
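
If it helps, the same two counters can be compared outside FORM with a few lines of C. A Linux-only sketch that just reads its own /proc/self/io to show the format; to watch a running FORM job instead, open /proc/<its pid>/io:

/* Read wchar and write_bytes from /proc/self/io (Linux only). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/io", "r");
    if ( f == NULL ) { perror("/proc/self/io"); return 1; }
    char line[128];
    unsigned long long wchar = 0, write_bytes = 0;
    while ( fgets(line, sizeof(line), f) != NULL ) {
        sscanf(line, "wchar: %llu", &wchar);
        sscanf(line, "write_bytes: %llu", &write_bytes);
    }
    fclose(f);
    /* wchar counts every byte handed to write(); write_bytes counts what
       the kernel actually had to send to the storage layer */
    printf("wchar = %llu, write_bytes = %llu\n", wchar, write_bytes);
    return 0;
}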

If you run the same example on a network storage location, these numbers roughly coincide, and if you look at some sort of network throughput monitor while FORM is running you will see that your machine really is transferring all of that 18GB to the file server.

I am not sure why this is, but for this reason I always create tablebases in a tmpfs, and transfer to the proper place afterwards.

vermaseren commented 5 years ago

The problem with table bases (and databases in general) is that you have to keep all information synchronized. Hence, if you write a new element of the table, you have to update the index. I know there must be better algorithms to do this, but I wanted to play it as safe as possible. This will indeed generate more traffic than absolutely needed.
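
A rough illustration of why this makes wchar grow much faster than write_bytes: the index at the front of the file is rewritten after every addition, so the same region is written over and over. The layout, names, and sizes below are made up, not the real tablebase format:

/* Sketch of the write pattern described above: every new element is appended
   and then the index block at the front of the file is rewritten. */
#include <stdio.h>

#define INDEXSIZE 4096

static void add_element(FILE *f, long *end, const char *data, size_t n, const char *index)
{
    /* append the new element at the current end of the data */
    fseek(f, *end, SEEK_SET);
    fwrite(data, 1, n, f);
    *end += (long)n;
    /* keep everything synchronized: rewrite the whole index at offset 0.
       wchar grows by INDEXSIZE on every call, but a local filesystem can
       keep merging these writes in cache, so write_bytes stays small;
       over NFS each rewrite really travels to the server. */
    fseek(f, 0L, SEEK_SET);
    fwrite(index, 1, INDEXSIZE, f);
    fflush(f);
}

int main(void)
{
    static char index[INDEXSIZE];
    FILE *f = fopen("demo.tbl", "w+b");
    if ( f == NULL ) return 1;
    long end = INDEXSIZE;            /* data region starts after the index */
    for ( int i = 0; i < 1000; i++ )
        add_element(f, &end, "element", 7, index);
    fclose(f);
    return 0;
}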

Jos

tueda commented 5 years ago

The CI finished successfully, so I would like to close this issue for now. (But of course we can still discuss the heavy-traffic problem here, or open a new issue if we have ideas for performance improvements.)

jodavies commented 4 years ago

The examples in this issue all relate to tablebases containing hundreds of tables.

Just for information, 0b3ab5d also fixes problems with tablebases containing a single table with millions of entries.

In this case the tablebase was created without apparent errors, but upon applying it to an expression in another FORM script, we received the errors "Could not read rhs" and "Cannot uncompress element %s in file".