Closed by johnbent 11 years ago
One more observation: if you use
fd = open( argv[1], O_WRONLY|O_CREAT, 0600 ); ... write()
everything is fine (the index is there), even with 2 or more PEs on Smog.
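///////////////////////////// Using open() ////////////////////////////////////////
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    int i;

    /* The output file path is the only argument. */
    if ( argc < 2 ) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        exit(1);
    }

    fd = open( argv[1], O_WRONLY|O_CREAT, 0600 );
    if ( fd == -1 ) {
        perror("open");
        exit(1);
    }

    /* Write "abc" 1000 times: 3000 bytes in total. */
    for ( i = 0 ; i < 1000 ; i++ ) {
        if ( write(fd, "abc", 3) != 3 ) {
            perror("write");
            exit(1);
        }
    }

    close(fd);
    return 0;
}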
Link to reproducer:
https://dl.dropbox.com/u/3442222/github_attachments/zero-index.tar.gz
Try reproducing on Cielito while Smog's Lustre is unavailable.
I have been unable to reproduce this on Cielito. I used version 2.4 and master (as of 8/21) with various PE and backend counts, and the outcome was always the same, as shown below:
Starting PLFS on ct-login1.localdomain:/users/atorrez/plfs.atorrez/lscratch1/n1
/users/atorrez/plfs.atorrez/lscratch1/n1/atorrez/Aug22075645545091855.1.w.dat w
0 w 0 3000 1377179805.5659279823303223 1377179805.5662109851837158 2999 [0. 0]
/users/atorrez/plfs.atorrez/lscratch1/n1/atorrez/Aug22075649597507779.16.w.dat w
0 w 0 3000 1377179809.6044900417327881 1377179809.6047160625457764 2999 [0. 0]
Note: I modified the script to handle mounting and unmounting of PLFS and removed the aprun. I then logged into an internal login node and aprun'd that script. The script is shown below:
plfs /users/atorrez/plfs.atorrez/lscratch1/n1
filemodelist="w"
nplist="1 16"
plfsdir_n1=/users/atorrez/plfs.atorrez/lscratch1/n1/atorrez
dirlist="$plfsdir_n1"
for filemode in $filemodelist
do
    for filedir in $dirlist
    do
        for np in $nplist
        do
            # Unique run ID built from the current date and time
            runid=$(date '+%b%d%H%M%S%N')
            filename=$runid.$np.$filemode.dat
            filepath=$filedir/$filename
            echo $filepath $filemode
            ./fopen.x $filepath $filemode
            plfs_map $filepath
            sleep 4
        done
    done
done
fusermount -u /users/atorrez/plfs.atorrez/lscratch1/n1
Closing this because I cannot reproduce it.
After looking into MILC's I/O code, I found that it uses fopen(). When it is time to save checkpoints, many ranks open the same file with fopen() and then write to it. By doing the same thing in a simple program, I was able to reproduce the no-index-file error: there are no index droppings in the backends, and plfs_map shows that the file has 0 entries.
The reproducer is attached. The main.c attached is just a simple program using fopen() to open a file and write some data to it.
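For readers without access to the attachment, here is a minimal sketch of the fopen() write path described above. It is not the attached main.c: the helper name write_with_fopen() and its parameters are hypothetical, and the "abc" payload is borrowed from the open() example elsewhere in this thread.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (NOT the attached main.c): opens `path` with fopen()
 * in the given mode, writes "abc" `count` times, and closes the stream.
 * Returns the number of bytes written, or -1 on any error. */
long write_with_fopen(const char *path, const char *mode, int count)
{
    const char *chunk = "abc";
    size_t len = strlen(chunk);
    long total = 0;
    int i;

    FILE *fp = fopen(path, mode);
    if (fp == NULL) {
        perror("fopen");
        return -1;
    }

    for (i = 0; i < count; i++) {
        /* fwrite() buffers in user space; the data reaches the file
         * system when the stdio buffer fills or at fflush()/fclose(). */
        if (fwrite(chunk, 1, len, fp) != len) {
            perror("fwrite");
            fclose(fp);
            return -1;
        }
        total += (long)len;
    }

    /* fclose() flushes the remaining buffer, so check its result too. */
    if (fclose(fp) != 0) {
        perror("fclose");
        return -1;
    }
    return total;
}
```

Called from a main() as write_with_fopen(argv[1], argv[2], 1000), this would write 3000 bytes per process. Whether this sketch triggers the missing-index problem on a given PLFS mount has not been verified here.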
Do the following to run the reproducer:
$ make
$ # modify plfs dir path in runtest.sh
$ ./runtest.sh
You will see something like this (plfs_map output of two files):
If you do ls on the backend of the 16PE file:
Some observations: