Closed gamename closed 4 years ago
Hello Tennis,
It seems fpsync has a problem picking the next sync job to run.
Can you tell me what version of fpsync you are using? Also, can you try running fpsync in trace mode:
$ sh -x $(which fpsync) [your options] 1>/tmp/out.log 2>&1
and look at /tmp/out.log to see when the problem is triggered?
Can you also check that your tmp dir (/tmp/fpsync by default) is OK when the problem appears?
Hi Ganael,
Thank you for the quick reply. I am using version 1.0.
[ec2-user@ip-10-110-20-85 ~]$ fpsync --version
/usr/bin/fpsync: illegal option -- -
fpsync v1.0.0 - Sync directories in parallel using fpart and rsync
Copyright (c) 2014-2017 Ganael LAPLANCHE <ganael.laplanche@martymac.org>
WWW: http://contribs.martymac.org
I am working on getting a trace file for you.
As for the /tmp/fpsync directory, it appears OK (below). Is there anything in particular I should be looking for?
[ec2-user@ip-10-110-20-85 tmp]$ ls -la /tmp/fpsync
total 24
drwxr-xr-x 6 root root 4096 Jul 21 13:49 .
drwxrwxrwt 4 root root 4096 Jul 22 03:49 ..
drwxr-xr-x 3 root root 4096 Jul 21 13:49 log
drwxr-xr-x 3 root root 4096 Jul 21 13:49 parts
drwxr-xr-x 3 root root 4096 Jul 21 13:49 queue
drwxr-xr-x 3 root root 4096 Jul 21 13:49 work
Thanks -Tennis
@martymac Here is an fpsync log for you: fpsync.log.zip
@martymac Could you have a look at this, please? I'm afraid this is now a production problem.
Thanks for your time,
-Tennis
Hello Tennis,
Sorry for the delay. I was looking for I/O errors on your /tmp, but that does not seem to be the case. Unfortunately, the logs you uploaded do not provide much more information. We can see that everything works well until grep starts to throw write errors, and later ls does too. That looks like a system-related error; maybe you are running out of memory? Do you see anything in your kernel logs? Maybe a process in the pipe has been OOM-killed?
Hi Ganael,
No problem. Thanks for getting back to me. I'll look into those questions. Also, what is the impact of these errors? Do they prevent data from being copied?
Yes, data will probably not be copied properly, as that part of the code is related to job queue management.
Moreover, if your system is short on memory, it will probably not be able to cope with 64 rsync processes in parallel. Try reducing the number of parallel jobs (option -n) as well as increasing their size by playing with the per-job max file number and max size limits (options -f and -s) to see what happens. Finally (and most importantly), try to monitor your system once the above errors appear.
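As a rough sketch of why -f matters here: with N files and a per-job limit of F files, fpart produces on the order of ceil(N/F) partitions, so raising -f directly shrinks the job queue. The numbers below are made up for illustration, not taken from this thread, and ignore the -s size limit (which can only add partitions):

```shell
# Hypothetical numbers: how the -f (files-per-job) limit bounds the
# number of partitions fpart creates from the file count alone.
total_files=7390000    # pretend the source tree holds ~7.39M files
files_per_job=100000   # e.g. fpsync -f 100000
# integer ceiling division: ceil(total_files / files_per_job)
echo $(( (total_files + files_per_job - 1) / files_per_job ))   # prints 74
```

Fewer partitions means fewer entries in the queue directory and fewer job pick-up iterations.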
Hope this helps, Best regards,
Ganael
Thanks! We will try those changes.
@themedina FYI
@martymac - I'm working with @gamename and we just completed our fpsync job (it took about 48 hours for 7390/7390 parts), and the last log line states that fpsync completed without errors. This is odd to us; we very clearly saw errors while monitoring the job. Additionally, we monitored via htop while the job was running, and memory was never an issue on our EC2 c4.2xlarge. Let me know your thoughts.
Thank you,
Before exiting, fpsync checks the stderr output of every sync job it has run. If you got no errors there (as here), you can be sure that every job that ran completed without error; but as the errors you saw during the sync come (I presume) from the loop that picks jobs:
https://github.com/martymac/fpart/blob/master/tools/fpsync#L540
some jobs may not have run at all.
The question is why grep throws a write error here (breaking the pipe you see in the code above). If this is not related to missing memory, maybe you are hitting a system limit (such as the max pipe size; see ulimit -a for tunable limits)? To debug that, try to reproduce the problem once the limit seems to be reached (run the command in the pipe by hand) and see what the problem is.
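For reference, dumping the current per-process limits is just:

```shell
# Show tunable per-process resource limits for the current shell.
# With 64 parallel rsync jobs, "open files" and "max user processes"
# are the limits most likely to be relevant.
ulimit -a
```

(The exact output format varies between shells and operating systems.)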
Another way to (maybe) avoid that limit would be, as suggested, to reduce the number of partitions produced (see my previous answer).
Best regards, Ganael.
Well, reducing the number of threads from 64 to 32 does not help. We still get the same errors. I'll try reducing the number of partitions next.
@martymac, in your last response you mentioned reproducing the issue by running the pipe by hand. What do you have in mind? How would I do that?
Also note that we are using version 1.0 of the code. The repository version is 1.2. Are there any changes in 1.2 that could help?
@TheMedina FYI
Hi,
I meant just try running:
cd "${JOBS_QUEUEDIR}" && ls -rt1 | grep -v "info" | head -n 1
and try to see what breaks ($JOBS_QUEUEDIR usually being /tmp/fpsync/queue/<job name>).
You can try version 1.2 but that part of the code has not been modified in between.
Best regards,
Ganael.
No luck. I'm running with this combination (below) and I'm still seeing the same errors:
/usr/bin/fpsync -O -X /ebs/geisprd/.glusterfs -X ./.root -n 32 -f 10000 -vvv /ebs/geisprd/ /efs/nv-wc-geis-prd/vault12/
I'll try the manual command @martymac suggests next.
I tried the manual command and nothing seems wrong:
[ec2-user@ip-10-80-20-26 ~]$ export JOBS_QUEUEDIR=/tmp/fpsync/queue/1595670573-23817/
[ec2-user@ip-10-80-20-26 ~]$ cd "${JOBS_QUEUEDIR}" && ls -rt1 | grep -v "info" | head -n 1
2561
Hello,
Sorry, as I am unable to reproduce the problem, it's quite difficult to diagnose it from here (I've never had that issue reported before, especially for that command pipe, which should just work; that's why I really suspect a system issue here).
Anyway, I've seen you limited the number of files (-f option); maybe you can also try to limit the size of partitions (-s option)? The goal is to limit the final number of partitions produced, to see if your problem is related to too many partitions being listed in the queue directory.
Also, if you can access your src/ and dest/ FS from another host, maybe you can try to run the sync process from there ?
Finally, as you suggested, you can try upgrading to the latest fpart+fpsync version (1.2.0). It should not especially help with your problem, but you will gain new features (and bugfixes) as well.
Best regards,
Ganael.
Hi Ganael,
Thank you for your continued support. It's really appreciated. 👍
Like you, I suspect the system could be the problem. I just can't prove it so far. I will keep researching this angle too.
Per your suggestion, I will try limiting the partition sizes next (the -s option).
Here is a little more about our design. This is all taking place on AWS. We are copying data from EBS volumes to EFS (NFS) volumes. The instances doing the copying are C4.2xlarge. They are only used for copying one time, then they self-terminate.
The progression goes like this:
Instances are never re-used. They are used to drive one copy process and then terminated. We are seeing the same issue on multiple systems. Again, each system is new. But they all share a common configuration (cpu size, memory, etc).
Also, the amount of data being copied is very large; multiple terabytes is common. For example, here is 'df -h' output from one copy instance:
[root@ip-10-80-20-154 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 7.4G 76K 7.4G 1% /dev
tmpfs 7.4G 0 7.4G 0% /dev/shm
/dev/xvda1 99G 5.0G 94G 6% /
10.88.39.137:/ 8.0E 24T 8.0E 1% /efs
/dev/xvdb1 16T 13T 3.5T 79% /ebs
In the above display, the contents of /ebs are being copied to /efs. Notice that /ebs is 16 TB.
-Tennis
Possibly some good news. I ran this combination of options (note the -f value is one hundred thousand):
/usr/bin/fpsync -O -X /ebs/geisprd/.glusterfs -X ./.root -n 32 -f 100000 -vvvv /ebs/geisprd/ /efs/nv-wc-geis-prd/vault12/
I still get errors, but now I'm only getting one error. The second ls error has disappeared (below).
2020-07-26T07:37:52.443-05:00 => [QMGR] Starting job /tmp/fpsync/work/1595752557-23797/2792 (local)
2020-07-26T07:43:23.599-05:00 <= [QMGR] Job 28193:local finished
2020-07-26T07:43:23.599-05:00 grep: write error
Is this significant?
I understand the difficulty of debugging all that (thanks for the details you provided), since those instances are very volatile.
Yes, I think the results you get are a bit better (but I still really wonder why you are getting those errors). Maybe try increasing those values again a little (to create bigger and fewer partitions)?
I've looked at GNU grep a bit and found this:
http://git.savannah.gnu.org/cgit/grep.git/tree/NEWS#n233
Maybe (if possible on your Amazon instances) you could try upgrading grep to (at least) v2.25 to see if we can get more details?
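To confirm which grep an instance is actually running before and after such an upgrade, a quick sanity check is:

```shell
# Print the installed grep version. GNU grep >= 2.25 appends the errno
# text (e.g. "Broken pipe") to its write-error messages, which is the
# extra detail we are after here.
grep --version | head -n 1
```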
Anyway, those errors are related to a write error to stdout:
http://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c#n1277
We just need to know what the exact problem is :p
Great suggestions!
I have upgraded both grep and fpart/fpsync on all the machines and I'm trying it again. The grep version is 3.4. The fpart/fpsync version is now 1.2.
Hopefully, we'll get better diagnostics now.
I've also upgraded the OS from Amazon Linux to Amazon Linux 2. Dunno if that will make a difference. Had to do that to get the latest fpart/fpsync version.
Are there any guidelines for the size of the /tmp directory we should use? Remember that we are copying files that add up to terabytes. Does that mean /tmp/fpsync should be a very large filesystem too?
Just confirmed we are getting errors again. Here is the output in the log:
2020-07-27T10:37:55.590-05:00 1595864270 => [QMGR] Starting job /tmp/fpsync/work/1595853730-8707/3422 (local)
2020-07-27T10:39:28.387-05:00 1595864367 <= [QMGR] Job 1759:3403:local finished
2020-07-27T10:39:28.387-05:00 grep: write error: Broken pipe
2020-07-27T10:39:33.590-05:00 1595864368 => [QMGR] Starting job /tmp/fpsync/work/1595853730-8707/3423 (local)
2020-07-27T10:40:13.915-05:00 1595864413 <= [QMGR] Job 10103:3378:local finished
2020-07-27T10:40:13.915-05:00 grep: write error: Broken pipe
Here is the invocation of FPSync:
/usr/bin/fpsync -O -X /ebs/geisprd/.glusterfs -X ./.root -n 32 -f 500000 -vvvv /ebs/geisprd/ /efs/nv-wc-geis-prd/vault12/
This is using FPSync 1.2 and grep 3.4.
Hi,
No, /tmp/fpsync should not be that big: it only needs to hold the partition lists (so you may need some free inodes, but not a lot of space).
Thanks for updating grep! So we finally have confirmation that grep fails due to a broken pipe. EPIPE means there is no process left to read the data: in our case, most probably 'head' having finished its work and exited before grep.
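For the curious, the message is easy to reproduce outside fpsync. In a plain interactive shell, a writer hit by SIGPIPE usually just dies silently; the textual "write error: Broken pipe" typically appears when SIGPIPE is ignored (I'm assuming something in fpsync's job loop runs that way), so this sketch ignores it explicitly:

```shell
# head exits after printing one line; grep, still trying to flush several
# MB of remaining output, then gets EPIPE on write(). With SIGPIPE ignored,
# grep reports a write error on stderr instead of dying silently.
seq 1000000 | sh -c 'trap "" PIPE; exec grep -v info' | head -n 1   # prints 1
```

The "1" on stdout is the correct result; the stderr message is cosmetic, which matches what we see in the fpsync logs.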
Re-reading the fpsync code and your original logs (from the zip file), I think you are safe here, as we can see that the next job is picked up correctly (512, then 513, ...), so you can ignore those stderr warnings. This should not appear to the end user, though, so I'll fix the code ASAP.
Thanks a lot for your report (and thanks for using fpart/fpsync :)) !
Ganael.
Should be fixed now (see commit https://github.com/martymac/fpart/commit/996e890b1f52e2254415bacf13d502473619ee5c), thanks again!
No problem. Thank you for all your help.
Hi,
I'm getting the following errors running fpsync
Background: