martymac / fpart

Sort files and pack them into partitions
https://www.fpart.org/
BSD 2-Clause "Simplified" License

Getting 'grep' and 'ls' errors on fpsync #22

Closed: gamename closed this issue 4 years ago

gamename commented 4 years ago

Hi,

I'm getting the following errors when running fpsync:

  2020-07-21T04:46:39.467-05:00 grep: write error
  2020-07-21T04:46:39.467-05:00 ls: write error: Broken pipe
  2020-07-21T04:46:44.790-05:00 => [QMGR] Starting job /tmp/fpsync/work/1595253731-3171/1989 (local)
  2020-07-21T04:46:58.236-05:00 <= [QMGR] Job 32760:local finished
  2020-07-21T04:46:58.236-05:00 grep: write error
  2020-07-21T04:46:58.236-05:00 ls: write error: Broken pipe
  2020-07-21T04:47:03.790-05:00 => [QMGR] Starting job /tmp/fpsync/work/1595253731-3171/1990 (local)
  2020-07-21T04:47:29.255-05:00 <= [QMGR] Job 14185:local finished
  2020-07-21T04:47:29.255-05:00 grep: write error

Background:

  1. Here is the fpsync invocation:
    /usr/bin/fpsync -O -X /ebs/mvprd/.glusterfs -X ./.root -n 64 -vvv /ebs/ /efs/ef-wc-mhiv-l-dev1/vault_27MAY/
  2. We are copying about 3TB of files
  3. The errors do not happen immediately; they only start appearing far into the process.
  4. Here is some additional info about the host on which this is running:
       [ec2-user@ip-10-64-10-215 ~]$ df -h
       Filesystem       Size  Used Avail Use% Mounted on
       devtmpfs         3.7G   76K  3.7G   1% /dev
       tmpfs            3.7G     0  3.7G   0% /dev/shm
       /dev/xvda1       7.9G  2.7G  5.2G  34% /
       10.64.128.181:/  8.0E  5.5T  8.0E   1% /efs
       /dev/xvdb1       3.0T  3.0T  5.8G 100% /ebs
       [ec2-user@ip-10-64-10-215 ~]$ sudo du -sh /tmp
       1.3G /tmp
martymac commented 4 years ago

Hello Tennis,

It seems fpsync has a problem picking up the next sync job.

Can you tell me what version of fpsync you are using? Also, can you try running fpsync in trace mode:

$ sh -x $(which fpsync) [your options] 1>/tmp/out.log 2>&1

and look at /tmp/out.log to see when the problem is triggered?

Can you also check that your tmp dir (/tmp/fpsync by default) is OK when the problem appears?
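
For example (assuming the default /tmp/fpsync location and the /tmp/out.log path from the command above), something along these lines:

grep -n 'write error' /tmp/out.log | head    # where the errors first appear in the trace
df -h /tmp                                   # free space on the filesystem backing /tmp
ls -la /tmp/fpsync                           # fpsync's work directory contents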

gamename commented 4 years ago

Hi Ganael,

Thank you for the quick reply. I am using version 1.0.

[ec2-user@ip-10-110-20-85 ~]$ fpsync --version
/usr/bin/fpsync: illegal option -- -
fpsync v1.0.0 - Sync directories in parallel using fpart and rsync
Copyright (c) 2014-2017 Ganael LAPLANCHE <ganael.laplanche@martymac.org>
WWW: http://contribs.martymac.org

I am working on getting a trace file for you.

As for the /tmp/fpsync directory, it appears OK (below). Is there anything in particular I should be looking for?

[ec2-user@ip-10-110-20-85 tmp]$ ls -la /tmp/fpsync
total 24
drwxr-xr-x 6 root root 4096 Jul 21 13:49 .
drwxrwxrwt 4 root root 4096 Jul 22 03:49 ..
drwxr-xr-x 3 root root 4096 Jul 21 13:49 log
drwxr-xr-x 3 root root 4096 Jul 21 13:49 parts
drwxr-xr-x 3 root root 4096 Jul 21 13:49 queue
drwxr-xr-x 3 root root 4096 Jul 21 13:49 work

Thanks -Tennis

gamename commented 4 years ago

@martymac Here is an fpsync log for you: fpsync.log.zip

gamename commented 4 years ago

@martymac Could you have a look at this, please? I'm afraid this is now a production problem.
Thanks for your time, -Tennis

martymac commented 4 years ago

Hello Tennis,

Sorry for the delay. I was looking for I/O errors on your /tmp, but that does not seem to be the case. Unfortunately, the logs you uploaded do not provide much more information. We can see that everything works well until grep starts to throw write errors, and later ls does too. That seems to be a system-related error; maybe you are running out of memory? Do you see anything in your kernel logs? Maybe a process in the pipe has been OOM-killed?
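
For example, on a Linux host something like this should show whether the OOM killer fired (exact commands and log locations vary with the distribution):

dmesg | grep -iE 'out of memory|oom'        # kernel ring buffer
journalctl -k | grep -i 'killed process'    # on systemd-based systems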

gamename commented 4 years ago

Hi Ganael,

No problem. Thanks for getting back to me. I'll look into those questions. Also, what is the impact of these errors? Do they prevent data from being copied?

martymac commented 4 years ago

Yes, data will probably not be copied properly as that part of the code is related to job queue management.

Moreover, if your system is short on memory, it will probably not be able to cope with 64 rsync processes in parallel. Try reducing the number of parallel jobs (option -n) and increasing their size by playing with the per-job max-size/file-number limits (options -s and -f) to see what happens. Finally (and most importantly), try to monitor your system once the above errors appear.
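
For example (the values below are purely illustrative, applied to the invocation from your first message, and assume -s takes a byte count):

# At most 16 concurrent rsync processes, up to 2 million files or ~200 GB per sync job:
/usr/bin/fpsync -O -X /ebs/mvprd/.glusterfs -X ./.root \
    -n 16 -f 2000000 -s $((200 * 1024 * 1024 * 1024)) \
    -vvv /ebs/ /efs/ef-wc-mhiv-l-dev1/vault_27MAY/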

Hope this helps, Best regards,

Ganael

gamename commented 4 years ago

Thanks! We will try those changes.

gamename commented 4 years ago

@themedina FYI

TheMedina commented 4 years ago

@martymac - I'm working with @gamename and we just completed our fpsync job (it took about 48 hours for 7390/7390 parts), and the last log line states that fpsync completed without errors. This is odd to us; we very clearly saw errors while monitoring the job. Additionally, we monitored via htop while the job was running, and memory was never an issue on our EC2 c4.2xlarge. Let me know your thoughts.

Thank you,

martymac commented 4 years ago

Before exiting, fpsync checks the stderr output of every sync job it has run. If it reports no errors (as here), you can be sure that every job that ran completed without error. But since the errors you see during the sync are (I presume) located in the loop that picks up jobs:

https://github.com/martymac/fpart/blob/master/tools/fpsync#L540

some jobs may not have run at all.

The question is why grep throws a write error here (breaking the pipe you see in the code above). If this is not related to missing memory, maybe you are hitting a system limit (such as the maximum pipe size; see ulimit -a for tunable limits)? To debug that, try to reproduce the problem once that limit seems to be reached (run the command in the pipe by hand) and see what the problem is.
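
For instance, something like this shows the current limits (the pipe-max-size knob is Linux-specific):

ulimit -a                         # per-process limits (pipe size, open files, max processes, ...)
cat /proc/sys/fs/pipe-max-size    # kernel-wide ceiling for pipe buffers on Linux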

Another way to (maybe) avoid that limit would be, as suggested, to reduce the number of partitions produced (see my previous answer).

Best regards, Ganael.

gamename commented 4 years ago

Well, reducing the number of threads from 64 to 32 does not help. We still get the same errors. I'll try reducing the number of partitions next.

@martymac, in your last response you mentioned reproducing the issue by running the pipe by hand. What do you have in mind? How would I do that?

Also note that we are using version 1.0 of the code. The repository version is 1.2. Are there any changes in 1.2 that could help?

@TheMedina FYI

martymac commented 4 years ago

Hi,

I meant just try running:

cd "${JOBS_QUEUEDIR}" && ls -rt1 | grep -v "info" | head -n 1

and see what breaks ($JOBS_QUEUEDIR is usually /tmp/fpsync/queue/<job name>).
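
If the whole pipe works, you can also run it stage by stage to see which command complains (/tmp/stage1.out and /tmp/stage2.out are just scratch file names), e.g.:

cd "${JOBS_QUEUEDIR}"
ls -rt1 > /tmp/stage1.out                          # does ls alone succeed?
grep -v "info" /tmp/stage1.out > /tmp/stage2.out   # does grep alone succeed?
head -n 1 /tmp/stage2.out                          # oldest job in the queue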

You can try version 1.2, but that part of the code has not been modified in the meantime.

Best regards,

Ganael.

gamename commented 4 years ago

No luck. I'm running with this combination (below) and I'm still seeing the same errors:

/usr/bin/fpsync -O -X /ebs/geisprd/.glusterfs -X ./.root -n 32 -f 10000 -vvv /ebs/geisprd/ /efs/nv-wc-geis-prd/vault12/

I'll try the manual command @martymac suggested next.

gamename commented 4 years ago

I tried the manual command and nothing seems wrong:

[ec2-user@ip-10-80-20-26 ~]$ export JOBS_QUEUEDIR=/tmp/fpsync/queue/1595670573-23817/
[ec2-user@ip-10-80-20-26 ~]$ cd "${JOBS_QUEUEDIR}" && ls -rt1 | grep -v "info" | head -n 1
2561
martymac commented 4 years ago

Hello,

Sorry, since I am unable to reproduce the problem, it's quite difficult to diagnose it from here (I have never had this issue reported before, especially for that command pipe, which should just work; that's why I really suspect a system issue).

Anyway, I've seen that you limited the number of files (-f option); maybe you can also try to limit the size of partitions (-s option)? The goal is to limit the final number of partitions produced, to see if your problem is related to too many partitions being listed in the queue directory.

Also, if you can access your src/ and dest/ filesystems from another host, maybe you can try to run the sync process from there?

Finally, as you suggested, you can try to upgrade to the latest fpart+fpsync version (1.2.0). It will not necessarily help with your problem, but you will gain new features (and bug fixes) as well.

Best regards,

Ganael.

gamename commented 4 years ago

Hi Ganael,

Thank you for your continued support. It's really appreciated. 👍

Like you, I suspect the system could be the problem. I just can't prove it so far. I will keep researching this angle too.

Per your suggestion, I will try limiting the partition sizes next (the -s option).

Here is a little more about our design. This is all taking place on AWS. We are copying data from EBS volumes to EFS (NFS) volumes. The instances doing the copying are c4.2xlarge. They are used for a single copy run and then self-terminate.

The progression goes like this:

  1. A system is instantiated
  2. A single EBS volume is attached
  3. A single EFS volume is attached
  4. FPSync copies data from EBS to EFS
  5. The system self-terminates

Instances are never re-used: each one drives a single copy process and is then terminated. We are seeing the same issue on multiple systems. Again, each system is new, but they all share a common configuration (CPU size, memory, etc.).

Also, the amount of data being copied is very large; multiple terabytes is common. For example, here is 'df -h' output from one copy instance:

[root@ip-10-80-20-154 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.4G   76K  7.4G   1% /dev
tmpfs           7.4G     0  7.4G   0% /dev/shm
/dev/xvda1       99G  5.0G   94G   6% /
10.88.39.137:/  8.0E   24T  8.0E   1% /efs
/dev/xvdb1       16T   13T  3.5T  79% /ebs

In the output above, the contents of /ebs are being copied to /efs. Notice that /ebs is 16 TB.

-Tennis

gamename commented 4 years ago

Possibly some good news. I ran this combination of options (note the -f value is one hundred thousand):

/usr/bin/fpsync -O -X /ebs/geisprd/.glusterfs -X ./.root -n 32 -f 100000 -vvvv /ebs/geisprd/ /efs/nv-wc-geis-prd/vault12/

I still get errors, but now only one of the two; the ls error has disappeared (below).

  2020-07-26T07:37:52.443-05:00 => [QMGR] Starting job /tmp/fpsync/work/1595752557-23797/2792 (local)
  2020-07-26T07:43:23.599-05:00 <= [QMGR] Job 28193:local finished
  2020-07-26T07:43:23.599-05:00 grep: write error

Is this significant?

martymac commented 4 years ago

I understand the difficulty of debugging all that (thanks for the details you provided), since those instances are so short-lived.

Yes, I think the results you get are a bit better (but I still really wonder why you are getting those errors). Maybe try increasing those values a little more (to create bigger and fewer partitions)?

I've looked at GNU grep a bit and found this:

http://git.savannah.gnu.org/cgit/grep.git/tree/NEWS#n233

Maybe (if possible on your Amazon instances) you could try upgrading grep to at least v2.25 to see if we can get more details?

Anyway, those errors are related to a write error to stdout:

http://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c#n1277

We just need to know what the exact problem is :p

gamename commented 4 years ago

Great suggestions!

I have upgraded both grep and fpart/fpsync on all the machines and I'm trying it again. The grep version is 3.4. The fpart/fpsync version is now 1.2.

Hopefully, we'll get better diagnostics now.

I've also upgraded the OS from Amazon Linux to Amazon Linux 2. Dunno if that will make a difference. Had to do that to get the latest fpart/fpsync version.

gamename commented 4 years ago

Are there any guidelines for the size of the /tmp directory we should use? Remember that we are copying files that add up to terabytes. Does that mean /tmp/fpsync should be a very large filesystem too?

gamename commented 4 years ago

Just confirmed we are getting errors again. Here is the output in the log:

  2020-07-27T10:37:55.590-05:00 1595864270 => [QMGR] Starting job /tmp/fpsync/work/1595853730-8707/3422 (local)
  2020-07-27T10:39:28.387-05:00 1595864367 <= [QMGR] Job 1759:3403:local finished
  2020-07-27T10:39:28.387-05:00 grep: write error: Broken pipe
  2020-07-27T10:39:33.590-05:00 1595864368 => [QMGR] Starting job /tmp/fpsync/work/1595853730-8707/3423 (local)
  2020-07-27T10:40:13.915-05:00 1595864413 <= [QMGR] Job 10103:3378:local finished
  2020-07-27T10:40:13.915-05:00 grep: write error: Broken pipe

Here is the invocation of fpsync:

/usr/bin/fpsync -O -X /ebs/geisprd/.glusterfs -X ./.root -n 32 -f 500000 -vvvv /ebs/geisprd/ /efs/nv-wc-geis-prd/vault12/

This is using fpsync 1.2 and grep 3.4.

martymac commented 4 years ago

Hi,

No, /tmp/fpsync should not be that big: it only needs to hold the partition files (so you may need some free inodes, but not a lot of space).
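
A quick check could look like this, for example:

df -i /tmp                  # free inodes on the filesystem backing /tmp
find /tmp/fpsync | wc -l    # rough count of the files fpsync keeps there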

Thanks for having updated grep! So we finally have confirmation that grep fails because of a broken pipe. EPIPE means there is no process left to read the data; in our case, most probably 'head' had finished its work and exited before grep.
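
You can reproduce the same harmless message in isolation, for example (this assumes a shell where SIGPIPE is ignored, which is what turns the broken pipe into a visible message rather than a silent exit):

sh -c 'trap "" PIPE; seq 1 1000000 | grep -v info | head -n 1'
# head exits after printing the first line; grep's next write() to the
# now-closed pipe then fails with EPIPE, so grep reports "write error: Broken pipe".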

Re-reading the fpsync code and your original logs (from the zip file), I think you are safe here: we can see that the next job gets picked up correctly (512, then 513, ...), so you can ignore those stderr warnings. This should probably not be shown to the end user, though, so I'll fix the code ASAP.

Thanks a lot for your report (and thanks for using fpart/fpsync :)) !

Ganael.

martymac commented 4 years ago

Should be fixed now (see commit https://github.com/martymac/fpart/commit/996e890b1f52e2254415bacf13d502473619ee5c), thanks again!

gamename commented 4 years ago

No problem. Thank you for all your help.