glyustb / mogilefs

Automatically exported from code.google.com/p/mogilefs

mogile fsck runs over 100% (runs over initial estimated fid count) #50

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Started fsck with mogadm fsck start.
2. Increased the fsck worker count from our default of 3 to 5.
3. The initial status was:

mogadm fsck status

    Running: No
     Status: 62015847 / 83452559 (74.31%)
       Time: 1m (795075 fids/s; 26s remain)
 Check Type: Normal (check policy + files)

 [num_NOPA]: 108
 [num_POVI]: 2752197
 [num_REPL]: 2752197
 [num_SRCH]: 108

Note the Status line: 62015847 / 83452559 (74.31%).

After running fsck for over 2 weeks, this is what mogadm fsck status returned:

mogadm fsck status

    Running: Yes (on lfvsfcp10058.dn.net)
     Status: 147506157 / 83452559 (176.75%)
       Time: 11590m (212 fids/s; -301994s remain)
 Check Type: Normal (check policy + files)

 [num_MISS]: 7494016
 [num_NOPA]: 167
 [num_POVI]: 2752197
 [num_REPL]: 10246153
 [num_SRCH]: 167

Note the Status line: 147506157 / 83452559 (176.75%).

What version of the product are you using? On what operating system?

The latest MogileFS, 2.55 (with an updated perlball and the MySQL schema updated via mogdbsetup), on Red Hat Enterprise Linux Server release 5.5, x86_64, with 8 GB RAM and hex-core CPUs. The OS is on RAID 1+0 and the MogileFS data disk is on hardware RAID 0. We run 2 trackers on 2 hosts, with MySQL in master-master replication.

Please provide any additional information below.

I sent a mail to the MogileFS group about the issue.

I started the fsck check on 22 Nov 2011 at 18:05:22 IST, and the status at the start looked like:

    Running: Yes
     Status: 55252778 / 75053798 (73.61%)
       Time: 791m (1164 fids/s; 19801020m remain)
 Check Type: Normal (check policy + files)

 [num_GONE]: 1
 [num_NOPA]: 1
 [num_POVI]: 365
 [num_REPL]: 365
 [num_SRCH]: 1

and then increased the worker count to 10.

Today, when I check the status, it looks like:

    Running: Yes
     Status: 127372517 / 83452559 (152.63%)
       Time: 8732m (243 fids/s; -180665s remain)
 Check Type: Normal (check policy + files)

 [num_MISS]: 6447188
 [num_NOPA]: 163
 [num_POVI]: 2752197
 [num_REPL]: 9199319
 [num_SRCH]: 163

If I interpret everything correctly, that's a lot of missing files ([num_MISS]: 6447188), and num_POVI is high as well.

I thought the check would stop at 100%. What should I make of "Time: 8732m (243 fids/s; -180665s remain)"? Is there anything else in the current status that I should be worried about?
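For what it's worth, the negative remaining time is just arithmetic: an ETA computed against a fixed initial estimate goes negative once the checked count passes that estimate. A minimal sketch (illustrative only, not MogileFS's actual code), using the figures from the status above:

```python
def eta_seconds(checked, estimated_total, fids_per_sec):
    """Naive ETA: remaining fids divided by the current rate.

    Once `checked` exceeds the initial estimate, the remainder is
    negative, producing output like "-180665s remain".
    """
    return (estimated_total - checked) / fids_per_sec

# Figures from the status output above (the reported rate is rounded,
# so this only approximates the reported -180665s):
remain = eta_seconds(checked=127372517, estimated_total=83452559, fids_per_sec=243)
print(round(remain))  # -> -180741
```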

and got a reply from Dormando saying 

"You upgraded to 2.55 right? I forget...

The limit there is an estimate of how many FID's you have, then it just
keeps going until it runs out. It looks like you're doing a massive upload
right now... and FSCK is looking at your fids while they're in the middle
of being replicated.

We should probably open a bug and ensure FSCK stops after it hits its
initial estimated top fid (probably with a note about how many were
uploaded in the meantime), and you should probably *stop* that FSCK and
run it again when you're not uploading as fast."

Thanks
Tariq

Original issue reported on code.google.com by ganaiw...@gmail.com on 13 Dec 2011 at 7:58

GoogleCodeExporter commented 8 years ago
I see no update yet, but I'm sharing the latest status.

I had to reset fsck and start it from scratch to see if that would make any difference. For some reason, on MogileFS 2.55 the job reaper would hang and a newly started fsck would just not run; it would sit at 0% progress until you killed the mogilefsd process, started it again, and then started the fsck over from scratch. That's what I did a little over 2 weeks back; this is what it looked like:

[root@lfvcp1**57 ~]# mogadm fsck status

    Running: Yes (on lfvsfcp1**58.dn.net)
     Status: 6623 / 287432015 (0.00%)
       Time: 1m (72 fids/s; 65820m remain)
 Check Type: Normal (check policy + files)

 [num_GONE]: 2
 [num_MISS]: 2574
 [num_NOPA]: 2
 [num_POVI]: 2998
 [num_REPL]: 2998
 [num_SRCH]: 2

[root@lfcp1**57 ~]#         16th Jan

and this is what the fsck status looks like now, after 17 days:

    Running: Yes (on lfvfcp1**57.dn.net)
     Status: 290466589 / 287432015 (101.06%)
       Time: 2798m (1730 fids/s; -1753s remain)
 Check Type: Normal (check policy + files)

 [num_BLEN]: 67
 [num_GONE]: 25279
 [num_MISS]: 20222362
 [num_NOPA]: 25348
 [num_POVI]: 2175420
 [num_REPL]: 21569682
 [num_SRCH]: 25356

I fail to understand how MogileFS hits 101.06% in the status; it never seems to reach the initial estimated fid count and complete there.

Original comment by ganaiw...@gmail.com on 1 Feb 2012 at 12:40

GoogleCodeExporter commented 8 years ago
Have you upgraded to 2.56 yet? (now 2.57). The hang bug has been fixed.

Again: FSCK can go above 100% because it looks at MAX(fid) when you *start* FSCK. By the time it gets to that end point, it is also checking files that were uploaded after FSCK was started, so it ends up counting above 100%.

Whenever we get time to fix this bug, it will stop where you expect it to, instead of checking all files.
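The accounting described above can be sketched in a few lines; this is an illustrative simulation, not MogileFS's actual implementation. The denominator is frozen at the MAX(fid) snapshot taken when fsck starts, while the scan keeps walking fids uploaded later:

```python
def fsck_percent(max_fid_at_start, fids_uploaded_during_fsck):
    """Simulate fsck progress when the total is estimated once, up front."""
    estimated_total = max_fid_at_start  # MAX(fid) snapshot at fsck start
    # fsck keeps scanning until it runs out of fids, including those
    # uploaded after the estimate was taken.
    checked = max_fid_at_start + fids_uploaded_during_fsck
    return checked * 100 / estimated_total  # the reported "percent done"

print(fsck_percent(1000, 0))   # no uploads during the run -> 100.0
print(fsck_percent(1000, 60))  # uploads push the figure past 100 -> 106.0
```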

Original comment by dorma...@rydia.net on 1 Feb 2012 at 6:30

GoogleCodeExporter commented 8 years ago
Hi,

I tried to upgrade to 2.57 with CPAN but saw these errors at 'make test' for MogileFS::Server:

  DORMANDO/MogileFS-Server-2.57.tar.gz
  /usr/bin/make -- OK
Running make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e"
"test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-startup.t .............. skipped: Can't create temporary test
database: Failed to connect to database: Can't connect to local MySQL
server through socket '/var/lib/mysql/mysql.sock' (2) at
/home/root2/.cpan/build/MogileFS-Server-2.57-8aPnJM/blib/lib/MogileFS/Store.pm
line 370.
t/01-domain-class.t ......... skipped: Can't create temporary test
database: Failed to connect to database: Can't connect to local MySQL
server through socket '/var/lib/mysql/mysql.sock' (2) at
/home/root2/.cpan/build/MogileFS-Server-2.57-8aPnJM/blib/lib/MogileFS/Store.pm
line 370.
t/02-host-device.t .......... skipped: Can't create temporary test
database: Failed to connect to database: Can't connect to local MySQL
server through socket '/var/lib/mysql/mysql.sock' (2) at
/home/root2/.cpan/build/MogileFS-Server-2.57-8aPnJM/blib/lib/MogileFS/Store.pm
line 370.
t/10-weighting.t ............ skipped: Can't create temporary test
database: Failed to connect to database: Can't connect to local MySQL
server through socket '/var/lib/mysql/mysql.sock' (2) at
/home/root2/.cpan/build/MogileFS-Server-2.57-8aPnJM/blib/lib/MogileFS/Store.pm
line 370.
t/20-filepaths.t ............ skipped: Filepaths plugin has been separated
from the server, a bit of work is needed to make the tests run again.
t/30-rebalance.t ............ skipped: Can't create temporary test
database: Failed to connect to database: Can't connect to local MySQL
server through socket '/var/lib/mysql/mysql.sock' (2) at
/home/root2/.cpan/build/MogileFS-Server-2.57-8aPnJM/blib/lib/MogileFS/Store.pm
line 370.
t/fid-stat.t ................ ok
t/mogstored-shutdown.t ...... 1/4
#   Failed test 'started daemonized mogstored'
#   at t/mogstored-shutdown.t line 25.
Use of uninitialized value in concatenation (.) or string at
t/mogstored-shutdown.t line 30.
exist =
wasn't able to start up. at t/mogstored-shutdown.t line 35.
# Looks like you planned 4 tests but ran 1.
# Looks like you failed 1 test of 1 run.
# Looks like your test exited with 9 just after 1.
t/mogstored-shutdown.t ...... Dubious, test returned 9 (wstat 2304, 0x900)
Failed 4/4 subtests
t/multiple-hosts-replpol.t .. ok
t/replpolicy-parsing.t ...... ok
t/replpolicy.t .............. ok
t/store.t ................... skipped: Can't create temporary test
database: Failed to connect to database: Can't connect to local MySQL
server through socket '/var/lib/mysql/mysql.sock' (2) at
/home/root2/.cpan/build/MogileFS-Server-2.57-8aPnJM/blib/lib/MogileFS/Store.pm
line 370.
t/util.t .................... ok

Test Summary Report
-------------------
t/mogstored-shutdown.t    (Wstat: 2304 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 9
  Parse errors: Bad plan.  You planned 4 tests but ran 1.
Files=13, Tests=68,  3 wallclock secs ( 0.05 usr  0.02 sys +  1.63 cusr
0.29 csys =  1.99 CPU)
Result: FAIL
Failed 1/13 test programs. 1/68 subtests failed.
make: *** [test_dynamic] Error 255
  DORMANDO/MogileFS-Server-2.57.tar.gz
  /usr/bin/make test -- NOT OK
//hint// to see the cpan-testers results for installing this module, try:
  reports DORMANDO/MogileFS-Server-2.57.tar.gz
Running make install
  make test had returned bad status, won't install without force
Failed during this command:
 DORMANDO/MogileFS-Server-2.57.tar.gz         : make_test NO

cpan[2]>
I guess it's trying to connect to MySQL via the default socket and failing; in my case the socket path is different (mysql99.sock).
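As an aside, every skipped test fails the same way, on the default client socket path. One generic workaround (an assumption on my part; the MogileFS test suite doesn't document a socket override, and DBD::mysql only reads option files when the DSN asks it to) is to point MySQL clients at the real socket via ~/.my.cnf:

```ini
# ~/.my.cnf -- point MySQL clients at the nonstandard socket
[client]
socket = /var/lib/mysql/mysql99.sock
```

Alternatively, symlinking the real socket to the default path (ln -s /var/lib/mysql/mysql99.sock /var/lib/mysql/mysql.sock) sidesteps the question entirely.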

However, I manually ran make / make install and the upgrade seems to have worked fine; the tracker reports 2.57 when I telnet to it.

Original comment by ganaiw...@gmail.com on 13 Feb 2012 at 8:19

GoogleCodeExporter commented 8 years ago
The only error in there:

t/mogstored-shutdown.t ...... 1/4
#   Failed test 'started daemonized mogstored'
#   at t/mogstored-shutdown.t line 25.
Use of uninitialized value in concatenation (.) or string at
t/mogstored-shutdown.t line 30.
exist =
wasn't able to start up. at t/mogstored-shutdown.t line 35.

which is because you're running "make test" on a host which already *has* a mogstored running on it. You can't reliably run make test on a host that already has mogilefsd running.
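A tiny guard script (my own sketch, not part of the MogileFS tree) can enforce this before running the suite:

```shell
#!/bin/sh
# Refuse to run `make test` while mogstored or mogilefsd is up, since
# the test suite starts its own instances and collides with live ones.
if pgrep -x mogstored >/dev/null 2>&1 || pgrep -x mogilefsd >/dev/null 2>&1; then
    echo "stop mogstored/mogilefsd before running make test"
    exit 1
fi
echo "ok to run make test"
```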

Original comment by dorma...@rydia.net on 14 Feb 2012 at 3:21

GoogleCodeExporter commented 8 years ago
Sure, I understand; however, I assure you I had shut down mogilefsd/mogstored before I tried to upgrade. It failed at the same step twice, both with mogilefsd/mogstored running and without. Other than that, the upgrade has worked out just fine, no issues.

Thanks for your support !

Original comment by ganaiw...@gmail.com on 14 Feb 2012 at 2:45

GoogleCodeExporter commented 8 years ago
The issue described in the topic was actually fixed in 2.61, along with tons of other fsck fixes.

Closing this now.

Original comment by dorma...@rydia.net on 20 Jun 2012 at 12:48