Open irvintim opened 9 years ago
Hello.
This happens about once every dozen or so retrieval requests.
Not too rare and nobody else noticed that.
Could this be hardware issue (memory, network)? Did you test memory, did you tried different machine?
Anyway, could you pls give
1) your OS, distro and version
2) perl -V
(notice capital "V")
3) mtglacier version
4) mtglacier --version
output
That's a good thought about testing the memory -- I'll schedule a maintenance window.
Here are the details you requested:
11:12 AM [root@storage01 ~] more /etc/redhat-release CentOS release 6.6 (Final) 12:54 PM [root@storage01 ~] perl -V Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
Platform: osname=linux, osvers=2.6.32-220.el6.x86_64, archname=x86_64-linux-thread-multi uname='linux c6b9.bsys.dev.centos.org 2.6.32-220.el6.x86_64 #1 smp tue dec 6 19:48:22 gmt 2011 x86_64 x86_64 x86_64 gnulinux ' config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -DDEBUGGING=-g -Dversion=5.10.1 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Darchlib=/usr/lib64/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib64/perl5/vendor_perl -Dinc_version_list=5.10.0 -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic', cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.4.7 20120313 (Red Hat 4.4.7-16)', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='gcc', ldflags =' -fstack-protector' libpth=/usr/local/lib64 /lib64 /usr/lib64 libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc libc=, so=so, useshrplib=true, libperl=libperl.so gnulibc_version='2.12' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib64/perl5/CORE' cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'
Characteristics of this binary (from libperl): Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP USE_64_BIT_ALL USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API USE_SITECUSTOMIZE Built under linux Compiled at Jul 24 2015 01:51:51 @INC: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 . 12:54 PM [root@storage01 ~] uname -a Linux storage01.sfo.metainterfaces.com 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed Jan 28 21:11:36 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux 12:54 PM [root@storage01 ~] mtglacier --version MT-AWS-Glacier, Copyright 2012-2014 Victor Efimov http://mt-aws.com/ Version 1.120
mt-aws-glacier version: 1.120 Perl Version: 5.010001 AutoLoader 5.68 /usr/share/perl5/AutoLoader.pm Carp 1.11 /usr/share/perl5/Carp.pm Carp::Heavy undef /usr/share/perl5/Carp/Heavy.pm Class::Struct 0.63 /usr/share/perl5/Class/Struct.pm Config undef /usr/lib64/perl5/Config.pm Cwd 3.3 /usr/lib64/perl5/Cwd.pm Digest::SHA 5.47 /usr/lib64/perl5/Digest/SHA.pm Digest::base 1.16 /usr/share/perl5/Digest/base.pm DynaLoader 1.10 /usr/lib64/perl5/DynaLoader.pm Encode 2.35 /usr/lib64/perl5/Encode.pm Encode::Alias 2.12 /usr/lib64/perl5/Encode/Alias.pm Encode::Config 2.05 /usr/lib64/perl5/Encode/Config.pm Encode::Encoding 2.05 /usr/lib64/perl5/Encode/Encoding.pm Errno 1.11 /usr/lib64/perl5/Errno.pm Exporter 5.63 /usr/share/perl5/Exporter.pm Exporter::Heavy 5.63 /usr/share/perl5/Exporter/Heavy.pm Fcntl 1.06 /usr/lib64/perl5/Fcntl.pm File::Basename 2.77 /usr/share/perl5/File/Basename.pm File::Find 1.14 /usr/share/perl5/File/Find.pm File::Path 2.08 /usr/share/perl5/File/Path.pm File::Spec 3.3 /usr/share/perl5/File/Spec.pm File::Spec::Unix 3.3 /usr/share/perl5/File/Spec/Unix.pm File::Temp 0.22 /usr/share/perl5/File/Temp.pm File::stat 1.01 /usr/share/perl5/File/stat.pm FileHandle 2.02 /usr/share/perl5/FileHandle.pm Getopt::Long 2.38 /usr/share/perl5/Getopt/Long.pm HTTP::Date 5.831 /usr/share/perl5/HTTP/Date.pm HTTP::Headers 5.827 /usr/share/perl5/HTTP/Headers.pm HTTP::Message 5.832 /usr/share/perl5/HTTP/Message.pm HTTP::Request 5.827 /usr/share/perl5/HTTP/Request.pm HTTP::Response 5.824 /usr/share/perl5/HTTP/Response.pm HTTP::Status 5.817 /usr/share/perl5/HTTP/Status.pm I18N::Langinfo 0.02 /usr/lib64/perl5/I18N/Langinfo.pm IO 1.25 /usr/lib64/perl5/IO.pm IO::File 1.14 /usr/lib64/perl5/IO/File.pm IO::Handle 1.28 /usr/lib64/perl5/IO/Handle.pm IO::Pipe 1.13 /usr/lib64/perl5/IO/Pipe.pm IO::Seekable 1.1 /usr/lib64/perl5/IO/Seekable.pm IO::Select 1.17 /usr/lib64/perl5/IO/Select.pm JSON::XS 2.27 /usr/lib64/perl5/JSON/XS.pm LWP 5.833 /usr/share/perl5/LWP.pm LWP::MemberMixin undef /usr/share/perl5/LWP/MemberMixin.pm LWP::Protocol 5.829 /usr/share/perl5/LWP/Protocol.pm LWP::UserAgent 5.833 /usr/share/perl5/LWP/UserAgent.pm List::Util 1.21 /usr/lib64/perl5/List/Util.pm MIME::Base64 3.08 /usr/lib64/perl5/MIME/Base64.pm POSIX 1.17 /usr/lib64/perl5/POSIX.pm PerlIO::encoding 0.11 /usr/lib64/perl5/PerlIO/encoding.pm Scalar::Util 1.21 /usr/lib64/perl5/Scalar/Util.pm SelectSaver 1.02 /usr/share/perl5/SelectSaver.pm Storable 2.20 /usr/lib64/perl5/Storable.pm Symbol 1.07 /usr/share/perl5/Symbol.pm Tie::Hash 1.03 /usr/share/perl5/Tie/Hash.pm Time::Local 1.1901 /usr/share/perl5/Time/Local.pm URI 1.40 /usr/share/perl5/URI.pm URI::Escape 3.29 /usr/share/perl5/URI/Escape.pm XSLoader 0.10 /usr/lib64/perl5/XSLoader.pm attributes 0.09 /usr/share/perl5/attributes.pm base 2.14 /usr/share/perl5/base.pm bytes 1.03 /usr/share/perl5/bytes.pm common::sense 3.5 /usr/share/perl5/vendor_perl/common/sense.pm constant 1.17 /usr/share/perl5/constant.pm integer 1.00 /usr/share/perl5/integer.pm overload 1.07 /usr/share/perl5/overload.pm re 0.09 /usr/lib64/perl5/re.pm strict 1.04 /usr/share/perl5/strict.pm utf8 1.07 /usr/share/perl5/utf8.pm vars 1.01 /usr/share/perl5/vars.pm warnings 1.06 /usr/share/perl5/warnings.pm warnings::register 1.01 /usr/share/perl5/warnings/register.pm OK DONE 12:54 PM [root@storage01 ~]
On Tue, Sep 15, 2015 at 12:38 PM, Victor Efimov notifications@github.com wrote:
Hello.
This happens about once every dozen or so retrieval requests.
Not too rare and nobody else noticed that.
Could this be hardware issue (memory, network)? Did you test memory, did you tried different machine?
Anyway, could you pls give
1) your OS, distro and version 2) perl -V (notice capital "V") 3) mtglacier version 4) mtglacier --version output
— Reply to this email directly or view it on GitHub https://github.com/vsespb/mt-aws-glacier/issues/115#issuecomment-140513528 .
Tim Irvin NetTempo, Inc. 130 Battery St., Suite 500 San Francisco, CA 94111 +1 415 992-4902 voice/fax http://www.nettempo.com
I already tested/developed with that version of perl. Don't see anything suspicious in module versions too. Let's wait memory test.
Hi Victor,
Well, no joy on the memory check. The RAM checked out fine on the server.
I have a theory, see if you think there could be any correlation.
The machine that is doing the restores is also pushing new content to glacier -- so there can be multiple mtglacer commands running at the same time, they would be operating on different journal files however.
What we do when we schedule something for restore is create a new journal file with just the relevant "CREATED" line out of journal.log into a new file, and then run:
mtglacier restore .....
using the new journal file.
Then after 4 hours we run:
mtglacier restore-completed ....
on that journal file to download the archive. If the file fails to be placed on the disk when this command has finished running, then the "mtglacier restore-completed ...." will be run again at the top of the next hour -- this will repeat for 20 hours total and if it fails to download then this archive will be placed back in the queue of things to download and the whole process will start again.
It appears that the failure in downloading occurs mostly (perhaps solely, but I haven't confirmed that yet) when a different process on the same machine is uploading archives to glacier at the same time, also using mtglacier (but to reiterate, it's working off a different journal file).
Tim
On Tue, Sep 15, 2015 at 1:03 PM, Victor Efimov notifications@github.com wrote:
I already tested/developed with that version of perl. Don't see anything suspicious in module versions too. Let's wait memory test.
— Reply to this email directly or view it on GitHub https://github.com/vsespb/mt-aws-glacier/issues/115#issuecomment-140520950 .
Tim Irvin NetTempo, Inc. 130 Battery St., Suite 500 San Francisco, CA 94111 +1 415 992-4902 voice/fax http://www.nettempo.com
ok.. let me think..
It appears that the failure in downloading occurs mostly (perhaps solely, but I haven't confirmed that yet) when a different process on the same machine is uploading archives
Would it be true if we say that failure happens when overall machine load is high (i.e. not too much free memory, pretty much of disk activity etc)?
Also - the drive where you donwload files - is this a network drive? What technology?
I think I can eliminate the theory about other mtglacier apps running at the same time, looking at the logs that job finished around 8:30am today and I've had more restore failures since then.
Looking at the machine load, it's not very high.
The machine's load average is 0.33 It has 24 cores, noe are over 5% utililization. The machine has 24 GB or ramand 8GB are in use, with 4GB of that being used for disk caching. The data drive used is a LUN from a local JBOD SCSI disk array, via an Adaptec RAID controller -- it is a stripe of 2 3 TB drives drives into a RAID0 LUN of 5.5 TB. The filesystem is XFS.
Here is the current iostat on that partition (5 second intervals):
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 1.60 0.00 14.40 0 72
On Wed, Sep 16, 2015 at 2:36 PM, Victor Efimov notifications@github.com wrote:
It appears that the failure in downloading occurs mostly (perhaps solely, but I haven't confirmed that yet) when a different process on the same machine is uploading archives
Would it be true if we say that failure happens when overall machine load is high (i.e. not too much free memory, pretty much of disk activity etc)?
Also - the drive where you donwload files - is this a network drive? What technology?
— Reply to this email directly or view it on GitHub https://github.com/vsespb/mt-aws-glacier/issues/115#issuecomment-140901447 .
Tim Irvin NetTempo, Inc. 130 Battery St., Suite 500 San Francisco, CA 94111 +1 415 992-4902 voice/fax http://www.nettempo.com
I tried simulate the error. I only get this error if data corruption happens. I simulated like this:
--- a/lib/App/MtAws/GlacierRequest.pm
+++ b/lib/App/MtAws/GlacierRequest.pm
@@ -522,6 +522,7 @@ sub perform_lwp
die exception "unknow_error" => "Unknown error, probably LWP version is too old";
}
}
+ substr($_[0],1,1)='x' if rand() < 0.9;
$self->{writer}->add_data($_[0]);
1;
});
Other errors - like data truncation, wrong HTTP headers, error in creating temp files during download does not produce such error (i.e. all works as expected).
If you use HTTP could you try HTTPS, and if you use HTTPS could you try HTTP ? Let's see how close the bug to networking library.
We use HTTP, I had to install LWP::Protocol::https from CPAN, along with a bunch of other things, dev tools, Net::SSLeay, gcc, openssl-devel, etc... in order to switch to HTTPS.
The first test succeeded after switching, I'm running some more test restores today and will let you know how they go.
Thanks for all the help!
Tim
On Thu, Sep 17, 2015 at 5:26 AM, Victor Efimov notifications@github.com wrote:
I tried simulate the error. I only get this error if data corruption happens. I simulated like this:
--- a/lib/App/MtAws/GlacierRequest.pm +++ b/lib/App/MtAws/GlacierRequest.pm @@ -522,6 +522,7 @@ sub perform_lwp die exception "unknow_error" => "Unknown error, probably LWP version is too old"; } }
- substr($_[0],1,1)='x' if rand() < 0.9; $self->{writer}->adddata($[0]); 1; });
Other errors - like data truncation, wrong HTTP headers, error in creating temp files during download does not produce such error (i.e. all works as expected).
If you use HTTP could you try HTTPS, and if you use HTTPS could you try HTTP ? Let's see how close the bug to networking library.
— Reply to this email directly or view it on GitHub https://github.com/vsespb/mt-aws-glacier/issues/115#issuecomment-141064078 .
Tim Irvin NetTempo, Inc. 130 Battery St., Suite 500 San Francisco, CA 94111 +1 415 992-4902 voice/fax http://www.nettempo.com
yes, just in case, deploy instructions for centos6 here https://github.com/vsespb/mt-aws-glacier#rhelcentos-6
Thanks. Since we didn't use HTTPS I hadn't gone through those steps before. But it's all in place now.
On Thu, Sep 17, 2015 at 10:26 AM, Victor Efimov notifications@github.com wrote:
yes, just in case, deploy instructions for centos6 here https://github.com/vsespb/mt-aws-glacier#rhelcentos-6
— Reply to this email directly or view it on GitHub https://github.com/vsespb/mt-aws-glacier/issues/115#issuecomment-141158787 .
Tim Irvin NetTempo, Inc. 130 Battery St., Suite 500 San Francisco, CA 94111 +1 415 992-4902 voice/fax http://www.nettempo.com
No errors after 8 restores. What does this tell us? Besides that I'll leave it on HTTPS for now. It would be nice to go back to unencrypted at some point, but this workaround is fine for now. Thanks again for all your help. Tim
---Tim IrvinNetTempo, Inc.130 Battery St., Suite 500San Francisco, CA 94111+1-415-992-4902 voice/faxhttp://www.nettempo.com Sent via the Samsung Galaxy S® 6 edge, an AT&T 4G LTE smartphone-------- Original message -------- From: Victor Efimov notifications@github.com Date: 09/17/2015 10:26 AM (GMT-08:00) To: vsespb/mt-aws-glacier mt-aws-glacier@noreply.github.com Cc: irvintim irvin@nettempo.com Subject: Re: [mt-aws-glacier] retrieval issues: TreeHash doesn't match (#115)
yes, just in case, deploy instructions for centos6 here https://github.com/vsespb/mt-aws-glacier#rhelcentos-6
— Reply to this email directly or view it on GitHub.
What does this tell us?
Bug somewhere in old versionsof LWP::UserAgent shipped with Rhel6 (not necessary HTTP-only; when you installed HTTPS you also updated HTTP part). However bug is 1) intermitent 2) only affect data content, but not data length (no truncation, just damage). There are a lot of bug reports for LWP but I don't see anything like that. Also I checked diff with new versions - nothing like that too. So I still suspect maybe broken router or proxy on your side.
However maybe I'll install CentOS6 on VM and try to stresstest (I have script for that). Let's leave ticket open for now.
In retrieving files from glacier using mtglacier, we often get an error that the TreeHash doesn't match. Our job is configured to retry the retrieval when it fails, so it may get this error 10 or 15 times, and then finally a retrieval will succeed. This happens about once every dozen or so retrieval requests.
The offset in the file where the TreeHash mismatch occurs is in a different location on each failure, and since the file eventually is restored it appears to not be corrupt at glacier. So, something is wrong with the transfer or the TreeHash calculation routine.
ERROR (child 26241): TreeHash for received segment of file "/data/glacier-stg/86346" (position 2550136832, size 134217728) does not match. TreeHash reported by server af0b2467868b783ab298faa03f58133cdff7b88309c0fcb928bae650420d6a8f, Calculated TreeHash bfed84480067e86258b3e14026a8f0d2362286459e31389674b632af9b66348e