minio / minio

MinIO is a high-performance, S3 compatible object store, open sourced under GNU AGPLv3 license.

Minio s3 upload unreliable under load conditions #7330

Closed ajaykushwah closed 4 years ago

ajaykushwah commented 5 years ago

I am testing/evaluating the MinIO server in a single-node configuration as an on-prem repository for a large number of 4 MB objects, on a RAID-enabled filesystem with 400 TB of space. It is a plain filesystem setup, but my node and the underlying filesystem are built for performance and scalability.

My MinIO version is 2019-01-10T00:21:20Z, and I am using a libcurl-based S3 client to populate these objects.

I have uploaded tens of millions of objects on this setup successfully, but I occasionally notice that some objects are missing their metadata when I download them.

On further analysis, I realized that the file containing the object metadata is entirely missing on the backend filesystem. See below.

root@cbs-qa-minio-01:/opt/minio/ajay# ls -l /data/fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data
-rw-r--r-- 1 root root 1953354 Feb 5 09:14 /data/fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data
root@cbs-qa-minio-01:/opt/minio/ajay# ls -l /data/.minio.sys/buckets/fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data
ls: cannot access '/data/.minio.sys/buckets/fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data': No such file or directory

So I checked the MinIO HTTP logs and the MinIO stdout messages that I capture for the objects with missing metadata.

What I notice is a pattern where some PUT requests take much longer than usual for the MinIO server to respond (I have seen response times of up to 15 minutes in some cases). The response in most of these cases is "internal error".

The client thread gives up after 5 minutes of inactivity, closes the connection, and retries on a new connection in a separate thread. The retried upload finishes successfully as seen from the client side.
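For illustration, here is a minimal Go sketch of that timeout-and-retry pattern (the real client is libcurl-based; the endpoint and object name are placeholders, S3 request signing is omitted, and a whole-request timeout stands in for the client's inactivity timeout):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// putWithRetry issues a PUT and, on timeout or error, retries with a fresh
// request, mirroring the client behaviour described above. Auth/signing is
// omitted; the 5-minute whole-request timeout is a simplification of the
// client's inactivity timeout.
func putWithRetry(url string, payload []byte, attempts int) error {
	client := &http.Client{Timeout: 5 * time.Minute}
	var lastErr error
	for i := 1; i <= attempts; i++ {
		req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(payload))
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err != nil {
			lastErr = err // timed out or connection error: retry on a new attempt
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return nil // a 200 OK is what the client treats as a durable upload
		}
		lastErr = fmt.Errorf("attempt %d: PUT returned %s", i, resp.Status)
	}
	return lastErr
}

func main() {
	// Placeholder endpoint and object, for illustration only.
	err := putWithRetry("http://cbs-qa-minio-01:9000/fst03/example.data", []byte("payload"), 3)
	fmt.Println("upload result:", err)
}
```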

Based on the server logs, it seems the second request (the retry of the timed-out request) finishes before the first request, which is still stuck somewhere in the server queue. 10-15 minutes later, the first request finishes and responds with an internal error, but that response hits a connection-reset error because the client closed the first connection long ago.

It seems to me that this first request, which finishes last, somehow invalidates the successful upload of the same object that finished earlier on a retry from the client.

My expectation is that if MinIO has returned OK to the client for a particular object PUT, the object should be returned with all of its metadata during a GET.

If there are any race conditions arising from retries due to network flakiness or server slowness, they should be handled correctly by the server.
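To make the expectation concrete, here is a small sketch using the minio-go v7 SDK (not the libcurl client actually used; endpoint, credentials, and object name are placeholders) that uploads an object with user metadata and then checks that the same metadata comes back on a StatObject:

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	ctx := context.Background()

	// Placeholder endpoint and credentials, for illustration only.
	client, err := minio.New("cbs-qa-minio-01:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}

	payload := []byte("object payload goes here")

	// PUT with user metadata (sent as an X-Amz-Meta-* header).
	_, err = client.PutObject(ctx, "fst03", "snap_156/000008a9.data",
		bytes.NewReader(payload), int64(len(payload)),
		minio.PutObjectOptions{UserMetadata: map[string]string{"Md5": "yrt2ObJZJj1tEEZJ0VchkA=="}})
	if err != nil {
		log.Fatal(err)
	}

	// If the PUT returned OK, the same metadata should come back on Stat/GET.
	st, err := client.StatObject(ctx, "fst03", "snap_156/000008a9.data", minio.StatObjectOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("x-amz-meta-md5:", st.UserMetadata["Md5"])
}
```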

Current Behavior

The downloaded object was missing the metadata that was provided during the upload. The data itself remained OK.

Possible Solution

Steps to Reproduce (for bugs)

I am listing the server log sequence as seen by MinIO.

minio http log:

[REQUEST objectAPIHandlers.PutObjectHandler-fm] [154935716.226429] [2019-02-05 08:59:22 +0000]
PUT /fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data
Host: cbs-qa-minio-01.rtp.openenglab.netapp.com
X-Amz-Date: 20190205T085922Z
X-Amz-Meta-Md5: yrt2ObJZJj1tEEZJ0VchkA==
Content-Length: 1953354
Content-Md5: yrt2ObJZJj1tEEZJ0VchkA==
X-Amz-Content-Sha256: 76e50624bc4e4a1d3e7f3364b510cdedfe5d954baaeeb1669502f4f298364330
User-Agent: APN/1.0 NetApp/1.0 Cloud_Backup_Service/0.5
Accept: */*
Authorization: AWS4-HMAC-SHA256 Credential=O33UJS0EK838YTFPB010/20190205/us-east-1/s3/aws4_request,SignedHeaders=content-md5;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-meta-md5,Signature=41bfbeecf237f4d600e5f0865194bef2890d47d11aeeaf4669816fbc1af5564e

[RESPONSE] [154935716.226429] [2019-02-05 09:14:20 +0000] 500 Internal Server Error
X-Amz-Request-Id: 15806C2156F70B1A
X-Minio-Deployment-Id: 1b82d251-6af6-4f89-9fe6-bacd439448bf
Server: Minio/RELEASE.2019-01-10T00-21-20Z (linux; amd64)
Accept-Ranges: bytes
Content-Type: application/xml
Vary: Origin
X-Xss-Protection: 1; mode=block
Content-Security-Policy: block-all-mixed-content
InternalError We encountered an internal error, please try again. /fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data 15806C2156F70B1A 3L137

Context

[REQUEST objectAPIHandlers.PutObjectHandler-fm] [154935805.971400] [2019-02-05 09:14:19 +0000]
PUT /fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data
Host: cbs-qa-minio-01.rtp.openenglab.netapp.com
X-Amz-Content-Sha256: 76e50624bc4e4a1d3e7f3364b510cdedfe5d954baaeeb1669502f4f298364330
Authorization: AWS4-HMAC-SHA256 Credential=O33UJS0EK838YTFPB010/20190205/us-east-1/s3/aws4_request,SignedHeaders=content-md5;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-meta-md5,Signature=02209159753904d2fad945a3c6ccb95e2bb0900277086d7d4a9bcf718656d700
User-Agent: APN/1.0 NetApp/1.0 Cloud_Backup_Service/0.5
X-Amz-Meta-Md5: yrt2ObJZJj1tEEZJ0VchkA==
Content-Length: 1953354
Accept: */*
Content-Md5: yrt2ObJZJj1tEEZJ0VchkA==
X-Amz-Date: 20190205T091415Z

[RESPONSE] [154935805.971400] [2019-02-05 09:14:20 +0000] 200 OK
Vary: Origin
X-Xss-Protection: 1; mode=block
Content-Security-Policy: block-all-mixed-content
X-Amz-Request-Id: 15806CF24B22969E
X-Minio-Deployment-Id: 1b82d251-6af6-4f89-9fe6-bacd439448bf
Etag: "cabb7639b259263d6d104649d1572190"
Server: Minio/RELEASE.2019-01-10T00-21-20Z (linux; amd64)
Accept-Ranges: bytes

[REQUEST objectAPIHandlers.PutObjectHandler-fm] [154935804.729406] [2019-02-05 09:14:07 +0000]
PUT /fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data
Host: cbs-qa-minio-01.rtp.openenglab.netapp.com
X-Amz-Content-Sha256: 76e50624bc4e4a1d3e7f3364b510cdedfe5d954baaeeb1669502f4f298364330
Content-Md5: yrt2ObJZJj1tEEZJ0VchkA==
Authorization: AWS4-HMAC-SHA256 Credential=O33UJS0EK838YTFPB010/20190205/us-east-1/s3/aws4_request,SignedHeaders=content-md5;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-meta-md5,Signature=f43d4b498ecae19411adcc950b7fc2b5db20094425355d86e58decb092a13137
User-Agent: APN/1.0 NetApp/1.0 Cloud_Backup_Service/0.5
X-Amz-Date: 20190205T090424Z
X-Amz-Meta-Md5: yrt2ObJZJj1tEEZJ0VchkA==
Content-Length: 1953354
Accept: */*

[RESPONSE] [154935804.729406] [2019-02-05 09:14:23 +0000] 500 Internal Server Error
X-Amz-Request-Id: 15806CEF66D9F898
X-Minio-Deployment-Id: 1b82d251-6af6-4f89-9fe6-bacd439448bf
Server: Minio/RELEASE.2019-01-10T00-21-20Z (linux; amd64)
Accept-Ranges: bytes
Content-Type: application/xml
Vary: Origin
X-Xss-Protection: 1; mode=block
Content-Security-Policy: block-all-mixed-content
InternalError We encountered an internal error, please try again. /fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data 15806CEF66D9F898 3L137

[REQUEST objectAPIHandlers.PutObjectHandler-fm] [154935805.974905] [2019-02-05 09:14:19 +0000]
PUT /fst03/78d/pWIuGD53s4x0UrVAwsrYarDTu4c-mRdLoXGOK8WZAXs/V1/0dbfb231-8c9a-fb4b-9765-eb7c87b630f3/snap_1/00000bc9.data
Host: cbs-qa-minio-01.rtp.openenglab.netapp.com
Content-Md5: MZrW8kvvSNeLuImDUFZ1nw==
Content-Length: 1465927
User-Agent: APN/1.0 NetApp/1.0 Cloud_Backup_Service/0.5
X-Amz-Content-Sha256: fd286860146d276aec9c21b172bf0de93efbc39f0bff4d1a128282a54b8635af
X-Amz-Date: 20190205T091415Z
X-Amz-Meta-Md5: MZrW8kvvSNeLuImDUFZ1nw==
Accept: */*
Authorization: AWS4-HMAC-SHA256 Credential=O33UJS0EK838YTFPB010/20190205/us-east-1/s3/aws4_request,SignedHeaders=content-md5;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-meta-md5,Signature=c6100e862f231a3cb4b4cad5ca1aa493a541d6225e0cf3c558b18755d5185296

minio stdout messages:

API: PutObject(bucket=fst03, object=01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data)
Time: 09:14:11 UTC 02/05/2019
RequestID: 15806C2156F70B1A
RemoteHost: 10.193.145.135
UserAgent: APN/1.0 NetApp/1.0 Cloud_Backup_Service/0.5
Error: read tcp 10.193.144.201:9000->10.193.145.135:52100: read: connection reset by peer
1: cmd/fs-v1-helpers.go:349:cmd.fsCreateFile()
2: cmd/fs-v1.go:910:cmd.(*FSObjects).putObject()
3: cmd/fs-v1.go:818:cmd.(*FSObjects).PutObject()
4: cmd/object-handlers.go:1153:cmd.ObjectLayer.PutObject-fm()
5: cmd/object-handlers.go:1274:cmd.objectAPIHandlers.PutObjectHandler()
6: cmd/api-router.go:77:cmd.objectAPIHandlers.PutObjectHandler-fm()
7: pkg/handlers/http-tracer.go:144:handlers.TraceReqHandlerFunc.func1()
8: net/http/server.go:1964:http.HandlerFunc.ServeHTTP()

API: PutObject(bucket=fst03, object=01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data)
Time: 09:14:22 UTC 02/05/2019
RequestID: 15806CEF66D9F898
RemoteHost: 10.193.145.135
UserAgent: APN/1.0 NetApp/1.0 Cloud_Backup_Service/0.5
Error: read tcp 10.193.144.203:9000->10.193.145.135:36546: read: connection reset by peer
1: cmd/fs-v1-helpers.go:349:cmd.fsCreateFile()
2: cmd/fs-v1.go:910:cmd.(*FSObjects).putObject()
3: cmd/fs-v1.go:818:cmd.(*FSObjects).PutObject()
4: cmd/object-handlers.go:1153:cmd.ObjectLayer.PutObject-fm()
5: cmd/object-handlers.go:1274:cmd.objectAPIHandlers.PutObjectHandler()
6: cmd/api-router.go:77:cmd.objectAPIHandlers.PutObjectHandler-fm()
7: pkg/handlers/http-tracer.go:144:handlers.TraceReqHandlerFunc.func1()
8: net/http/server.go:1964:http.HandlerFunc.ServeHTTP()

Regression

Don't know.

Your Environment

* Version used (`minio version`):
  root@cbs-qa-minio-01:/opt/minio/ajay# ../minio version
  Version: 2019-01-10T00:21:20Z
  Release-Tag: RELEASE.2019-01-10T00-21-20Z
  Commit-ID: ed1275a06345f578c3a21fd62458f3c7c96053ac
  root@cbs-qa-minio-01:/opt/minio/ajay# uname -a
  Linux cbs-qa-minio-01 4.15.0-44-generic #47-Ubuntu SMP Mon Jan 14 11:26:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  root@cbs-qa-minio-01:/opt/minio/ajay#
* Environment name and version (e.g. nginx 1.9.1):
* Link to your project:
harshavardhana commented 5 years ago

From what it looks like, MinIO is getting an error from the backend here. @ajaykushwah, do you have, for example, tcpdumps available for us to look at the traffic?

ajaykushwah commented 5 years ago

@harshavardhana, thanks for the quick response.

No, I don't have a tcpdump, and it would not be practical to catch this in action. It happens very infrequently with no pattern, sometimes going days without an issue, and a tcpdump would be huge if it ran that long. We could try it as a last option if nothing else is obvious.

I can attach the client-side logs for the PUT request I mentioned above to show how the client is retrying, if that helps. It is a simple case of timing out after 5 minutes, closing the connection, and retrying.

What I am most interested in is the setting for PUT/operation timeouts on the MinIO server side, so that I can either change it to 5 minutes or ensure the client waits at least that long before giving up on the connection and retrying the request.

Similarly, what is the expected MinIO behavior under these race-like conditions? Does a failed request on a certain object alter the existing state of that object, which was put successfully by previous requests (or by the request that finished first and successfully in a race-like condition)?

harshavardhana commented 5 years ago

What I am most interested in is the setting for PUT/operation timeouts on the MinIO server side, so that I can either change it to 5 minutes or ensure the client waits at least that long before giving up on the connection and retrying the request.

If the network stops reading data for longer than 5 minutes (meaning we can't write fast enough to the backend), then the client times out. Please share the entire MinIO server logs. AFAICS the backend here is giving up under load, so MinIO looks unreliable because your backend is unreliable.
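To illustrate the mechanism with a toy Go sketch (not MinIO code): while a write to a slow backend is blocked, the server is not draining the client connection, so from the client's side the connection looks idle and its own timeout eventually fires. The `slowWriter` type below is a made-up stand-in for a stalling backend.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"time"
)

// slowWriter simulates a backend that stalls under load: each Write call
// blocks for a while before accepting the bytes.
type slowWriter struct{ delay time.Duration }

func (w slowWriter) Write(p []byte) (int, error) {
	time.Sleep(w.delay)
	return len(p), nil
}

func main() {
	body := strings.NewReader(strings.Repeat("x", 1<<20)) // stand-in for a 1 MiB request body

	// While the backend write is blocked, the copy loop is not reading from the
	// source; on a real server that source is the client connection, so the
	// client sees inactivity and eventually resets the connection.
	start := time.Now()
	n, err := io.Copy(slowWriter{delay: 50 * time.Millisecond}, body)
	fmt.Println("copied", n, "bytes in", time.Since(start), "err:", err)
}
```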

Similarly, what is the expected MinIO behavior under these race-like conditions? Does a failed request on a certain object alter the existing state of that object, which was put successfully by previous requests (or by the request that finished first and successfully in a race-like condition)?

There is no race condition here. MinIO doesn't write anything if it cannot complete the write - it is strictly consistent.

ajaykushwah commented 5 years ago

I am leaving the client-side log snippet for the request and retries of the bad object 000008a9.data. (Detailed MinIO logs for this object are available in my opening comment.)

MinIO snip:
[REQUEST objectAPIHandlers.PutObjectHandler-fm] [154935716.226429] [2019-02-05 08:59:22 +0000] PUT /fst03/01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data
Snip ends.

Client log snip:
Log: [09:04:24.254304 1 (40) /retry info] retry 1 of put fst03 / 01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data after receiving 28: "Timeout was reached", status=Connection timed out, httpStatusC
stream:stdout time:February 5th 2019, 09:04:24.254 pid:1 thread_id:40 module_path:/retry level:info msg:retry 1 of put fst03 /01d/...

log: [09:09:28.263051 1 (40) /retry info] retry 2 of put fst03 / 01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data after receiving 28: "Timeout was reached", status=Connection timed out, httpStatusC
stream:stdout time:February 5th 2019, 09:09:28.263 pid:1 thread_id:40 module_path:/retry level:info msg:retry 2 of put fst03/...

log: [09:14:15.475632 1 (40) /retry info] retry 3 of put fst03 / 01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data after receiving 7: "Couldn't connect to server", status=Connection timed out, httpS
stream:stdout time:February 5th 2019, 09:14:15.475 pid:1 thread_id:40 module_path:/retry level:info msg:retry 3 of put fst03/...

log: [09:14:20.813843 1 (40) /retry info] retry 3 of put fst03 / 01d/ay3e02WfEW63K92jDMB42iyhOvJ7446fzKnMvxOzs2A/V1/7b8223bf-3f0c-504e-b50c-ea3716119f85/snap_156/000008a9.data successful
stream:stdout time:February 5th 2019, 09:14:20.813 pid:1 thread_id:40 module_path:/retry level:info msg:retry 3 of put fst03/...

ajaykushwah commented 5 years ago

If the network stops reading data for longer than 5 minutes (meaning we can't write fast enough to the backend), then the client times out. Please share the entire MinIO server logs. AFAICS the backend here is giving up under load, so MinIO looks unreliable because your backend is unreliable.

I am not ruling that out, and I can certainly send you the logs I have. These are only the HTTP trace logs and the errors caught on stdout, which seem pretty verbose. Would that be enough, or is there a special setting needed to create the full system logs?

There is no race condition here. MinIO doesn't write anything if it cannot complete the write - it is strictly consistent.

I have outlined the PUT logs from the client side in my last comment, and they clearly show that the client got a successful PUT response after a few timeouts/retries. How do we reconcile the behavior as seen from the client side?

harshavardhana commented 5 years ago

I have outlined the PUT logs from the client side in my last comment, and they clearly show that the client got a successful PUT response after a few timeouts/retries. How do we reconcile the behavior as seen from the client side?

Of course it can succeed, depending on whether the backend allowed the request, so it's not a race condition. A race condition is something very different.

I am not ruling that out, and I can certainly send you the logs I have. These are only the HTTP trace logs and the errors caught on stdout, which seem pretty verbose. Would that be enough, or is there a special setting needed to create the full system logs?

Yes, full system logs and also the MinIO server logs - you can attach them here (filtering out any sensitive information).

harshavardhana commented 5 years ago

My expectation is that if MinIO has returned OK to the client for a particular object PUT, the object should be returned with all of its metadata during a GET.

If there are any race conditions arising from retries due to network flakiness or server slowness, they should be handled correctly by the server.

A missing fs.json means that the file was deleted by someone, i.e. perhaps it is the client itself sending a DELETE upon a failure? It is also possible that during the rename of fs.json we got an error from the backend - but we make the NAS gateway resilient enough to ignore that, because the FS backend is allowed to work without fs.json.
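To illustrate (a rough Go sketch of the write-temp-then-rename pattern, not MinIO's actual fs-v1 code), a failed rename can leave the object's data file in place while its metadata file never appears:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeMetaAtomic shows the write-temp-then-rename pattern commonly used for
// metadata files such as fs.json. This is an illustrative sketch only.
func writeMetaAtomic(dir, name string, data []byte) error {
	tmp := filepath.Join(dir, name+".tmp")
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	// If this rename fails (e.g. a backend error under load), the object's data
	// file can already exist while its metadata file is missing - the symptom above.
	return os.Rename(tmp, filepath.Join(dir, name))
}

func main() {
	dir, _ := os.MkdirTemp("", "meta-demo")
	defer os.RemoveAll(dir)

	err := writeMetaAtomic(dir, "fs.json", []byte(`{"meta":{"x-amz-meta-md5":"..."}}`))
	fmt.Println("rename result:", err)
}
```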

ajaykushwah commented 5 years ago

Of course it can succeed, depending on whether the backend allowed the request, so it's not a race condition. A race condition is something very different.

By race, I meant that the requestID of the original request finishes later than the retry requestID, as seen in the MinIO HTTP trace logs. But I am obviously not a MinIO expert here, so I will let the experts (you) determine what is going on.

Yes, full system logs and also the MinIO server logs - you can attach them here (filtering out any sensitive information).

Just reconfirming whether the logs collected with export MINIO_HTTP_TRACE=.. are good enough as MinIO server logs.

If there is another setting to collect more detailed MinIO logs, please let me know.

ajaykushwah commented 5 years ago

A missing fs.json means that the file was deleted by someone, i.e. perhaps it is the client itself sending a DELETE upon a failure?

I checked both the client and MinIO HTTP logs. There is no DELETE call on either the sender or the receiver side.

It is also possible that during the rename of fs.json we got an error from the backend - but we make the NAS gateway resilient enough to ignore that, because the FS backend is allowed to work without fs.json.

This is more likely. Would there be an error or log message generated in that case?

ajaykushwah commented 5 years ago

nohup.out.gz

stdout captured from the MinIO server

ajaykushwah commented 5 years ago

HTTP trace logs from around the time of the upload of the object in question: minio_http_frag_35.gz

ajaykushwah commented 5 years ago

System CPU:

root@cbs-qa-minio-01:/home/cbs# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  1
Core(s) per socket:  6
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
Stepping:            1
CPU MHz:             1197.139
CPU max MHz:         1700.0000
CPU min MHz:         1200.0000
BogoMIPS:            3391.95
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            15360K
NUMA node0 CPU(s):   0-5
NUMA node1 CPU(s):   6-11
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts flush_l1d

System memory:
root@cbs-qa-minio-01:/opt/minio/ajay# free -h
              total        used        free      shared  buff/cache   available
Mem:            62G        2.6G         21G        672K         38G         59G
Swap:          8.0G         22M        8.0G

File system:
root@cbs-qa-minio-01:/opt/minio/ajay# df -h /data
Filesystem              Size  Used  Avail  Use%  Mounted on
/dev/mapper/data-lvol0  748T  262T  487T   35%   /data

root@cbs-qa-minio-01:/opt/minio/ajay# df -i
Filesystem              Inodes       IUsed      IFree        IUse%  Mounted on
/dev/mapper/data-lvol0  16063177664  393420568  15669757096  3%     /data

root@cbs-qa-minio-01:/opt/minio/ajay# file -sL /dev/mapper/data-lvol0
/dev/mapper/data-lvol0: SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)

The filesystem syslog does not have anything from MinIO.

It does have the occasional log messages reproduced below, which I am trying to root-cause, but none of the bad objects seem to have occurred in the vicinity of these swapper/allocation failures.

root@cbs-qa-minio-01:/opt/minio/ajay# zcat /var/log/syslog.7
...

Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763873] swapper/10 cpuset=/ mems_allowed=0-1
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763885] CPU: 10 PID: 0 Comm: swapper/10 Not tainted 4.15.0-44-generic #47-Ubuntu
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763887] Hardware name: FUJITSU PRIMERGY RX2510 M2/D3279-H1, BIOS V5.0.0.11 R1.20.0 for D3279-H1x 06/15/2018
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763889] Call Trace:
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763892]
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763906] dump_stack+0x63/0x8b
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763915] warn_alloc+0xff/0x1a0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763920] alloc_pages_slowpath+0xddd/0xe20
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763926] alloc_pages_nodemask+0x263/0x280
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763935] alloc_pages_current+0x6a/0xe0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763940] new_slab+0x2d4/0x670
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763944] _slab_alloc+0x357/0x4b0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763952] ? build_skb+0x2b/0xf0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763958] ? tcp_gro_receive+0x1e2/0x2e0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763962] slab_alloc+0x20/0x40
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763966] ? __slab_alloc+0x20/0x40
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763969] kmem_cache_alloc+0x17b/0x1b0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763972] ? build_skb+0x2b/0xf0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763976] __build_skb+0x2b/0xf0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763980] build_skb+0x17/0x90
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.763999] i40e_clean_rx_irq+0x72d/0xee0 [i40e]
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764003] ? skb_free_head+0x21/0x30
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764013] i40e_napi_poll+0x402/0x850 [i40e]
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764018] net_rx_action+0x140/0x3a0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764025] __do_softirq+0xe4/0x2bb
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764033] irq_exit+0xb8/0xc0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764036] do_IRQ+0x8a/0xd0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764043] common_interrupt+0x84/0x84
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764044]
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764052] RIP: 0010:cpuidle_enter_state+0xa7/0x2f0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764054] RSP: 0018:ffff9b914634fe68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd9
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764058] RAX: ffff89b57fd22880 RBX: 000884d03f4d3f95 RCX: 000000000000001f
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764060] RDX: 000884d03f4d3f95 RSI: ffffffebf5a0a2f2 RDI: 0000000000000000
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764062] RBP: ffff9b914634fea8 R08: 00000000ffffffff R09: 000000000000000f
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764063] R10: ffff9b914634fe38 R11: 000000000000002d R12: ffffbb913fd00238
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764065] R13: 0000000000000001 R14: ffffffffab971c38 R15: 0000000000000000
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764071] ? cpuidle_enter_state+0x97/0x2f0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764075] cpuidle_enter+0x17/0x20
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764080] call_cpuidle+0x23/0x40
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764084] do_idle+0x18c/0x1f0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764088] cpu_startup_entry+0x73/0x80
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764093] start_secondary+0x1ab/0x200
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764098] secondary_startup_64+0xa5/0xb0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764103] SLUB: Unable to allocate memory on node -1, gfp=0x1080020(GFP_ATOMIC)
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764106] cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764108] node 0: slabs: 234, objs: 7072, free: 6
Feb 26 18:45:07 cbs-qa-minio-01 kernel: [2397829.764111] node 1: slabs: 196, objs: 6080, free: 38

ajaykushwah commented 5 years ago

A missing fs.json means that the file was deleted by someone, i.e. perhaps it is the client itself sending a DELETE upon a failure?

I checked both the client and MinIO HTTP logs. There is no DELETE call on either the sender or the receiver side.

It is also possible that during the rename of fs.json we got an error from the backend - but we make the NAS gateway resilient enough to ignore that, because the FS backend is allowed to work without fs.json.

This is more likely. Would there be an error or log message generated in that case?

@harshavardhana

Any suggestions on where the fs.json files from potentially failed rename operations might be left sitting?

I checked under the bucket directory (/data/fst03) and under the metadata directory (/data/.minio.sys) for files possibly left behind by a rename failure, but came up blank.

Also, are the logs I have provided so far sufficient or do I need to upload something more?

Thanks.

harshavardhana commented 4 years ago

Closing this issue as stale