noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

Segfault from mongodb in noobaa-core #5666

Open chancez opened 5 years ago

chancez commented 5 years ago

Environment info

Actual behavior

  1. MongoDB crashed and is segfaulting: https://gist.github.com/chancez/e4231d28a22dd2b82461eb040336b977

Expected behavior

  1. MongoDB should not segfault

Steps to reproduce

  1. aws --profile noobaa --endpoint-url https://my-noobaa-s3-elb.example.org:443 --no-verify-ssl s3 rm --recursive s3://metering-test-1/

More information - Screenshots / Logs / Other output

guymguym commented 5 years ago

Here is the full log of noobaa-server:

https://gist.github.com/chancez/66bac74c81d326a4398ac8acf144139b

jackyalbo commented 5 years ago

Tried reproducing the issue with the same setup: OpenShift 4.1 with NooBaa installed on it via the operator. I couldn't reproduce it. I tried the following:

  1. Created a new bucket called metering-test-1 (using an s3 client)
  2. Uploaded 1,000 files of 1MB each to it (using the aws-sdk client for node.js)
  3. Deleted all files using the provided command:
    aws --profile noobaa --endpoint-url https://my-noobaa-s3-elb.example.org:443 --no-verify-ssl s3 rm --recursive s3://metering-test-1/
  4. Everything was deleted successfully.
  5. Tried again with 1,000 files of 50MB each, and then with ~10K files of 500KB each; everything still worked.

@chancez If you can point out a step I missed or any differences in your setup, that would be great.

chancez commented 5 years ago

Oh, I also used a node pool plus cloud storage at the time: Azure, with 3 pods. Sorry, I totally forgot about that.

chancez commented 5 years ago

I just got this again using only regular node pools, and no deletions of data happening. The only thing that happened was some nodes were rebooting during this time.

chancez commented 5 years ago

Honestly it mostly seems like a mongodb issue overall.

chancez commented 5 years ago

Logs https://gist.github.com/chancez/27eb74592410a3a4fc84174a659441f2

guymguym commented 5 years ago

The logs show mapReduce calls nearby, so it might be that one of the mapReduce calls is causing it. For example, I see this issue about OOM in the mapReduce js engine: https://jira.mongodb.org/browse/SERVER-28521

But if this is unrelated to the mapReduce that precedes it, then perhaps it is one of these general segfaults: https://jira.mongodb.org/browse/SERVER-38791 https://jira.mongodb.org/browse/SERVER-38776 (there are more)...

chancez commented 5 years ago

You're right, I do see

2019-10-16T19:49:49.069+0000 I COMMAND  [conn6] command nbcore.objectparts command: find { find: "objectparts", filter: { obj: ObjectId('5da7745871c75a0a59a47c4d') }, projection: { _id: 0, chunk: 1 }, returnKey: false, showRecordId: false, $db: "nbcore" } planSummary: COLLSCAN keysExamined:0 docsExamined:679550 cursorExhausted:1 numYields:5309 nreturned:0 reslen:91 locks:{ Global: { acquireCount: { r: 10620 } }, Database: { acquireCount: { r: 5310 } }, Collection: { acquireCount: { r: 5310 } } } protocol:op_msg 299ms

each time before the crash, reliably; 5da7745871c75a0a59a47c4d is always the ObjectId preceding the crash.
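The `planSummary: COLLSCAN` with `docsExamined:679550` in the log line above means the `find` on `objectparts` ran as a full collection scan, i.e. without an index on `obj`. As a quick illustration (not part of noobaa; the helper name is hypothetical), a mongod log can be scanned for such collection scans like this:

```python
import re

# Hypothetical helper: flag slow-query log lines that ran as full
# collection scans (planSummary: COLLSCAN), which indicate a missing index.
COLLSCAN_RE = re.compile(
    r"command (?P<ns>\S+) .*planSummary: COLLSCAN .*docsExamined:(?P<docs>\d+)"
)

def find_collscans(log_lines):
    """Return (namespace, docsExamined) for each COLLSCAN slow-query line."""
    hits = []
    for line in log_lines:
        m = COLLSCAN_RE.search(line)
        if m:
            hits.append((m.group("ns"), int(m.group("docs"))))
    return hits
```

The usual fix for a scan like this would be an index on the filtered field (e.g. `db.objectparts.createIndex({ obj: 1 })` in the mongo shell), assuming noobaa's schema doesn't already define one that the query is failing to use.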

kanadaj commented 2 years ago

This should be easily fixable by using a newer MongoDB image...

guymguym commented 2 years ago

Hey @chancez @kanadaj That's right, we have been lagging behind on mongodb upgrades for a long while, which is never a good thing. I am interested to know whether you use mongodb in particular, and whether we should keep maintaining the mongodb backend option (Red Hat switched to postgres only as noobaa-db). Would be great to hear from you!

kanadaj commented 2 years ago

@guymguym I'm using the operator myself, and I'm not quite sure whether it supports postgres properly, since it still defaults to mongo, nor whether the operator supports migrating from mongo to postgres.

guymguym commented 2 years ago

@kanadaj Yes, the operator supports postgres and it migrates from mongo. This has been available since release 5.7 (the latest is now 5.8).

@dannyzaken @jackyalbo Where did you document the steps required to change an existing installation? I know this mostly requires just changing the noobaa CR spec.dbType to postgres, and the operator will take action from there, but I am not sure whether there are more spec fields that need to change (dbImage/dbResources). Is that all that needs to be done? Is there any upstream wiki/doc describing the process? Thanks

kanadaj commented 2 years ago

@guymguym Okay, so migration happens automatically when switching, but I had to manually set spec.dbImage to centos/postgresql-12-centos7, because even with spec.dbType set to postgres on operator 5.8 it was still using the mongo image (strangely, it still seems to default to mongo).
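Putting the two comments above together, the CR change might look like the following sketch (field names and the dbImage value are taken from this thread; the exact NooBaa CRD schema may differ by operator version):

```yaml
# Hypothetical NooBaa CR fragment; only the db-related fields are shown.
apiVersion: noobaa.io/v1alpha1
kind: NooBaa
metadata:
  name: noobaa
  namespace: openshift-storage
spec:
  dbType: postgres
  # Workaround from this thread: explicitly pin the postgres image,
  # since operator 5.8 was still defaulting to the mongo image.
  dbImage: centos/postgresql-12-centos7
```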

liranmauda commented 2 years ago

Hi @chancez @kanadaj Do we still need this issue, or can we close it?

kanadaj commented 2 years ago

Since Postgres is the recommended installation mode, it's probably good to close. That said, I think the operator might still default to mongo, so it would be ideal to fix that.

liranmauda commented 2 years ago

@kanadaj Can you describe how you installed? I am using the CLI install command, and the default is postgres.

We previously had an issue where an installation using only the CRD defaulted to mongo, but as far as I know it has been resolved by now. Please let me know.

kanadaj commented 2 years ago

I installed just with the CRD and the operator from operatorhub.

github-actions[bot] commented 1 week ago

This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.