mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Database usarray2012 could not be created :: caused by :: No shards found #453

Open Aristoeu opened 1 year ago

Aristoeu commented 1 year ago

There is an error when I run dataprep_v2.ipynb with distributed_node.sh on Frontera. I used the newest image for the master branch.

```
OperationFailure: Database usarray2012 could not be created :: caused by :: No shards found, full error: {'ok': 0.0, 'errmsg': 'Database usarray2012 could not be created :: caused by :: No shards found', 'code': 70, 'codeName': 'ShardNotFound', '$clusterTime': {'clusterTime': Timestamp(1692567005, 1), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1692567005, 1)}
```

Everything works fine with single_node.sh, so I think the problem is in the distributed_node.sh script: something changed when we updated MongoDB to 6.0 along with the other packages. I tried adding write concerns, but that still didn't work.
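For reference, this is roughly how I tried passing a write concern through pymongo (a sketch only; the URI is a placeholder, not the actual mongos address from the job script):

```python
# Sketch of the write-concern attempt; the URI is a placeholder for the
# mongos address the job script actually uses.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client.get_database(
    "usarray2012", write_concern=WriteConcern(w="majority", wtimeout=5000)
)
# This does not help: ShardNotFound is raised while mongos tries to create
# the database implicitly on the first insert, before the write concern
# is ever consulted.
```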

The full error:

```
OperationFailure                          Traceback (most recent call last)
Cell In[11], line 6
      4 for dfile in filelist:
      5     t0=time()
----> 6     db.index_mseed_file(dfile,dir)
      7     t1=time()
      8     dt=t1-t0

File /opt/conda/lib/python3.10/site-packages/mspasspy/db/database.py:5769, in Database.index_mseed_file(self, dfile, dir, collection, segment_time_tears, elog_collection, return_ids, normalize_channel, verbose)
   5767 doc["dir"] = odir
   5768 doc["dfile"] = dfile
-> 5769 thisid = dbh.insert_one(doc).inserted_id
   5770 ids_affected.append(thisid)
   5771 if normalize_channel:
   5772     # these quantities are always defined unless there was a read error
   5773     # and I don't think we can get here if we had a read error.

File /opt/conda/lib/python3.10/site-packages/pymongo/collection.py:639, in Collection.insert_one(self, document, bypass_document_validation, session, comment)
    635     document["_id"] = ObjectId()  # type: ignore[index]
    637 write_concern = self._write_concern_for(session)
    638 return InsertOneResult(
--> 639     self._insert_one(
    640         document,
    641         ordered=True,
    642         write_concern=write_concern,
    643         op_id=None,
    644         bypass_doc_val=bypass_document_validation,
    645         session=session,
    646         comment=comment,
    647     ),
    648     write_concern.acknowledged,
    649 )

File /opt/conda/lib/python3.10/site-packages/pymongo/collection.py:579, in Collection._insert_one(self, doc, ordered, write_concern, op_id, bypass_doc_val, session, comment)
    567     result = sock_info.command(
    568         self.database.name,
    569         command,
    (...)
    574         retryable_write=retryable_write,
    575     )
    577     _check_write_command_response(result)
--> 579 self.database.client._retryable_write(acknowledged, _insert_command, session)
    581 if not isinstance(doc, RawBSONDocument):
    582     return doc.get("_id")

File /opt/conda/lib/python3.10/site-packages/pymongo/mongo_client.py:1493, in MongoClient._retryable_write(self, retryable, func, session)
   1491 """Internal retryable write helper."""
   1492 with self._tmp_session(session) as s:
-> 1493     return self._retry_with_session(retryable, func, s, None)

File /opt/conda/lib/python3.10/site-packages/pymongo/mongo_client.py:1360, in MongoClient._retry_with_session(self, retryable, func, session, bulk)
   1350 """Execute an operation with at most one consecutive retries
   1351
   1352 Returns func()'s return value on success. On error retries the same
   (...)
   1355 Re-raises any exception thrown by func().
   1356 """
   1357 retryable = (
   1358     retryable and self.options.retry_writes and session and not session.in_transaction
   1359 )
-> 1360 return self._retry_internal(retryable, func, session, bulk)

File /opt/conda/lib/python3.10/site-packages/pymongo/_csot.py:106, in apply.<locals>.csot_wrapper(self, *args, **kwargs)
    104     with _TimeoutContext(timeout):
    105         return func(self, *args, **kwargs)
--> 106 return func(self, *args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/pymongo/mongo_client.py:1401, in MongoClient._retry_internal(self, retryable, func, session, bulk)
   1399         raise last_error
   1400     retryable = False
-> 1401     return func(session, sock_info, retryable)
   1402 except ServerSelectionTimeoutError:
   1403     if is_retrying():
   1404         # The application may think the write was never attempted
   1405         # if we raise ServerSelectionTimeoutError on the retry
   1406         # attempt. Raise the original exception instead.

File /opt/conda/lib/python3.10/site-packages/pymongo/collection.py:567, in Collection._insert_one.<locals>._insert_command(session, sock_info, retryable_write)
    564 if bypass_doc_val:
    565     command["bypassDocumentValidation"] = True
--> 567 result = sock_info.command(
    568     self.database.name,
    569     command,
    570     write_concern=write_concern,
    571     codec_options=self.__write_response_codec_options,
    572     session=session,
    573     client=self.database.client,
    574     retryable_write=retryable_write,
    575 )
    577 _check_write_command_response(result)

File /opt/conda/lib/python3.10/site-packages/pymongo/helpers.py:279, in _handle_reauth.<locals>.inner(*args, **kwargs)
    276 from pymongo.pool import SocketInfo
    278 try:
--> 279     return func(*args, **kwargs)
    280 except OperationFailure as exc:
    281     if no_reauth:

File /opt/conda/lib/python3.10/site-packages/pymongo/pool.py:879, in SocketInfo.command(self, dbname, spec, read_preference, codec_options, check, allowable_errors, read_concern, write_concern, parse_write_concern_error, collation, session, client, retryable_write, publish_events, user_fields, exhaust_allowed)
    877 self._raise_if_not_writable(unacknowledged)
    878 try:
--> 879     return command(
    880         self,
    881         dbname,
    882         spec,
    883         self.is_mongos,
    884         read_preference,
    885         codec_options,
    886         session,
    887         client,
    888         check,
    889         allowable_errors,
    890         self.address,
    891         listeners,
    892         self.max_bson_size,
    893         read_concern,
    894         parse_write_concern_error=parse_write_concern_error,
    895         collation=collation,
    896         compression_ctx=self.compression_context,
    897         use_op_msg=self.op_msg_enabled,
    898         unacknowledged=unacknowledged,
    899         user_fields=user_fields,
    900         exhaust_allowed=exhaust_allowed,
    901         write_concern=write_concern,
    902     )
    903 except (OperationFailure, NotPrimaryError):
    904     raise

File /opt/conda/lib/python3.10/site-packages/pymongo/network.py:166, in command(sock_info, dbname, spec, is_mongos, read_preference, codec_options, session, client, check, allowable_errors, address, listeners, max_bson_size, read_concern, parse_write_concern_error, collation, compression_ctx, use_op_msg, unacknowledged, user_fields, exhaust_allowed, write_concern)
    164     client._process_response(response_doc, session)
    165     if check:
--> 166         helpers._check_command_response(
    167             response_doc,
    168             sock_info.max_wire_version,
    169             allowable_errors,
    170             parse_write_concern_error=parse_write_concern_error,
    171         )
    172 except Exception as exc:
    173     if publish:

File /opt/conda/lib/python3.10/site-packages/pymongo/helpers.py:194, in _check_command_response(response, max_wire_version, allowable_errors, parse_write_concern_error)
    191 elif code == 43:
    192     raise CursorNotFound(errmsg, code, response, max_wire_version)
--> 194 raise OperationFailure(errmsg, code, response, max_wire_version)

OperationFailure: Database usarray2012 could not be created :: caused by :: No shards found, full error: {'ok': 0.0, 'errmsg': 'Database usarray2012 could not be created :: caused by :: No shards found', 'code': 70, 'codeName': 'ShardNotFound', '$clusterTime': {'clusterTime': Timestamp(1692567005, 1), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1692567005, 1)}
```

wangyinz commented 1 year ago

This suggests that the MongoDB cluster is not started correctly. I wonder whether you are using sharding or not. First of all, you don't have to enable sharding. But if you do, you need to check MongoDB's status and see whether the shards were created and connected to mongos correctly. There are also probably some special procedures needed to create a sharded database (I don't remember them well, and this feature was never tested anyway). Please read MongoDB's documentation carefully and see if anything needs to be updated in our startup script within the container.
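For example, something like this (a sketch only; the URI is a placeholder for your mongos address) will show whether any shards are actually registered with the router:

```python
from pymongo import MongoClient

# Connect to the mongos router, not to an individual shard mongod.
client = MongoClient("mongodb://localhost:27017")
result = client.admin.command("listShards")
print(result["shards"])
# The "No shards found" error means this list is empty: the shard mongod
# processes never registered with the config server / mongos.
```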

pavlis commented 1 year ago

You might also want to review the sections in the User Manual on this topic:

  1. The HPC setup document.
  2. The cluster concept document.

Both contain generally useful advice for anyone dealing with configuration issues on any kind of cluster.

Aristoeu commented 1 year ago

After setting DB_SHARDING=false and running mongod again, the problem is solved.
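A quick way to confirm the workers now talk to a plain mongod rather than an empty mongos router (a sketch; the URI is a placeholder):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
# A mongos router reports msg == "isdbgrid" in the "hello" reply;
# a standalone mongod does not set that field.
assert client.admin.command("hello").get("msg") != "isdbgrid"
```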

wangyinz commented 1 year ago

This is expected, as the problem is almost certainly caused by a wrong sharding setup. Still, we need to make sure sharding is set up correctly in this new version.
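For the record, the registration sequence the startup script would need is roughly the following (a sketch only; the host names, replica set name, and shard key are assumptions, not what distributed_node.sh currently does):

```python
from pymongo import MongoClient

# All sharding administration goes through the mongos router.
mongos = MongoClient("mongodb://mongos-host:27017")

# 1. Register each shard with the router before any database is created;
#    ShardNotFound on an insert means this step never happened.
mongos.admin.command("addShard", "shard0rs/shard0-host:27018")

# 2. Enable sharding on the database.
mongos.admin.command("enableSharding", "usarray2012")

# 3. Shard the write-heavy collection on a suitable key, e.g. a hashed _id.
mongos.admin.command(
    "shardCollection", "usarray2012.wf_miniseed", key={"_id": "hashed"}
)
```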