vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Unable to remove document types upon redeployment. #22755

Closed 107dipan closed 2 years ago

107dipan commented 2 years ago

Describe the bug

The cluster goes down when schemas are removed from it. I currently have an 18-node content cluster. Only one cluster type is defined, and all the schemas are part of it.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy an application with multiple schemas.
  2. After a successful deployment, ingest some data into the cluster.
  3. Remove some of the document types from the cluster.

Expected behavior

The cluster should remain functional.

Screenshots

[2022-05-25 12:15:04.180] WARNING : searchnode proton.transactionlog.server Failed deleting someDocTyoe domain. Exception = IoException: DIRECTORY HAVE CONTENT: rmdir(tls/tls/someDocTyoe, recursive): Failed, errno(39): Directory not empty at rmdir in /builddir/build/BUILD/vespa-7.559.12/vespalib/src/vespa/vespalib/io/fileutil.cpp:580
Backtrace:
  /opt/vespa/lib64/libvespalib.so(vespalib::IoException::IoException(vespalib::stringref, vespalib::IoException::Type, vespalib::stringref, int)+0x41) [0x7f6e017d96f1]
  /opt/vespa/lib64/libvespalib.so(+0x195d74) [0x7f6e01674d74]
  /opt/vespa/lib64/libsearchlib.so(search::transactionlog::TransLogServer::deleteDomain(FRT_RPCRequest)+0x3bf) [0x7f6e0397295f]
  /opt/vespa/lib64/libsearchlib.so(search::transactionlog::TransLogServer::run()+0x34b) [0x7f6e03973f9b]
  /opt/vespa/lib64/libstaging_vespalib.so(document::Runnable::Run(FastOS_ThreadInterface, void*)+0x4b) [0x7f6e0192aa2b]
  /opt/vespa/lib64/libfastos.so(FastOS_ThreadInterface::Hook()+0x109) [0x7f6e014c72f9]
  /opt/vespa/lib64/libfastos.so(FastOS_ThreadHook+0x9) [0x7f6e014c7499]
  /opt/vespa/lib64/vespa/malloc/libvespamalloc.so(+0x11fab) [0x7f6e04950fab]
  /lib64/libpthread.so.0(+0x7ea5) [0x7f6e01094ea5]
  /lib64/libc.so.6(clone+0x6d) [0x7f6df87a1b0d]
[2022-05-25 12:15:04.181] FATAL : searchnode proton.proton.server.proton_disk_layout Failed to remove tls domain someDocTyoe
[2022-05-25 12:15:04.182] ERROR : searchnode proton /builddir/build/BUILD/vespa-7.559.12/searchcore/src/vespa/searchcore/proton/server/proton_disk_layout.cpp:90: Abort called. Reason: Failed to remove tls domain
[2022-05-25 12:15:04.184] WARNING : searchnode stderr /builddir/build/BUILD/vespa-7.559.12/searchcore/src/vespa/searchcore/proton/server/proton_disk_layout.cpp:90: Abort called. Reason: Failed to remove tls domain

Environment (please complete the following information):

Vespa version 7.559.12

107dipan commented 2 years ago

Previous discussion with the Vespa team: https://github.com/vespa-engine/vespa/issues/22755#issue-1248016484

toregge commented 2 years ago

The crash is likely due to readdir() not returning all entries in a directory when some entries are removed during the directory scan.
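To illustrate the hypothesis: POSIX leaves it unspecified whether readdir() returns entries that are added or removed after opendir(), so deleting while scanning depends on filesystem behavior. A minimal sketch of one way to avoid that dependency, assuming a POSIX system, is to collect all names first, close the directory, and only then delete. The names `snapshot_entries` and `remove_tree_snapshot` are hypothetical and are not taken from Vespa's code.

```cpp
#include <dirent.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <string>
#include <vector>

// Hypothetical helper: read every entry name before anything is deleted,
// so the scan never observes a directory that is being modified by us.
static std::vector<std::string> snapshot_entries(const std::string& dir) {
    std::vector<std::string> names;
    DIR* d = opendir(dir.c_str());
    if (d == nullptr) {
        return names;  // missing or unreadable directory: nothing to list
    }
    while (struct dirent* e = readdir(d)) {
        std::string name = e->d_name;
        if (name != "." && name != "..") {
            names.push_back(name);
        }
    }
    closedir(d);
    return names;
}

// Hypothetical helper: recursively remove dir using the snapshot above.
// Returns true if dir was removed or did not exist in the first place.
bool remove_tree_snapshot(const std::string& dir) {
    assert(!dir.empty());
    for (const std::string& name : snapshot_entries(dir)) {
        const std::string path = dir + "/" + name;
        struct stat st;
        if (lstat(path.c_str(), &st) != 0) {
            continue;  // entry vanished in the meantime
        }
        if (S_ISDIR(st.st_mode)) {
            remove_tree_snapshot(path);
        } else {
            unlink(path.c_str());
        }
    }
    // rmdir fails with ENOTEMPTY (errno 39 on Linux) if anything is left,
    // for example a file created by another thread after the snapshot.
    return rmdir(dir.c_str()) == 0 || errno == ENOENT;
}
```

A snapshot does not protect against another thread or process writing into the directory concurrently; in that case the final rmdir() can still fail with "Directory not empty", and the caller would have to retry or report the error.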

geirst commented 2 years ago

We have not been able to reproduce this particular problem. As a possible improvement, we have rewritten the code that deletes directories (as part of removing a document type) to use std::filesystem. See https://github.com/vespa-engine/vespa/pull/22851 (in 7.594.7).
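For readers unfamiliar with std::filesystem, recursive deletion with it can look roughly like the sketch below. This is only an illustration of the standard library call, not the code from the pull request; `remove_domain_dir` is a hypothetical name, and C++17 is assumed.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: remove a directory tree and report failure through
// a return value and message instead of throwing or aborting.
bool remove_domain_dir(const fs::path& dir, std::string& error) {
    assert(!dir.empty());
    std::error_code ec;
    fs::remove_all(dir, ec);  // non-throwing overload; a missing path is not an error
    if (ec) {
        error = ec.message();
        return false;
    }
    // Double-check that nothing reappeared under the path.
    return !fs::exists(dir, ec);
}
```

Using the `std::error_code` overload lets the caller decide whether a failed removal should be retried, logged, or treated as fatal.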