scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

flush api should not fail on no_such_column_family/no_such_keyspace #16095

Open bhalevy opened 10 months ago

bhalevy commented 10 months ago

Nothing prevents keyspaces or tables from being dropped during a flush, so returning no_such_column_family or no_such_keyspace is futile: the end result is the same as if the table had been dropped right after a successful flush, since the flushed sstables would be removed anyhow.
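
A minimal Python sketch of the proposed semantics, for illustration only. It is not Scylla's implementation: `FakeDatabase`, `flush_keyspace`, the `on_flush` hook, and the two exception classes are hypothetical stand-ins for the server-side flush path and for `data_dictionary::no_such_column_family` / `no_such_keyspace`. The idea is that a table or keyspace that disappears while the flush is in progress is skipped rather than failing the whole request.

```python
class NoSuchKeyspace(Exception):
    """Hypothetical stand-in for data_dictionary::no_such_keyspace."""


class NoSuchColumnFamily(Exception):
    """Hypothetical stand-in for data_dictionary::no_such_column_family."""


class FakeDatabase:
    """Toy in-memory model of keyspaces and tables (an assumption, not Scylla code)."""

    def __init__(self, keyspaces, on_flush=None):
        self._keyspaces = {ks: set(tables) for ks, tables in keyspaces.items()}
        # Optional hook, called after each table flush, used to simulate a concurrent DROP.
        self._on_flush = on_flush

    def tables_in(self, keyspace):
        if keyspace not in self._keyspaces:
            raise NoSuchKeyspace(keyspace)
        # Return a snapshot; tables may be dropped after it is taken.
        return sorted(self._keyspaces[keyspace])

    def drop_table(self, keyspace, table):
        self._keyspaces[keyspace].discard(table)

    def flush_table(self, keyspace, table):
        if table not in self._keyspaces.get(keyspace, ()):
            raise NoSuchColumnFamily(f"{keyspace}.{table}")
        if self._on_flush is not None:
            self._on_flush(self, keyspace, table)


def flush_keyspace(db, keyspace):
    """Flush every table in the keyspace, tolerating concurrent drops.

    Returns the names of the tables that were actually flushed.
    """
    try:
        tables = db.tables_in(keyspace)
    except NoSuchKeyspace:
        # The keyspace is gone: there is nothing left to flush.
        return []
    flushed = []
    for table in tables:
        try:
            db.flush_table(keyspace, table)
        except NoSuchColumnFamily:
            # Dropped mid-flush: its sstables would be removed anyhow.
            continue
        flushed.append(table)
    return flushed


def drop_mv1_while_flushing(db, keyspace, table):
    # Simulate a DROP of "mv1" racing with the flush of "base".
    if table == "base":
        db.drop_table(keyspace, "mv1")


db = FakeDatabase({"multi2": ["base", "mv1", "mv2"]}, on_flush=drop_mv1_while_flushing)
print(flush_keyspace(db, "multi2"))  # prints ['base', 'mv2']
```

In this sketch the flush works from a snapshot of the table list, so the race is handled at the point where each table is looked up again; whether a keyspace that never existed should also be treated as success is a separate design question that the sketch does not settle.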

The current behavior causes, for example, the following dtest failure: https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/420/testReport/materialized_views_test/TestMaterializedViews/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split008___test_multi_mvs_on_different_base_tables/

self = <materialized_views_test.TestMaterializedViews object at 0x7fdc16ca94d0>

    @pytest.mark.scylla_mode('!debug')
    def test_multi_mvs_on_different_base_tables(self):
        """ Few keyspaces and every keyspace has a few tables and every table has a few MVs.
            MVs are created on the empty base tables
        """
>       self._multi_mvs_on_different_base_tables_multi_ks(rf=3, tables=3, mvs=4, prefill_start=10000,
                                                          increase_rows=10, populated_table=False)

materialized_views_test.py:1201: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
materialized_views_test.py:1223: in _multi_mvs_on_different_base_tables_multi_ks
    run_in_parallel(proc_functions)
tools/data.py:241: in run_in_parallel
    results = [task.result() for task in tasks]
tools/data.py:241: in <listcomp>
    results = [task.result() for task in tasks]
/usr/lib64/python3.11/concurrent/futures/_base.py:456: in result
    return self.__get_result()
/usr/lib64/python3.11/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
/usr/lib64/python3.11/concurrent/futures/thread.py:58: in run
    result = self.fn(*self.args, **self.kwargs)
materialized_views_test.py:1249: in _multi_mvs_on_different_base_tables
    func()
materialized_views_test.py:1229: in _prefill_base_tables
    base_table.prefill_table(prefill)
tools/tables_view_manager.py:214: in prefill_table
    flush_by_node(self.cluster)
tools/misc.py:287: in flush_by_node
    node.flush()
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:1307: in flush
    super(ScyllaNode, self).flush(ks, table, **kwargs)
../scylla/.local/lib/python3.11/site-packages/ccmlib/node.py:1414: in flush
    self.nodetool(cmd, **kwargs)
../scylla/.local/lib/python3.11/site-packages/ccmlib/scylla_node.py:758: in nodetool
    return super().nodetool(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ccmlib.scylla_node.ScyllaNode object at 0x7fdc17d9f2d0>, cmd = 'flush'
capture_output = True, wait = True, timeout = None, verbose = True

    def nodetool(self, cmd, capture_output=True, wait=True, timeout=None, verbose=True):
        """
        Setting wait=False makes it impossible to detect errors,
        if capture_output is also False. wait=False allows us to return
        while nodetool is still running.
        When wait=True, timeout may be set to a number, in seconds,
        to limit how long the function will wait for nodetool to complete.
        """
        if capture_output and not wait:
            raise common.ArgumentError("Cannot set capture_output while wait is False.")
        env = self.get_env()
        if self.is_scylla() and not self.is_docker():
            host = self.address()
        else:
            host = 'localhost'
        nodetool = self.get_tool('nodetool')

        if not isinstance(nodetool, list):
            nodetool = [nodetool]
        # see https://www.oracle.com/java/technologies/javase/8u331-relnotes.html#JDK-8278972
        nodetool.extend(['-h', host, '-p', str(self.jmx_port), '-Dcom.sun.jndi.rmiURLParsing=legacy'])
        nodetool.extend(cmd.split())
        if verbose:
            self.debug(f"nodetool cmd={cmd} wait={wait} timeout={timeout}")
        if capture_output:
            p = subprocess.Popen(nodetool, universal_newlines=True, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            stdout, stderr = p.communicate(timeout=timeout)
        else:
            p = subprocess.Popen(nodetool, env=env, universal_newlines=True)
            stdout, stderr = None, None

        if wait:
            exit_status = p.wait(timeout=timeout)
            if exit_status != 0:
>               raise NodetoolError(" ".join(nodetool), exit_status, stdout, stderr)
E               ccmlib.node.ToolError: Subprocess /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/323e34e1ed40a8c41a1194c817ec13e38123d1d3/share/cassandra/bin/nodetool -h 127.0.40.1 -p 7199 -Dcom.sun.jndi.rmiURLParsing=legacy flush exited with non-zero status; exit status: 2; 
E               stderr: error: Scylla API server HTTP POST to URL '/storage_service/keyspace_flush/multi2' failed: data_dictionary::no_such_column_family (Can't find a column family with UUID 053f9ed0-84fa-11ee-9cef-f322e5589649)
E               -- StackTrace --
E               java.lang.IllegalStateException: Scylla API server HTTP POST to URL '/storage_service/keyspace_flush/multi2' failed: data_dictionary::no_such_column_family (Can't find a column family with UUID 053f9ed0-84fa-11ee-9cef-f322e5589649)
E                   at com.scylladb.jmx.api.APIClient.getException(APIClient.java:140)
E                   at com.scylladb.jmx.api.APIClient.post(APIClient.java:120)
E                   at com.scylladb.jmx.api.APIClient.post(APIClient.java:130)
E                   at com.scylladb.jmx.api.APIClient.post(APIClient.java:113)
E                   at org.apache.cassandra.service.StorageService.forceKeyspaceFlush(StorageService.java:731)
E                   at jdk.internal.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
E                   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
E                   at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
E                   at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
E                   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
E                   at java.base/sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:260)
E                   at java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
E                   at java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
E                   at java.management/com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
E                   at java.management/com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
E                   at java.management/com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
E                   at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:809)
E                   at java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
E                   at com.scylladb.jmx.utils.APIMBeanServer.invoke(APIMBeanServer.java:188)
E                   at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1466)
E                   at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1307)
E                   at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1399)
E                   at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:827)
E                   at java.base/jdk.internal.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
E                   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
E                   at java.rmi/sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:359)
E                   at java.rmi/sun.rmi.transport.Transport$1.run(Transport.java:200)
E                   at java.rmi/sun.rmi.transport.Transport$1.run(Transport.java:197)
E                   at java.base/java.security.AccessController.doPrivileged(Native Method)
E                   at java.rmi/sun.rmi.transport.Transport.serviceCall(Transport.java:196)
E                   at java.rmi/sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:562)
E                   at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:796)
E                   at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:677)
E                   at java.base/java.security.AccessController.doPrivileged(Native Method)
E                   at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:676)
E                   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
E                   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
E                   at java.base/java.lang.Thread.run(Thread.java:829)

../scylla/.local/lib/python3.11/site-packages/ccmlib/node.py:807: ToolError

bhalevy commented 10 months ago

https://github.com/scylladb/scylladb/pull/15820 changes this area too, so this can be fixed after that PR is merged (or in the same patchset).

Cc @denesb

bhalevy commented 9 months ago

https://github.com/scylladb/scylladb/pull/15820 was merged, so this issue can be worked on

dani-tweig commented 2 months ago

@bhalevy , why did you label it 'ci stability'? seems like a bug/required change in the code

bhalevy commented 2 months ago

> @bhalevy , why did you label it 'ci stability'? seems like a bug/required change in the code

it was labeled as such since it was hit in dtest. being a bug that needs fixing is orthogonal to ci stability, isn't it?

dani-tweig commented 2 months ago

> > @bhalevy , why did you label it 'ci stability'? seems like a bug/required change in the code
>
> it was labeled as such since it was hit in dtest. being a bug that needs fixing is orthogonal to ci stability, isn't it?

yes, I didn't notice that ci stability is only a symptom label.