peelframework / peel

Peel is a framework that helps you to define, execute, analyze, and share experiments for distributed systems and algorithms.
http://peel-framework.org
Apache License 2.0
27 stars 32 forks

Order of system dependency execution during setup / teardown #96

Closed · noproblem666 closed this issue 8 years ago

noproblem666 commented 8 years ago

Hi,

this issue is related to #93.

I have configured the following system bean flink-0.10.2 in systems.xml:

<!-- Flink (overridden 'flink-0.10.2' bean that depends on 'hdfs-2.7.1') -->
    <bean id="flink-0.10.2" class="org.peelframework.flink.beans.system.Flink" parent="system">
        <constructor-arg name="version" value="0.10.2"/>
        <constructor-arg name="configKey" value="flink" />
        <constructor-arg name="lifespan" value="EXPERIMENT"/>
        <constructor-arg name="dependencies">
            <set value-type="org.peelframework.core.beans.system.System">
                <ref bean="hdfs-2.7.1"/>
                <ref bean="yarn-2.7.1"/>
            </set>
        </constructor-arg>
    </bean>

When I run ./peel.sh sys:setup flink-0.10.2, I get the following log output:

16-08-01 14:14:08 [INFO] ############################################################
16-08-01 14:14:08 [INFO] #           PEEL EXPERIMENTS EXECUTION FRAMEWORK           #
16-08-01 14:14:08 [INFO] ############################################################
16-08-01 14:14:08 [INFO] Setting up system 'flink-0.10.2' and dependencies with SUITE or EXPERIMENT lifespan
16-08-01 14:14:08 [INFO] Constructing dependency graph for system 'flink-0.10.2'
16-08-01 14:14:08 [INFO] Loading configuration for system 'flink-0.10.2'
16-08-01 14:14:08 [INFO] +-- Loading resource reference.peel.conf
16-08-01 14:14:08 [INFO] +-- Loading resource reference.yarn-2.7.1.conf
16-08-01 14:14:08 [INFO] +-- Loading file /usr/local/share/peel/peel-wordcount/config/yarn-2.7.1.conf
16-08-01 14:14:08 [INFO] +-- Skipping file /usr/local/share/peel/peel-wordcount/config/hosts/flink-s0/yarn-2.7.1.conf (does not exist)
16-08-01 14:14:08 [INFO] +-- Loading resource reference.flink-0.10.2.conf
16-08-01 14:14:08 [INFO] +-- Loading file /usr/local/share/peel/peel-wordcount/config/flink-0.10.2.conf
16-08-01 14:14:08 [INFO] +-- Skipping file /usr/local/share/peel/peel-wordcount/config/hosts/flink-s0/flink-0.10.2.conf (does not exist)
16-08-01 14:14:08 [INFO] +-- Loading resource reference.hdfs-2.7.1.conf
16-08-01 14:14:08 [INFO] +-- Loading file /usr/local/share/peel/peel-wordcount/config/hdfs-2.7.1.conf
16-08-01 14:14:08 [INFO] +-- Skipping file /usr/local/share/peel/peel-wordcount/config/hosts/flink-s0/hdfs-2.7.1.conf (does not exist)
16-08-01 14:14:08 [INFO] +-- Loading file /usr/local/share/peel/peel-wordcount/config/application.conf
16-08-01 14:14:08 [INFO] +-- Skipping file /usr/local/share/peel/peel-wordcount/config/hosts/flink-s0/application.conf (does not exist)
16-08-01 14:14:08 [INFO] +-- Loading current runtime values as configuration
16-08-01 14:14:08 [INFO] +-- Loading system properties as configuration
16-08-01 14:14:08 [INFO] `-- Resolving configuration
16-08-01 14:14:08 [INFO] Starting system 'yarn-2.7.1'
16-08-01 14:14:08 [INFO] creating directory /usr/local/share/peel/peel-wordcount/systems on remote host flink-s2
16-08-01 14:14:08 [INFO] creating directory /usr/local/share/peel/peel-wordcount/systems on remote host flink-s1
16-08-01 14:14:08 [INFO] rsync -a /usr/local/share/peel/peel-wordcount/systems/hadoop-2.7.1 hduser@flink-s1:/usr/local/share/peel/peel-wordcount/systems --exclude hadoop-2.7.1/logs/*
16-08-01 14:14:08 [INFO] rsync -a /usr/local/share/peel/peel-wordcount/systems/hadoop-2.7.1 hduser@flink-s2:/usr/local/share/peel/peel-wordcount/systems --exclude hadoop-2.7.1/logs/*
16-08-01 14:14:11 [INFO] Waiting for nodes to connect
16-08-01 14:14:11 [INFO] Connected 0 from 2 nodes
16-08-01 14:14:17 [INFO] Connected 2 from 2 nodes
16-08-01 14:14:17 [INFO] System 'yarn-2.7.1' is now up and running
16-08-01 14:14:17 [INFO] Starting system 'flink-0.10.2'
16-08-01 14:14:17 [INFO] creating directory /usr/local/share/peel/peel-wordcount/systems on remote host flink-s1
16-08-01 14:14:17 [INFO] creating directory /usr/local/share/peel/peel-wordcount/systems on remote host flink-s2
16-08-01 14:14:18 [INFO] rsync -a /usr/local/share/peel/peel-wordcount/systems/flink-0.10.2 hduser@flink-s1:/usr/local/share/peel/peel-wordcount/systems --exclude flink-0.10.2/log/*
16-08-01 14:14:18 [INFO] rsync -a /usr/local/share/peel/peel-wordcount/systems/flink-0.10.2 hduser@flink-s2:/usr/local/share/peel/peel-wordcount/systems --exclude flink-0.10.2/log/*
16-08-01 14:14:18 [INFO] Initializing Flink tmp directories '/usr/local/share/peel/data/flink/tmp' at flink-s1
16-08-01 14:14:18 [INFO] Initializing Flink tmp directories '/usr/local/share/peel/data/flink/tmp' at flink-s2
16-08-01 14:14:22 [INFO] Waiting for nodes to connect
16-08-01 14:14:22 [INFO] Connected 0 from 2 nodes
16-08-01 14:14:28 [INFO] System 'flink-0.10.2' is now up and running
16-08-01 14:14:29 [INFO] Starting system 'hdfs-2.7.1'
16-08-01 14:14:29 [INFO] creating directory /usr/local/share/peel/peel-wordcount/systems on remote host flink-s1
16-08-01 14:14:29 [INFO] creating directory /usr/local/share/peel/peel-wordcount/systems on remote host flink-s2
16-08-01 14:14:29 [INFO] rsync -a /usr/local/share/peel/peel-wordcount/systems/hadoop-2.7.1 hduser@flink-s1:/usr/local/share/peel/peel-wordcount/systems --exclude hadoop-2.7.1/logs/*
16-08-01 14:14:29 [INFO] rsync -a /usr/local/share/peel/peel-wordcount/systems/hadoop-2.7.1 hduser@flink-s2:/usr/local/share/peel/peel-wordcount/systems --exclude hadoop-2.7.1/logs/*
16-08-01 14:14:29 [INFO] Formatting namenode
16-08-01 14:14:32 [INFO] Initializing HDFS data directory '/usr/local/share/peel/data/hadoop-2/data' at flink-s1
16-08-01 14:14:32 [INFO] Initializing HDFS data directory '/usr/local/share/peel/data/hadoop-2/data' at flink-s2
16-08-01 14:14:52 [INFO] Waiting for nodes to connect
16-08-01 14:14:55 [INFO] Connected 0 from 2 nodes, safemode is OFF
16-08-01 14:15:07 [INFO] System 'hdfs-2.7.1' is now up and running

As you can see, the startup order is: YARN, Flink, HDFS.

When I run ./peel.sh exp:run wordcount.default wordcount.flink, the order changes to: HDFS, Flink, YARN.

This behavior seems strange to me: I would expect Peel to start the system dependencies (HDFS and YARN in this case) before starting the dependent system bean (flink-0.10.2). From my point of view, the order should therefore be: YARN, HDFS, Flink.

Can you tell me if this behavior is intended and how I can fix this?

Thanks!

aalexandrov commented 8 years ago

Can you tell me if this behavior is intended and how I can fix this?

Seems like a bug.

aalexandrov commented 8 years ago

@noproblem666 Can you list the bean config for the wordcount.default suite and the wordcount.flink experiment?

noproblem666 commented 8 years ago

You can find my current experiment configuration here.

akunft commented 8 years ago

I had some thoughts related to this issue lately. Should we introduce ordered dependencies, i.e. change from Set to List, for the transitive dependencies of a system?

I have no real problem with the current implementation, but, for instance, when we tear down an experiment running Spark on HDFS, we (randomly) shut down HDFS before Spark. That should not happen, since Spark depends on HDFS; teardown should basically be the reverse of the setup phase.

I think the same applies in the example above, where the dependency chain should be Flink, YARN, HDFS, meaning: set up HDFS, then YARN, then Flink.
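To make the proposal concrete, here is a sketch (not a committed API change) of what the bean definition above could look like with Spring's list element instead of set, which preserves declaration order:

```xml
<constructor-arg name="dependencies">
    <list value-type="org.peelframework.core.beans.system.System">
        <!-- declaration order now carries meaning: HDFS before YARN -->
        <ref bean="hdfs-2.7.1"/>
        <ref bean="yarn-2.7.1"/>
    </list>
</constructor-arg>
```

With a set, the iteration order of the dependencies is unspecified, which would explain why the two commands above start HDFS and YARN in different orders.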

aalexandrov commented 8 years ago

This sounds more like a bug in the set-based representation.

Setup happens in reverse topological order (dependencies before dependent systems) and teardown in topological order (dependent systems before their dependencies).

If that is not what happens, the bug is either in the framework or in the configuration.
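The intended ordering can be sketched as a post-order DFS over the dependency graph. This is a hypothetical illustration, not Peel's actual code; the names DependencyOrder, setupOrder, and teardownOrder are made up for the example:

```scala
// Hypothetical sketch: topological ordering of a system dependency graph.
// Setup emits dependencies before dependents; teardown is the exact reverse.
object DependencyOrder {
  // Edges point from a system to the systems it depends on.
  val deps: Map[String, List[String]] = Map(
    "flink-0.10.2" -> List("hdfs-2.7.1", "yarn-2.7.1"),
    "hdfs-2.7.1"   -> Nil,
    "yarn-2.7.1"   -> Nil
  )

  // Post-order DFS: a system is appended only after all its dependencies.
  def setupOrder(root: String): List[String] = {
    var visited = Set.empty[String]
    var order   = List.empty[String]
    def visit(s: String): Unit =
      if (!visited(s)) {
        visited += s
        deps.getOrElse(s, Nil).foreach(visit)
        order = s :: order // prepend, so the root ends up last after reverse
      }
    visit(root)
    order.reverse // dependencies before dependents
  }

  def teardownOrder(root: String): List[String] =
    setupOrder(root).reverse // dependents before dependencies

  def main(args: Array[String]): Unit = {
    println(setupOrder("flink-0.10.2"))    // List(hdfs-2.7.1, yarn-2.7.1, flink-0.10.2)
    println(teardownOrder("flink-0.10.2")) // List(flink-0.10.2, yarn-2.7.1, hdfs-2.7.1)
  }
}
```

Note that if the dependencies were stored in an unordered Set rather than a List, the relative order of hdfs-2.7.1 and yarn-2.7.1 would be arbitrary, although flink-0.10.2 would still correctly come last during setup.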

aalexandrov commented 8 years ago

Have you added HDFS as a dependency for Spark in your configuration?

akunft commented 8 years ago

Yes

aalexandrov commented 8 years ago

Sorry, I think we got to the bottom of the original issue: it was related to the order in which the different configuration layers are loaded.