spring-attic / spring-cloud-task-app-starters-composed-task-runner

Task Starter for executing composed tasks
Apache License 2.0

v2.0.2 CTR process runs indefinitely #60

Closed sabbyanandan closed 5 years ago

sabbyanandan commented 5 years ago

As a user, while orchestrating a deeply nested graph on CTR v2.0.2, I'm noticing that the CTR process continues to run even after all the steps have executed successfully with exit-code=0. This behavior is not observed with the v2.0.0 release.

See https://github.com/spring-cloud/spring-cloud-dataflow/issues/2667 for more details.

mminella commented 5 years ago

I believe this is a symptom of #58. If @cppwfs agrees, we can close this one as a duplicate.

sabbyanandan commented 5 years ago

This issue is on the CTR v2.0 line (which builds on SCT v2.0), though, whereas #58 relates to SCT v2.1.

cppwfs commented 5 years ago

These are separate issues.

cppwfs commented 5 years ago

Hello @Rostish, I'm still having trouble reproducing the problem. I ran the following graph (each task was a timestamp task): 1) CTR 2.0.2 on SCDF 1.7.2, 80 times; 2) CTR 2.0.2 on SCDF 2.0, 127 times (the DB used was MySQL).

[Screenshot of the composed task graph, 2019-01-03]

This graph was constructed after reviewing your log and deriving the basic flow of what you were trying to run.
The command I executed looked like this:
java -jar composedtaskrunner-task-2.0.2.RELEASE.jar --spring.cloud.task.closecontextEnabled=true --increment-instance-enabled=true --split-thread-core-pool-size=4 --interval-time-between-checks=1000 --graph="logrunme-1&&logrunme-2&&<logrunme-3||logrunme-4||logrunme-5||logrunme-6>&&<logrunme-7||logrunme-8||logrunme-9>&&logrunme-10&&<logrunme-11||logrunme-12>&&<logrunme-13||logrunme-14>&&logrunme-15&&<logrunme-16||logrunme-17||logrunme-18>&&logrunme-19&&<logrunme-20||logrunme-21||logrunme-22>"

Can you see a difference between my test case above and what you are executing?

Rostish commented 5 years ago

@cppwfs Good day to you!

I pass the following arguments via the REST client launch command:
--dataflow-server-uri: http://10.101.48.150:9494 (could this be connected to the problem?)
--split-thread-core-pool-size: 5 (I see that you use 4)
--increment-instance-enabled: true (the same)

And I pass the following arguments via the DSL (in your example you didn't use any arguments in the DSL): --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=calculate-vm-click-statistic --runner.mode=EXEC

A small example:

calculate-vm-click-statistic: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=calculate-vm-click-statistic --runner.mode=EXEC
&& <average-genre-statistic-calculation-online-vm: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=average-genre-statistic-calculation-online-vm --runner.mode=EXEC
 || average-click-statistic-calculation-online-web: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=average-click-statistic-calculation-online-web --runner.mode=EXEC
 || average-click-statistic-calculation-online-vm: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=average-click-statistic-calculation-online-vm --runner.mode=EXEC
 || average-genre-statistic-calculation-online-web: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=average-genre-statistic-calculation-online-web --runner.mode=EXEC>
&& average-genre-statistic-calculation-off: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=average-genre-statistic-calculation-off --runner.mode=EXEC
&& fusion-v2: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=fusion-v2 --runner.mode=EXEC
&& aggregation-transformation: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=aggregation-transformation --runner.mode=EXEC
&& export-infosys: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=export-infosys --runner.mode=EXEC
&& combine-infosys: multirating-baseoperation --runner.localDate=2018-12-08 --spring.cloud.consul.config.datakey=combine-infosys --runner.mode=EXEC

The main difference is in the executed tasks: I use my own custom task for all executions. It has the following bootstrap.yaml (I use Consul):

runner:
  localDate: # pass this argument via DSL
  mode: # pass this argument via DSL
spring:
  application:
    name: multi-rating-operations
  cloud:
    consul:
      config:
        watch:
          enabled: false
        enabled: true
        prefix: ""
        datakey: # pass this argument via DSL
        format: yaml
      host: 10.101.48.150
      port: 8500
      discovery:
        prefer-ip-address: true
        enabled: false
  jpa:
    properties:
      hibernate:
        jdbc:
          lob:
            non_contextual_creation: true
  datasource:
    url: jdbc:postgresql://192.168.21.70:5432/data_flow
    username: xxxxxxxx
    password: xxxxxxxx
    driver-class-name: org.postgresql.Driver
logging:
  level:
    org:
      springframework:
        cloud:
          task: debug
dataBusRest:
  dataSourceUrl: 10.101.48.150
  user: xxxxxxxx
  password: xxxxxxxx
  port: 10888

I could try to debug CTR myself. Could you share your methodology with me? Or do I just need to download the CTR sources and start it with the java -jar command, as you did?

cppwfs commented 5 years ago

Are you including --spring.cloud.task.closecontextEnabled=true in your parameters? That is required.

Rostish commented 5 years ago

I will try after the holidays in my country. I can't do it right now, because my code is available only from my workplace.

cppwfs commented 5 years ago

I was able to reproduce it, somewhat.
I used the same graph and tooling, except that in this case I used SCDF-Local to launch Docker images, as you discussed previously.
After running the CTR instance 50 times, one of the CTR executions appeared to stop. In this case one of the child apps failed to start because of the following error:
docker: Error response from daemon: driver failed programming external connectivity on endpoint stupefied_hypatia (fc8f22b557ad6dd9ea4c692792dab9e9259c0ae872cf02d1397409c99171f4d0): Error starting userland proxy: listen tcp 0.0.0.0:58386: bind: address already in use.
This error appeared in the stderr log of the child task. So CTR was waiting for the child application to start, which it never did, and thus CTR was effectively blocked.
The solution to this is to set max-wait-time, as discussed here: https://github.com/spring-cloud-task-app-starters/composed-task-runner/blob/master/spring-cloud-starter-task-composedtaskrunner/README.adoc
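To illustrate why an upper bound on the wait matters, here is a minimal sketch of a polling loop with and without a maximum wait. It is not CTR's actual implementation; the class and method names are hypothetical, and treating a value of 0 as "no limit" is an assumption made for this sketch only.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: poll a child task's status, optionally giving up after a maximum wait. */
class MaxWaitSketch {

    /**
     * Polls isDone every intervalMillis.
     * Returns true if the child completed, false if maxWaitMillis elapsed first.
     * Assumption for this sketch: maxWaitMillis == 0 means "wait forever".
     */
    static boolean waitForCompletion(BooleanSupplier isDone, long intervalMillis, long maxWaitMillis) {
        long start = System.nanoTime();
        while (!isDone.getAsBoolean()) {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            if (maxWaitMillis > 0 && elapsedMillis >= maxWaitMillis) {
                return false; // give up instead of blocking indefinitely
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A child that never starts: with a limit, the loop returns instead of hanging.
        boolean finished = waitForCompletion(() -> false, 100, 500);
        System.out.println("child finished: " + finished);
    }
}
```

With a limit of 0 in this sketch, the same call on a child that never starts would not return, which matches the blocked behavior described above.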

Rostish commented 5 years ago

I had to go to work to check this))). It seems the --spring.cloud.task.closecontextEnabled=true parameter helped me. I did about 60 launches and CTR never got stuck. Could you explain the meaning of this parameter?

About Docker: it looks like a different bug, because I use Docker only for the SCDF-local deployment, and then use the volume command to move custom tasks into the container folder.

cppwfs commented 5 years ago

I'm glad that this resolved the issue for you. A brief discussion of the parameter can be found here: https://docs.spring.io/spring-cloud-task/docs/current-SNAPSHOT/reference/htmlsingle/#features-lifecycle CTR uses ThreadPoolTaskExecutor to manage splits in the graph, and thus the context remains open beyond the scope of the task. This setting closes the context when CTR completes. As of the release of CTR 2.1, closeContextEnabled will be set by default. The other issue is not really a bug in SCDF or CTR.
I will go ahead and close this issue.
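The effect of a thread pool outliving the work it ran can be sketched with plain java.util.concurrent, which stands in here for Spring's ThreadPoolTaskExecutor. This is an illustrative sketch under that assumption, not CTR code; the class name, method name, and the closeOnCompletion flag are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: idle non-daemon pool threads keep a JVM alive until the pool is shut down. */
class LingeringPoolSketch {

    /** Runs `branches` trivial tasks in parallel, then optionally shuts the pool down. */
    static ExecutorService runSplit(int branches, boolean closeOnCompletion) {
        // The default thread factory creates non-daemon threads.
        ExecutorService pool = Executors.newFixedThreadPool(branches);
        CountDownLatch done = new CountDownLatch(branches);
        for (int i = 0; i < branches; i++) {
            pool.submit(done::countDown);
        }
        try {
            done.await(); // every branch has run
            if (closeOnCompletion) {
                pool.shutdown(); // analogous in spirit to closing the context
                pool.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return pool;
    }

    public static void main(String[] args) {
        ExecutorService closed = runSplit(4, true);
        System.out.println("closed pool terminated: " + closed.isTerminated());

        ExecutorService open = runSplit(4, false);
        System.out.println("open pool terminated: " + open.isTerminated());
        // Without this call, the idle worker threads would keep the process running.
        open.shutdown();
    }
}
```

All work finishes in both cases; only the pool that is explicitly shut down lets the process exit on its own.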