kube-HPC / hkube

🐟 High Performance Computing over Kubernetes - Core Repo 🎣
http://hkube.io
MIT License

pipeline with batch stuck with node "storing" #902

Closed tamir321 closed 4 years ago

tamir321 commented 4 years ago

HKube micro-service: To which micro-service is the bug related (Dashboard, Monitor-Server, etc.)?

Describe the bug: When executing multiple pipelines that include batches, some of them get stuck because one of the nodes remains in status "storing".

Execute the following pipeline 4 times:

{
    "name": "TooManyFiles",
    "nodes": [
        {
            "nodeName": "node2",
            "algorithmName": "python-rand1",
            "input": [
                "#[0...600]"
            ]
        },
        {
            "nodeName": "node3",
            "algorithmName": "python-rand2",
            "input": [
                "#[0...1000]",
                "@node2"
            ]
        },
        {
            "nodeName": "node4",
            "algorithmName": "python-rand3",
            "input": [
                "#[0...1000]"
            ]
        },
        {
            "nodeName": "node11",
            "algorithmName": "python-rand4",
            "input": [
                12,
                17,
                "@node2",
                "@node4",
                "@node2",
                "@node3",
                "@node4",
                "@node2",
                "@node3"
            ]
        }
    ],
    "experimentName": "main",
    "options": {
        "ttl": 3600,
        "batchTolerance": 80,
        "progressVerbosityLevel": "info"
    },
    "priority": 3
}
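To get a feel for the load this reproduction creates, here is a rough Python sketch that estimates the number of tasks per run. It assumes that "#[a...b]" is a half-open batch range (b - a tasks) and that a node without a batch range runs as a single task; neither is stated in this issue, although the last node2 task reported below (input 599, batchIndex 600) is consistent with the first assumption. The helper name `batch_size` is made up for illustration and is not part of HKube.

```python
import re

# Matches a batch range such as "#[0...600]".
BATCH_RANGE = re.compile(r"^#\[(\d+)\.\.\.(\d+)\]$")


def batch_size(node_input):
    """Estimate how many tasks one node spawns.

    Assumption: "#[a...b]" is a half-open range (b - a tasks), and inputs
    without a batch range yield a single task.
    """
    for item in node_input:
        if isinstance(item, str):
            match = BATCH_RANGE.match(item)
            if match:
                start, end = int(match.group(1)), int(match.group(2))
                return end - start
    return 1


# Node inputs copied from the TooManyFiles pipeline above.
node_inputs = {
    "node2": ["#[0...600]"],
    "node3": ["#[0...1000]", "@node2"],
    "node4": ["#[0...1000]"],
    "node11": [12, 17, "@node2", "@node4", "@node2", "@node3", "@node4", "@node2", "@node3"],
}

tasks_per_node = {name: batch_size(inp) for name, inp in node_inputs.items()}
tasks_per_run = sum(tasks_per_node.values())
tasks_for_four_runs = 4 * tasks_per_run

print(tasks_per_node)
print(tasks_per_run, tasks_for_four_runs)
```

Under these assumptions, four runs produce on the order of ten thousand tasks, each of which stores its own output, which may be why the pipeline is named TooManyFiles.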

On the graph, the last task of node2 is:


{
    taskId: "bold831m",
    input: [
        599
    ],
    output: {
        storageInfo: {
            path: "test-hkube/main:TooManyFiles:3bqrc9qe/bold831m",
            size: 11
        },
        metadata: {
            node2: {
                type: "int"
            }
        },
        discovery: {
            host: "10.233.68.192",
            port: "9020"
        },
        taskId: "bold831m"
    },
    podName: "python-rand1-eguzwxbqwhqkt0jq5vgoy9ps01g0xw-r4hn6",
    status: "storing",
    batchIndex: 600,
    startTime: 1597127353139
}
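When a node has hundreds of batch tasks, spotting the one that never leaves "storing" by eye is tedious. The following is a minimal Python sketch of how such tasks could be filtered out of a list of task records shaped like the one above. The function `find_stuck_tasks`, the five-minute threshold, the second record, and its "succeed" status value are all hypothetical; only the first record's fields come from this issue, and `startTime` is assumed to be a millisecond epoch timestamp.

```python
def find_stuck_tasks(tasks, now_ms, max_age_ms, status="storing"):
    """Return tasks still in `status` more than `max_age_ms` after their startTime."""
    return [
        task
        for task in tasks
        if task.get("status") == status
        and now_ms - task.get("startTime", now_ms) > max_age_ms
    ]


tasks = [
    # Fields taken from the task reported in this issue.
    {"taskId": "bold831m", "status": "storing", "batchIndex": 600, "startTime": 1597127353139},
    # Hypothetical finished task, added only for contrast.
    {"taskId": "example-ok", "status": "succeed", "batchIndex": 599, "startTime": 1597127353000},
]

# Pretend we inspect the graph ten minutes after the task started.
now_ms = 1597127353139 + 10 * 60 * 1000
stuck = find_stuck_tasks(tasks, now_ms, max_age_ms=5 * 60 * 1000)

for task in stuck:
    print(task["taskId"], task["status"], task["batchIndex"])
```

A task that is missing `startTime` is treated as not stuck, since its age cannot be determined.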

yehiyam commented 4 years ago

@tamir321 should be much better than before

tamir321 commented 4 years ago

tested on version systemVersion: "v1.3.107", fullSystemVersion: "v1.3.107-1598349651396", versions: [