pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Can two workflows share the same model with each other? #1103

Open yurkoff-mv opened 3 years ago

yurkoff-mv commented 3 years ago

Continuing my previous post: How i do models chain processing and batch processing for analyzing text data?

Can I create two workflows using the same RoBERTa base model to perform two different tasks, let's say the classifier_model and summarizer_model? I would like to be able to share the base model with two workflows.

I am trying to register two workflows: wf_classifier.war and wf_summarizer.war. The first one is registered and the second one is not.

log.log

wf_classifier.war

models:
    min-workers: 1
    max-workers: 1
    batch-size: 1
    max-batch-delay: 1000
    retry-attempts: 5
    timeout-ms: 300000

    roberta:
      url: roberta_base.mar

    classifier:
      url: classifier.mar

dag:
  roberta: [classifier]

wf_summarizer.war

models:
    min-workers: 1
    max-workers: 1
    batch-size: 1
    max-batch-delay: 1000
    retry-attempts: 5
    timeout-ms: 300000

    roberta:
      url: roberta_base.mar

    summarizer:
      url: summarizer.mar

dag:
  roberta_base: [summarizer]
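
One thing that can be checked mechanically in the two specs above is whether every name used under `dag` is also defined under `models`. Below is a minimal Python sketch of that check, with both specs transcribed by hand into dictionaries. The function name `undefined_dag_nodes` is hypothetical, and the sketch assumes every DAG node is a model node; TorchServe workflows can also use function nodes defined in a workflow handler file, which this check ignores. Whether the mismatch it reports is what blocks the second registration is not confirmed in this thread.

```python
# Workflow-wide defaults that sit under `models:` next to the model nodes.
# The values are copied from the two specs above.
DEFAULTS = {
    "min-workers": 1,
    "max-workers": 1,
    "batch-size": 1,
    "max-batch-delay": 1000,
    "retry-attempts": 5,
    "timeout-ms": 300000,
}


def undefined_dag_nodes(spec):
    """Return DAG node names that have no entry under `models` (hypothetical helper)."""
    model_nodes = set(spec["models"]) - set(DEFAULTS)
    dag_nodes = set(spec["dag"])
    for children in spec["dag"].values():
        dag_nodes.update(children)
    return sorted(dag_nodes - model_nodes)


# Hand transcription of wf_classifier.
wf_classifier = {
    "models": {
        **DEFAULTS,
        "roberta": {"url": "roberta_base.mar"},
        "classifier": {"url": "classifier.mar"},
    },
    "dag": {"roberta": ["classifier"]},
}

# Hand transcription of wf_summarizer.
wf_summarizer = {
    "models": {
        **DEFAULTS,
        "roberta": {"url": "roberta_base.mar"},
        "summarizer": {"url": "summarizer.mar"},
    },
    "dag": {"roberta_base": ["summarizer"]},
}

print(undefined_dag_nodes(wf_classifier))
print(undefined_dag_nodes(wf_summarizer))
```

For wf_classifier the list is empty. For wf_summarizer it contains `roberta_base`, because the `models` section names that node `roberta`.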

HamidShojanazeri commented 3 years ago

@yurkoff-mv I believe this can be handled in a single DAG: put roberta-base in the preprocessing node and send its embeddings to both the classifier and the summarizer. @maaquib, would that be a potential solution?
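
As a rough illustration of the single-DAG idea, here is a Python sketch of what such a spec might contain, written as a dictionary that mirrors the YAML layout used earlier in the thread. The node names and `.mar` files are taken from the original post; the helper names `archive_reference_counts` and `leaf_nodes` are hypothetical. This only describes the shape of the graph. Whether TorchServe accepts a DAG with two leaf nodes, or needs a final node that aggregates both outputs, is not established here.

```python
from collections import Counter

# One workflow in which the shared base model fans out to both heads.
merged_spec = {
    "models": {
        "roberta": {"url": "roberta_base.mar"},
        "classifier": {"url": "classifier.mar"},
        "summarizer": {"url": "summarizer.mar"},
    },
    "dag": {"roberta": ["classifier", "summarizer"]},
}


def archive_reference_counts(spec):
    """Count how many model nodes point at each .mar archive."""
    return Counter(node["url"] for node in spec["models"].values())


def leaf_nodes(dag):
    """Return nodes that receive input but have no outgoing edges."""
    children = {child for targets in dag.values() for child in targets}
    return sorted(children - set(dag))


print(archive_reference_counts(merged_spec)["roberta_base.mar"])
print(leaf_nodes(merged_spec["dag"]))
```

In this layout `roberta_base.mar` is referenced by one node only, and the graph ends in two leaves, `classifier` and `summarizer`.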

maaquib commented 3 years ago

@HamidShojanazeri @yurkoff-mv I was able to register two workflows built on the NMT transformer models at the same time and run inference on both. They share the same model .mar file, but it is registered as a separate node for each workflow.

@yurkoff-mv Are you getting any exceptions when registering the second workflow?

> ll model_store/
total 8231752
drwxrwxr-x 2 ubuntu ubuntu       4096 Jul  9 18:01 ./
drwxrwxr-x 6 ubuntu ubuntu       4096 Jul  9 18:01 ../
-rw-rw-r-- 1 ubuntu ubuntu 2992637547 Jul  9 17:37 TransformerDe2En.mar
-rw-rw-r-- 1 ubuntu ubuntu 2989110679 Jul  9 17:42 TransformerEn2De.mar
-rw-rw-r-- 1 ubuntu ubuntu 2447539623 Jul  9 17:53 TransformerEn2Fr.mar

> curl http://127.0.0.1:8081/workflows
{
  "workflows": [
    {
      "workflowName": "nmt_wf_dual",
      "workflowUrl": "nmt_wf_dual.war"
    },
    {
      "workflowName": "nmt_wf_re",
      "workflowUrl": "nmt_wf_re.war"
    }
  ]
}

> curl "http://127.0.0.1:8081/models"
{
  "models": [
    {
      "modelName": "nmt_wf_dual__nmt_en_de",
      "modelUrl": "TransformerEn2De.mar"
    },
    {
      "modelName": "nmt_wf_dual__nmt_en_fr",
      "modelUrl": "TransformerEn2Fr.mar"
    },
    {
      "modelName": "nmt_wf_re__nmt_de_en",
      "modelUrl": "TransformerDe2En.mar"
    },
    {
      "modelName": "nmt_wf_re__nmt_en_de",
      "modelUrl": "TransformerEn2De.mar"
    }
  ]
}
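
The listing above suggests that each registered model name has the form `<workflowName>__<nodeName>`; that pattern is inferred from this output only, not from documentation. Under that assumption, the following Python sketch groups the listed models by archive to show which `.mar` files are registered more than once. The helper names are hypothetical, and the `models` list is copied by hand from the response above.

```python
from collections import defaultdict

# Entries copied from the /models response above.
models = [
    {"modelName": "nmt_wf_dual__nmt_en_de", "modelUrl": "TransformerEn2De.mar"},
    {"modelName": "nmt_wf_dual__nmt_en_fr", "modelUrl": "TransformerEn2Fr.mar"},
    {"modelName": "nmt_wf_re__nmt_de_en", "modelUrl": "TransformerDe2En.mar"},
    {"modelName": "nmt_wf_re__nmt_en_de", "modelUrl": "TransformerEn2De.mar"},
]


def split_model_name(model_name):
    """Split 'workflow__node' into (workflow, node); the pattern is an assumption."""
    workflow, node = model_name.split("__", 1)
    return workflow, node


def shared_archives(entries):
    """Map each archive registered more than once to its model names."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["modelUrl"]].append(entry["modelName"])
    return {url: names for url, names in grouped.items() if len(names) > 1}


print(split_model_name("nmt_wf_dual__nmt_en_de"))
print(shared_archives(models))
```

Only `TransformerEn2De.mar` shows up twice, once for each workflow that uses it.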

HamidShojanazeri commented 3 years ago

@yurkoff-mv It would also be good to know whether you have had a chance to try this in one DAG, using roberta as a feature extractor in the preprocessing node and sending its output to the classifier and the summarizer. If you could share the models and steps, we could give it a try as well.

yurkoff-mv commented 3 years ago

@maaquib, that's right, but you got two models instead of one. Now imagine that the model is large: will there be enough GPU resources to serve two identical models?

In this case, I would like to get the following result:

> curl "http://127.0.0.1:8081/models"
{
  "models": [
    {
      "modelName": "nmt_en_de",
      "modelUrl": "TransformerEn2De.mar"
    },
    {
      "modelName": "nmt_wf_dual__nmt_en_fr",
      "modelUrl": "TransformerEn2Fr.mar"
    },
    {
      "modelName": "nmt_wf_re__nmt_de_en",
      "modelUrl": "TransformerDe2En.mar"
    }
  ]
}

yurkoff-mv commented 3 years ago

@HamidShojanazeri, I do not know how to do that. How do I send the data to both the classifier and the summarizer in one DAG? That is exactly part of my question.

samils7 commented 2 years ago

I also need to do that. Quoting yurkoff-mv's comment above:

@maaquib, that's right, but you got two models instead of one. Now imagine that the model is large: will there be enough GPU resources to serve two identical models?

In this case, I would like to get the following result:

> curl "http://127.0.0.1:8081/models"
{
  "models": [
    {
      "modelName": "nmt_en_de",
      "modelUrl": "TransformerEn2De.mar"
    },
    {
      "modelName": "nmt_wf_dual__nmt_en_fr",
      "modelUrl": "TransformerEn2Fr.mar"
    },
    {
      "modelName": "nmt_wf_re__nmt_de_en",
      "modelUrl": "TransformerDe2En.mar"
    }
  ]
}