mlcommons / ck

Collective Mind (CM) is a small, modular, cross-platform and decentralized workflow automation framework with a human-friendly interface and reusable automation recipes to make it easier to build, run, benchmark and optimize AI, ML and other applications and systems across diverse and continuously changing models, data, software and hardware.
https://cKnowledge.org/install-cm-mlops
Apache License 2.0

Rclone unable to access remote directory #1134

Closed: willamloo3192 closed this issue 2 weeks ago

willamloo3192 commented 7 months ago

Command: rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P

Error Message:

2024-02-28 13:54:17 ERROR : : error reading source directory: directory not found
2024-02-28 13:54:17 ERROR : Attempt 1/3 failed with 1 errors and: directory not found
2024-02-28 13:54:18 ERROR : : error reading source directory: directory not found
2024-02-28 13:54:18 ERROR : Attempt 2/3 failed with 1 errors and: directory not found
2024-02-28 13:54:18 ERROR : : error reading source directory: directory not found
2024-02-28 13:54:18 ERROR : Attempt 3/3 failed with 1 errors and: directory not found
Transferred:        0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors:             1 (retrying may help)
Elapsed time:       1.8s
2024/02/28 13:54:18 Failed to copy: directory not found

gfursin commented 7 months ago

Thank you for reporting @willamloo3192 . There are multiple issues with the MLCommons cloud at the moment. I believe we had an alternative way to download models. Let me sync with @arjunsuresh today.

gfursin commented 7 months ago

Hi again @willamloo3192 - actually you can't use the rclone command like that, because you need an rclone config that maps "mlc-inference" to a URL in the MLCommons cloud. CM generates such a config on the fly, but we still need to fix the previous problem. I hope to provide fixes today ... Once again, thank you for reporting!

Related: https://github.com/mlcommons/ck/issues/1136

gfursin commented 7 months ago

For rclone to work without CM, you first need to run the command from this script to set up the rclone keys: https://github.com/mlcommons/ck/blob/master/cm-mlops/script/get-ml-model-stable-diffusion/_cm.json#L159
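
That command is roughly of the following form (the keys and endpoint are the public read-only values that also appear later in this thread):

rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com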

willamloo3192 commented 7 months ago

Hi @gfursin

I tried this before, but I am still unable to make it work.

user@host:~$ rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
Remote config

[mlc-inference]
provider=Cloudflare = access_key_id=f65ba5eef400db161ea49967de89f47b
secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b = endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

user@host:~$ rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P
2024-03-04 10:41:48 ERROR : : error reading source directory: directory not found
2024-03-04 10:41:48 ERROR : Attempt 1/3 failed with 1 errors and: directory not found
2024-03-04 10:41:48 ERROR : : error reading source directory: directory not found
2024-03-04 10:41:48 ERROR : Attempt 2/3 failed with 1 errors and: directory not found
2024-03-04 10:41:48 ERROR : : error reading source directory: directory not found
2024-03-04 10:41:48 ERROR : Attempt 3/3 failed with 1 errors and: directory not found
Transferred:        0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors:             1 (retrying may help)
Elapsed time:       2.4s
2024/03/04 10:41:48 Failed to copy: directory not found

gfursin commented 7 months ago

I have a feeling that it is still related to your use of a PROXY, which rclone may not support (or may need explicit flags for). I am checking with @arjunsuresh, and we will see if we can either emulate an environment that accesses the internet through a proxy or provide a few possible solutions.

It seems that each tool handles proxies differently. I see in the rclone docs that the following variable may need to be set to use a proxy:

set HTTP_PROXY=...

Do you always have some environment variables set to point to the proxy server? Is it a full URL with a port? Is it different for HTTP and HTTPS? I am just trying to see how we can add something like --proxy=yes in CM and then internally map some of your environment variables to the flags of the tools wrapped by CM ...
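
For example, something like this before invoking rclone might already work (the proxy URL below is only a placeholder for whatever your IT department provides):

# rclone is written in Go, and Go's HTTP stack honors these standard variables,
# so no rclone-specific proxy flag should be needed
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P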

gfursin commented 7 months ago

@arjunsuresh - let's sync there too ...

willamloo3192 commented 7 months ago

> I have a feeling that it is still related to your use of a PROXY, which rclone may not support (or may need explicit flags for). [...] It seems that each tool handles proxies differently. I see in the rclone docs that the following variable may need to be set to use a proxy:
>
> set HTTP_PROXY=...
>
> Do you always have some environment variables set to point to the proxy server? Is it a full URL with a port? Is it different for HTTP and HTTPS?

To answer this: we have set the proxy variables in both uppercase and lowercase forms. The outcome is still the same.

arjunsuresh commented 7 months ago

I believe the config should be as follows:

$ cat ~/.config/rclone/rclone.conf
[mlc-inference]
type = s3
provider = Cloudflare
access_key_id = f65ba5eef400db161ea49967de89f47b
secret_access_key = fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b
endpoint = https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

willamloo3192 commented 7 months ago

> I believe the config should be as follows:
>
> $ cat ~/.config/rclone/rclone.conf
> [mlc-inference]
> type = s3
> provider = Cloudflare
> access_key_id = f65ba5eef400db161ea49967de89f47b
> secret_access_key = fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b
> endpoint = https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

I had a quick check of my config file and found some divergence from what you provided:

[mlc-inference]
type = s3
provider=Cloudflare = access_key_id=f65ba5eef400db161ea49967de89f47b
secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b = endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

After applying your suggested config, I now get the following error:

2024/03/05 08:24:09 ERROR : S3 bucket mlcommons-inference-wg-public path stable_diffusion_fp16: error reading source root directory: RequestError: send request failed
caused by: Get "https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com/mlcommons-inference-wg-public?delimiter=%2F&encoding-type=url&list-type=2&max-keys=1000&prefix=stable_diffusion_fp16%2F": tls: failed to verify certificate: x509: certificate signed by unknown authority

arjunsuresh commented 7 months ago

Does adding the --ftp-no-check-certificate option help?

willamloo3192 commented 7 months ago

> --ftp-no-check-certificate

It doesn't help with this command:

rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P --ftp-no-check-certificate

willamloo3192 commented 7 months ago

@arjunsuresh I tried the --no-check-certificate flag and it seems to work, but I would like your assistance to verify whether the file size is correct.

user@host:~/CM/repos/local/cache/24889d8c0a934aec/inference$ rclone --no-check-certificate copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp16 ./stable_diffusion_fp16 -P
Transferred:       35.474k / 35.474 kBytes, 100%, 36.082 kBytes/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed time:         3.6s

arjunsuresh commented 7 months ago

Unfortunately, I don't think so, as the files are supposed to total several GB. I believe this is the proxy issue. Does this link help?
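
Note that --ftp-no-check-certificate only applies to the FTP backend, which is why it had no effect on this S3 remote; the global flag is --no-check-certificate. If your proxy intercepts and re-signs TLS traffic (which would explain the x509 error), pointing rclone at your corporate CA bundle may be a cleaner fix than disabling verification altogether; a sketch, where the certificate path is only a placeholder:

# --ca-cert is a global rclone flag for trusting an extra CA when verifying servers
rclone copy mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32 -P --ca-cert /path/to/corporate-ca.pem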

willamloo3192 commented 7 months ago

> Unfortunately, I don't think so, as the files are supposed to total several GB. I believe this is the proxy issue. Does this link help?

It doesn't help; I already have HTTP_PROXY and HTTPS_PROXY set as environment variables. I might need your assistance.

willamloo3192 commented 7 months ago

Anyhow, using the latest CM repo, I'm still unable to download the model.

Command: cm run script --tags=get,ml-model,sdxl,_fp32,_rclone -j

rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
[mlc-inference]
type = s3
provider = Cloudflare
access_key_id = f65ba5eef400db161ea49967de89f47b
secret_access_key = fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b
endpoint = https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com

rclone sync mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 /home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32 -P
Transferred:       35.474 KiB / 35.474 KiB, 100%, 0 B/s, ETA -
Transferred:            1 / 1, 100%
Elapsed time:         1.7s
           ! call "postprocess" from /home/user/CM/repos/mlcommons@ck/cm-mlops/script/download-file/customize.py
         ! call "postprocess" from /home/user/CM/repos/mlcommons@ck/cm-mlops/script/download-and-extract/customize.py
       ! call "postprocess" from /home/user/CM/repos/mlcommons@ck/cm-mlops/script/get-ml-model-stable-diffusion/customize.py

{
  "return": 0,
  "env": {
    "CM_ML_MODEL_DATASET": "openorca",
    "CM_ML_MODEL_WEIGHT_TRANSFORMATIONS": "no",
    "CM_ML_MODEL_INPUT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_PRECISION": "fp32",
    "CM_ML_MODEL_WEIGHT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_STARTING_WEIGHTS_FILENAME": "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0",
    "CM_ML_MODEL_FRAMEWORK": "pytorch",
    "CM_ML_MODEL_PATH": "/home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32",
    "SDXL_CHECKPOINT_PATH": "/home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32"
  },
  "new_env": {
    "CM_ML_MODEL_DATASET": "openorca",
    "CM_ML_MODEL_WEIGHT_TRANSFORMATIONS": "no",
    "CM_ML_MODEL_INPUT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_PRECISION": "fp32",
    "CM_ML_MODEL_WEIGHT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_STARTING_WEIGHTS_FILENAME": "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0",
    "CM_ML_MODEL_FRAMEWORK": "pytorch",
    "CM_ML_MODEL_PATH": "/home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32",
    "SDXL_CHECKPOINT_PATH": "/home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32"
  },
  "state": {},
  "new_state": {},
  "deps": [
    "download-and-extract,_rclone,_url.mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32"
  ]
}

Stable diffusion checkpoint path: /home/user/CM/repos/local/cache/cf5c9a6a4e824118/stable_diffusion_fp32

The model on huggingface.co seems to have a different filename. I'm not sure whether my suspicion is correct. (screenshot attached)
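
Maybe listing the remote contents would show which filenames are expected, e.g.:

# list all files in the remote folder with their sizes
rclone --no-check-certificate ls mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32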

arjunsuresh commented 7 months ago

> Transferred:       35.474 KiB / 35.474 KiB, 100%, 0 B/s, ETA -
> Transferred:            1 / 1, 100%

This means nothing really got downloaded. Since the rclone download is not working with your proxy outside of CM, it won't work via CM either. But we do see people using rclone behind a proxy without any special settings in some MLPerf submissions, so I'm not sure what the issue is at your end.
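
One way to confirm what actually got fetched (a sketch, assuming the same remote name configured above):

# total size of the remote folder; the fp32 checkpoint should be about 13 GB
rclone size mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32

# verify the local copy against the remote
rclone check mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 ./stable_diffusion_fp32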

willamloo3192 commented 7 months ago

From my side, we set the proxy via the HTTP_PROXY and HTTPS_PROXY environment variables and in the apt config file; with that we are able to download files via wget with the --no-check-certificate flag and install packages via apt install xxx.

For rclone, I'm kind of out of ideas.

arjunsuresh commented 7 months ago

Unfortunately, we are also not entirely sure, as we just wrap the rclone command. We don't have an environment similar to yours to test further, either.
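
One thing that may help diagnose it is a verbose listing (a sketch; -vv and --dump headers are rclone's standard debugging flags):

# list the top-level directories in the bucket with full HTTP header logging;
# this should show whether requests reach the Cloudflare endpoint at all
rclone -vv --dump headers lsd mlc-inference:mlcommons-inference-wg-public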

willamloo3192 commented 7 months ago

Would you mind rerunning the same command and sharing the CM output with me? Thanks.

arjunsuresh commented 7 months ago

It is still ongoing...

[cmuser@e761b48fa277 ~]$ cm run script --tags=get,ml-model,sdxl,_fp32,_rclone -j

* cm run script "get ml-model sdxl _fp32 _rclone"
=================================================
WARNINGS:

  Required disk space: 13000 MB
=================================================

  * cm run script "download-and-extract _rclone _url.mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32"

    * cm run script "download file _rclone _url.mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32"

      * cm run script "detect os"
             ! cd /home/cmuser/CM/repos/local/cache/194c9e164d68412b
             ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/detect-os/run.sh from tmp-run.sh
             ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/detect-os/customize.py

      * cm run script "get rclone"

        * cm run script "detect os"
               ! cd /home/cmuser/CM/repos/local/cache/8d72574a4a69426f
               ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/detect-os/run.sh from tmp-run.sh
               ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/detect-os/customize.py
        - Searching for versions:  == 1.65.2
                 ! cd /home/cmuser/CM/repos/local/cache/8d72574a4a69426f
                 ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-rclone/run.sh from tmp-run.sh
/home/cmuser/.local/bin:/home/cmuser/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/cmuser/.local/bin
rclone was not detected
      Downloading https://downloads.rclone.org/v1.65.2/rclone-v1.65.2-linux-amd64.zip
             ! cd /home/cmuser/CM/repos/local/cache/8d72574a4a69426f
             ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-rclone/install.sh from tmp-run.sh
--2024-03-06 03:04:18--  https://downloads.rclone.org/v1.65.2/rclone-v1.65.2-linux-amd64.zip
Resolving downloads.rclone.org (downloads.rclone.org)... 95.217.6.16, 2a01:4f9:c012:7154::1
Connecting to downloads.rclone.org (downloads.rclone.org)|95.217.6.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20348123 (19M) [application/zip]
Saving to: ‘rclone-v1.65.2-linux-amd64.zip’

rclone-v1.65.2-linux-amd64.zip                             100%[=======================================================================================================================================>]  19.41M  3.70MB/s    in 6.1s

2024-03-06 03:04:26 (3.17 MB/s) - ‘rclone-v1.65.2-linux-amd64.zip’ saved [20348123/20348123]

Archive:  rclone-v1.65.2-linux-amd64.zip
   creating: rclone-v1.65.2-linux-amd64/
  inflating: rclone-v1.65.2-linux-amd64/rclone.1
  inflating: rclone-v1.65.2-linux-amd64/README.txt
  inflating: rclone-v1.65.2-linux-amd64/README.html
  inflating: rclone-v1.65.2-linux-amd64/git-log.txt
  inflating: rclone-v1.65.2-linux-amd64/rclone
             ! cd /home/cmuser/CM/repos/local/cache/8d72574a4a69426f
             ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-rclone/run.sh from tmp-run.sh
/home/cmuser/CM/repos/local/cache/8d72574a4a69426f:/home/cmuser/.local/bin:/home/cmuser/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/cmuser/.local/bin
             ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-rclone/customize.py
          Detected version: 1.65.2

Downloading from mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32
           ! cd /home/cmuser/CM/repos/local/cache/194c9e164d68412b
           ! call /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/download-file/run.sh from tmp-run.sh

rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
2024/03/06 03:04:27 NOTICE: Config file "/home/cmuser/.config/rclone/rclone.conf" not found - using defaults
[mlc-inference]
type = s3
endpoint = https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
provider = Cloudflare
access_key_id = f65ba5eef400db161ea49967de89f47b
secret_access_key = fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b

rclone sync mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 /home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32 -P
Transferred:        1.708 GiB / 12.926 GiB, 13%, 72.797 MiB/s, ETA 2m37s
Transferred:           17 / 19, 89%
Elapsed time:        25.2s
Transferring:
 * checkpoint_pipe/unet/d…orch_model.safetensors:  6% /9.565Gi, 29.152Mi/s, 5m15s
 * checkpoint_pipe/text_e…er_2/model.safetensors: 13% /2.588Gi, 21.794Mi/s, 1m45s

willamloo3192 commented 7 months ago

Awesome! Would you mind sharing your environment variables with me? I just want to make an apples-to-apples comparison.

arjunsuresh commented 7 months ago

I'm not using any proxy. It is actually a clean Docker container running RHEL 8.
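
If you want to reproduce a similar container on your side while keeping your proxy, the variables can be forwarded into it; a sketch, where the image name and proxy URL are only placeholders:

docker run -it \
  -e HTTP_PROXY=http://proxy.example.com:8080 \
  -e HTTPS_PROXY=http://proxy.example.com:8080 \
  registry.access.redhat.com/ubi8/ubi bash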

arjunsuresh commented 7 months ago

The final output:

rclone sync mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32 /home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32 -P
Transferred:       12.926 GiB / 12.926 GiB, 100%, 530.534 KiB/s, ETA 0s
Transferred:           19 / 19, 100%
Elapsed time:     11m33.4s
           ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/download-file/customize.py
         ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/download-and-extract/customize.py
       ! call "postprocess" from /home/cmuser/CM/repos/ctuning@mlcommons-ck/cm-mlops/script/get-ml-model-stable-diffusion/customize.py

{
  "return": 0,
  "env": {
    "CM_ML_MODEL_DATASET": "openorca",
    "CM_ML_MODEL_WEIGHT_TRANSFORMATIONS": "no",
    "CM_ML_MODEL_INPUT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_PRECISION": "fp32",
    "CM_ML_MODEL_WEIGHT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_STARTING_WEIGHTS_FILENAME": "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0",
    "CM_ML_MODEL_FRAMEWORK": "pytorch",
    "CM_ML_MODEL_PATH": "/home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32",
    "SDXL_CHECKPOINT_PATH": "/home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32"
  },
  "new_env": {
    "CM_ML_MODEL_DATASET": "openorca",
    "CM_ML_MODEL_WEIGHT_TRANSFORMATIONS": "no",
    "CM_ML_MODEL_INPUT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_PRECISION": "fp32",
    "CM_ML_MODEL_WEIGHT_DATA_TYPES": "fp32",
    "CM_ML_MODEL_STARTING_WEIGHTS_FILENAME": "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0",
    "CM_ML_MODEL_FRAMEWORK": "pytorch",
    "CM_ML_MODEL_PATH": "/home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32",
    "SDXL_CHECKPOINT_PATH": "/home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32"
  },
  "state": {},
  "new_state": {},
  "deps": [
    "download-and-extract,_rclone,_url.mlc-inference:mlcommons-inference-wg-public/stable_diffusion_fp32"
  ]
}

Stable diffusion checkpoint path: /home/cmuser/CM/repos/local/cache/194c9e164d68412b/stable_diffusion_fp32

willamloo3192 commented 7 months ago

I see. Okay, I will have to consult my company's IT department about how to unblock it.

arjunsuresh commented 2 weeks ago

Closing this issue for now. Please reopen if required.