microsoft / hummingbird

Hummingbird compiles trained ML models into tensor computation for faster inference.
MIT License

Update apache-tvm to v0.15.0 #709

Closed mshr-h closed 4 months ago

mshr-h commented 1 year ago

TVM doesn't provide stable binary packages for Python 3.11, so we'll build it from source in the CI pipeline. We'll also update TVM to the latest stable release. Related to #643

TODO

mshr-h commented 1 year ago

TVM v0.12.0's PyTorch frontend fails to import HB's PyTorch model in v1.12.0, but it works fine with v0.11.1.
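
For reference, a rough repro sketch of the import path that regressed. The extra-trees regressor, the input name "input_0", and the `.model` attribute access are illustrative assumptions; this is roughly the wiring Hummingbird's TVM backend performs internally.

```python
import numpy as np
import torch
from sklearn.ensemble import ExtraTreesRegressor
from tvm import relay

from hummingbird.ml import convert

# Train a small model and convert it to a PyTorch model with Hummingbird.
X = np.random.rand(200, 20).astype(np.float32)
y = np.random.rand(200)
skl_model = ExtraTreesRegressor(n_estimators=10).fit(X, y)
hb_container = convert(skl_model, "pytorch")

# Trace the generated torch module and hand it to TVM's PyTorch frontend.
# The from_pytorch() call is the step that started failing with TVM v0.12.0.
traced = torch.jit.trace(hb_container.model, torch.from_numpy(X)).eval()
mod, params = relay.frontend.from_pytorch(traced, [("input_0", X.shape)])
```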

mshr-h commented 1 year ago

I did a git bisect and found that the commit below introduced the problem: [Relay][Frontend] Span Filling PyTorch (#14050) · apache/tvm@e9cf04e

mshr-h commented 7 months ago

Seems like there's a bug in the PyTorch frontend. Opened a new issue: https://github.com/apache/tvm/issues/16150.

codecov-commenter commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (1aed136) 90.11% compared to head (c51e005) 87.92%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #709      +/-   ##
==========================================
- Coverage   90.11%   87.92%   -2.20%
==========================================
  Files          80       80
  Lines        4685     4687       +2
  Branches      857      857
==========================================
- Hits         4222     4121     -101
- Misses        266      368     +102
- Partials      197      198       +1
```

| [Flag](https://app.codecov.io/gh/microsoft/hummingbird/pull/709/flags?src=pr&el=flags) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/microsoft/hummingbird/pull/709/flags?src=pr&el=flag) | `87.92% <100.00%> (-2.20%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.


ksaur commented 6 months ago

Was that last failure transient, or did something new break unrelated to TVM? Please let me know if you are blocked on something and I can try to help!

mshr-h commented 6 months ago

The CI test has passed but I don't know why...

mshr-h commented 6 months ago

We'll wait for TVM v0.15.0, which is going to be released on 22 Jan 2024: https://github.com/apache/tvm/issues/16277. After the release, we'll run the CI again.

mshr-h commented 5 months ago

CI is failing again. When TVM was installed from source, it was OK, but when it was restored from the cache, the test_extra_trees_tvm_tree_trav_regressor_converter test failed.

OK log: https://github.com/microsoft/hummingbird/actions/runs/7692350896/job/20959123441
Failed log: https://github.com/microsoft/hummingbird/actions/runs/7693930724/job/20963638667

mshr-h commented 5 months ago

Seems like tests are a bit flaky when TVM is installed from the cache.

Only failed on ubuntu+python3.9: https://github.com/microsoft/hummingbird/actions/runs/7710611530/job/21014292656
Only failed on ubuntu+python3.8: https://github.com/microsoft/hummingbird/actions/runs/7711820622/job/21017938809

@ksaur @interesaaat Do you have any ideas on this problem?

ksaur commented 5 months ago

> Seems like tests are a bit flaky when TVM is installed from the cache.
>
> Only failed on ubuntu+python3.9: https://github.com/microsoft/hummingbird/actions/runs/7710611530/job/21014292656
> Only failed on ubuntu+python3.8: https://github.com/microsoft/hummingbird/actions/runs/7711820622/job/21017938809
>
> @ksaur @interesaaat Do you have any ideas on this problem?

Hmmm, `: line 1: 2222 Aborted (core dumped) pytest` — since it's transient/"a bit flaky", I wonder if it is a memory issue. Have you ever seen it on your local machine, or could you get the core dump? And is it only when installed from the cache?

interesaaat commented 5 months ago

> Seems like tests are a bit flaky when TVM is installed from the cache.
>
> Only failed on ubuntu+python3.9: https://github.com/microsoft/hummingbird/actions/runs/7710611530/job/21014292656
> Only failed on ubuntu+python3.8: https://github.com/microsoft/hummingbird/actions/runs/7711820622/job/21017938809
>
> @ksaur @interesaaat Do you have any ideas on this problem?

My 2c is that 30 is, for some reason, too much for `TVM_MAX_FUSE_DEPTH`. Can we try 10 and see if it passes?

mshr-h commented 5 months ago

Thank you for your comments.

@ksaur

> Have you ever seen it on your local machine, or could you get the core dump?

No.

> And is it only when installed from the cache?

It happens both when installed from source and when restored from the cache.

@interesaaat I tried `TVM_MAX_FUSE_DEPTH: 10`, but it's still flaky :(

1st run (OK): https://github.com/microsoft/hummingbird/actions/runs/7722097282
2nd run (NG): https://github.com/microsoft/hummingbird/actions/runs/7723327148

Both runs use TVM from the cache. It always fails only on Ubuntu, never on macOS.

mshr-h commented 5 months ago

According to the official GitHub docs, the GitHub-hosted runners have 16 GB of RAM for Linux and 14 GB for macOS. https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

ksaur commented 5 months ago

> 1st run (OK): https://github.com/microsoft/hummingbird/actions/runs/7722097282
> 2nd run (NG): https://github.com/microsoft/hummingbird/actions/runs/7723327148
>
> Both runs use TVM from the cache. It always fails only on Ubuntu, never on macOS.

Hmm, from the NG run:

Current thread 0x00007f3036869b80 (most recent call first):
  Garbage-collecting
  File "/home/runner/work/hummingbird/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 215 in __del__
  File "/home/runner/work/hummingbird/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 119 in _make_tvm_args
  ...

I'm not exactly sure what it's up to, but it definitely looks like a memory issue!

> According to the official GitHub docs, the GitHub-hosted runners have 16 GB of RAM for Linux and 14 GB for macOS.

Let me see if we can do large runners....

ksaur commented 5 months ago

Ok, I looked into large runners. For some reason that I cannot understand, we do not have access to GitHub Enterprise with this repo, and therefore can't set up large runners. As a backup plan, we could try self-hosted runners in our Azure sub, but that would take a long time to set up and we'd need to figure out a budget.

In the meantime, can you please try something even smaller? Like dropping `TVM_MAX_FUSE_DEPTH` from 10 to 1 or something (not sure what a sensible value is)? Or smaller n_estimators, etc.? (Or just put 0 and it won't fuse at all; that should avoid the OOM for sure!)
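
A minimal sketch of that experiment as it could be run locally, assuming the tests pick up the same `TVM_MAX_FUSE_DEPTH` environment variable the CI workflow sets (the value and the `-k tvm` selection are illustrative):

```python
import os
import subprocess

# Disable relay op fusion entirely (the "just put 0" suggestion); any other
# value, e.g. "1", can be tried the same way by changing the string below.
env = dict(os.environ, TVM_MAX_FUSE_DEPTH="0")
subprocess.run(["pytest", "-x", "-k", "tvm"], env=env, check=False)
```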

mshr-h commented 5 months ago

I think we have to check the max memory usage during the test on the local machine.
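
One rough way to do that (a sketch, not the project's tooling): run the failing test as a child process and print its peak resident set size afterwards. The `-k` selection is illustrative.

```python
import resource
import subprocess

# Run only the test that fails in CI; RUSAGE_CHILDREN covers the finished child.
subprocess.run(
    ["pytest", "-x", "-k", "test_extra_trees_tvm_tree_trav_regressor_converter"],
    check=False,
)
peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss  # KiB on Linux
print(f"peak RSS of the test run: {peak_kib / 1024:.1f} MiB")
```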

ksaur commented 5 months ago

Thanks for pushing so hard on this!! :) I'm wondering if it's too risky to do something like a retry on those tests? But that could take a long time.

Maybe we could also mark those tests as ok-to-fail (just for the ubuntu TVM ones with the memory issues). Both of those ideas are a bit sloppy, but I'm not sure of a better way to unblock it.

mshr-h commented 5 months ago

pytest-memray provides a marker to limit memory usage on a specific test. Usage - pytest-memray

Can we do something like: if the test tries to allocate more than XX MB, stop the execution of that test and mark it as skipped because memory usage is too high?
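
A sketch of the marker being referred to, assuming the pytest-memray plugin is installed (the 500 MB budget and the test body are illustrative). One caveat: exceeding the limit fails the test rather than skipping it, so it isn't quite the skip-on-high-memory behavior described above.

```python
import pytest


@pytest.mark.limit_memory("500 MB")  # fail this test if it allocates more than 500 MB
def test_extra_trees_tvm_tree_trav_regressor_converter():
    ...  # placeholder: the real body lives in Hummingbird's test suite
```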

mshr-h commented 5 months ago

The memory issue is happening on the TVM side. I'll work on it for a few more days to see if there's any way to solve it.

mshr-h commented 5 months ago

> Thanks for pushing so hard on this!! :) I'm wondering if it's too risky to do something like a retry on those tests? But that could take a long time.
>
> Maybe we could also mark those tests as ok-to-fail (just for the ubuntu TVM ones with the memory issues). Both of those ideas are a bit sloppy, but I'm not sure of a better way to unblock it.

Or maybe we can temporarily disable all the TVM tests on Ubuntu. They're also tested on macOS.
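
A sketch of what that skip could look like, using the standard GITHUB_ACTIONS environment variable and a helper along the lines of the `is_on_github_actions` check mentioned later in the thread (the test name is a placeholder):

```python
import os
import sys

import pytest


def is_on_github_actions():
    # GitHub-hosted runners set GITHUB_ACTIONS=true.
    return os.environ.get("GITHUB_ACTIONS") == "true"


@pytest.mark.skipif(
    is_on_github_actions() and sys.platform == "linux",
    reason="TVM tests hit memory limits on Ubuntu GitHub-hosted runners; still covered on macOS.",
)
def test_extra_trees_tvm_tree_trav_regressor_converter():
    ...  # placeholder for one of the TVM-backed tests
```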

ksaur commented 5 months ago

> Thanks for pushing so hard on this!! :) I'm wondering if it's too risky to do something like a retry on those tests? But that could take a long time. Maybe we could also mark those tests as ok-to-fail (just for the ubuntu TVM ones with the memory issues). Both of those ideas are a bit sloppy, but I'm not sure of a better way to unblock it.
>
> Or maybe we can temporarily disable all the TVM tests on Ubuntu. They're also tested on macOS.

I'm fine with this solution (skipping TVM tests on Ubuntu)! (Unless there are OS-specific TVM issues?) What do you think, @interesaaat?

mshr-h commented 4 months ago

Note for the memory issue investigation

ksaur commented 4 months ago

Hi @mshr-h, thanks for all your work on this!! (Especially thank you for the nice summary of the memory issue.) Is the PR still a WIP? I like the `is_on_github_actions` :) Please let us know if you get stuck or want us to take a look!!

mshr-h commented 4 months ago

Should be good.

@ksaur @interesaaat Thank you for your support! Can you review it?

ksaur commented 4 months ago

This looks good to me!! @interesaaat take a look and merge if ready