Closed: mshr-h closed this 4 months ago
TVM v0.12.0's PyTorch frontend fails to import HB's PyTorch model in v1.12.0, but it works with v0.11.1.
I did a git bisect and found that the commit below introduced the problem: [Relay][Frontend] Span Filling PyTorch (#14050) · apache/tvm@e9cf04e
Seems like there's a bug in the PyTorch frontend. Opened a new issue: https://github.com/apache/tvm/issues/16150.
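For reference, a minimal sketch of the import path that regressed, assuming a small traced module stands in for the actual HB-generated model (TinyModel and the input shape are placeholders, not the real repro):

```python
import torch
from tvm import relay

# Placeholder module standing in for the Hummingbird-generated PyTorch model.
class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

example = torch.rand(1, 8)
traced = torch.jit.trace(TinyModel().eval(), example)

# This is the Relay PyTorch frontend call that started failing after the
# span-filling commit; input_infos is a list of (name, shape) pairs.
mod, params = relay.frontend.from_pytorch(traced, [("input0", list(example.shape))])
print(mod)
```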
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (1aed136) 90.11% compared to head (c51e005) 87.92%.
Was that last failure transient, or did something new break unrelated to TVM? Please let me know if you are blocked on something and I can try to help!
The CI test has passed but I don't know why...
We'll wait for TVM v0.15.0, which is going to be released on 22 Jan 2024: https://github.com/apache/tvm/issues/16277. After the release, we'll run the CI again.
CI is failing again.
When TVM was installed from source, it was ok. But when it was restored from the cache, the test_extra_trees_tvm_tree_trav_regressor_converter test failed.
Ok log: https://github.com/microsoft/hummingbird/actions/runs/7692350896/job/20959123441 Failed log: https://github.com/microsoft/hummingbird/actions/runs/7693930724/job/20963638667
Seems like tests are a bit flaky when TVM is installed from the cache.
Only failed on ubuntu+python3.9: https://github.com/microsoft/hummingbird/actions/runs/7710611530/job/21014292656 Only failed on ubuntu+python3.8: https://github.com/microsoft/hummingbird/actions/runs/7711820622/job/21017938809
@ksaur @interesaaat Do you have any ideas on this problem?
Hmmm, : line 1: 2222 Aborted (core dumped) pytest
Since it's transient/"a bit flaky", I wonder if it is a memory issue. Have you ever seen it on your local machine, or could you get the coredump? And is it only when installed from the cache?
My 2c is that 30 is, for some reason, too much for TVM_MAX_FUSE_DEPTH. Can we try with 10 and see if it passes?
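For context, a sketch of how a fuse-depth cap usually reaches TVM; this assumes the TVM_MAX_FUSE_DEPTH env var ultimately lands in Relay's relay.FuseOps.max_depth PassContext option, which may not be exactly how HB wires it:

```python
import os
import tvm
from tvm import relay

# Assumption: the CI env var is forwarded into the Relay fusion pass config.
max_fuse_depth = int(os.environ.get("TVM_MAX_FUSE_DEPTH", "30"))

with tvm.transform.PassContext(
    opt_level=3,
    config={"relay.FuseOps.max_depth": max_fuse_depth},
):
    # The converted model would be compiled inside this context, e.g.:
    # lib = relay.build(mod, target="llvm", params=params)
    pass
```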
Thank you for your comments.
@ksaur
have you ever seen it on your local machine or could get the coredump?
No.
And it's only when from cache?
It happens both from the source and from the cache.
@interesaaat
I tried TVM_MAX_FUSE_DEPTH: 10 but it's still flaky :(
1st run (Ok): https://github.com/microsoft/hummingbird/actions/runs/7722097282 2nd run (Ng): https://github.com/microsoft/hummingbird/actions/runs/7723327148
Both runs used TVM from the cache. It always fails only on Ubuntu, never on macOS.
According to the official GitHub docs, GitHub-hosted runners have 16 GB RAM for Linux and 14 GB for macOS. https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
Hmm from the Ng run:
Current thread 0x00007f3036869b80 (most recent call first):
Garbage-collecting
File "/home/runner/work/hummingbird/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 215 in __del__
File "/home/runner/work/hummingbird/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 119 in _make_tvm_args
...
I'm not exactly sure what it's up to, but it definitely looks like a memory issue!
Let me see if we can do large runners....
Ok, I looked into large runners. For some reason that I cannot understand, we do not have access to GitHub Enterprise with this repo, and therefore can't set up large runners. As a backup plan, we could try self-hosted runners in our Azure sub, but that would take a long time to set up and we'd need to figure out a budget.
In the meantime, can you please try something even smaller? Like going from TVM_MAX_FUSE_DEPTH: 10 to 1 or something? (Not sure what a sensible value is.) Or a smaller n_estimators, etc.? (Or just put 0 and it won't fuse; that should avoid the OOM for sure!)
I think we have to check the max memory usage during the test on the local machine.
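As a rough way to do that locally, something like the sketch below could work; it assumes scikit-learn and Hummingbird are installed, and the model size, data shapes, and the tree_implementation extra_config key are assumptions rather than the exact settings of the failing test:

```python
import resource
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from hummingbird.ml import convert

X = np.random.rand(1000, 20).astype(np.float32)
y = np.random.rand(1000)
model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X, y)

# Convert with the TVM backend and the tree_trav implementation, loosely
# mirroring the flaky tree_trav converter tests.
hb_model = convert(model, "tvm", X, extra_config={"tree_implementation": "tree_trav"})
hb_model.predict(X)

# ru_maxrss is reported in KiB on Linux (bytes on macOS).
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak / 1024:.1f} MiB")
```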
Thanks for pushing so hard on this!! :) I'm wondering if it's too risky to do something like a retry on those tests? But that could take a long time.
Maybe we could also mark those tests as ok-to-fail (just for the ubuntu TVM ones with the memory issues). Both of those ideas are a bit sloppy, but I'm not sure of a better way to unblock it.
pytest-memray provides a marker to limit memory usage on a specific test. Usage - pytest-memray
Can we do something like: if the test tries to allocate more than XX MB, stop the test and mark it as skipped because memory usage is too high?
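A minimal sketch of the marker, assuming the suite is run with pytest --memray; note that limit_memory fails the test when the limit is exceeded rather than skipping it, and the 500 MB threshold is just a placeholder:

```python
import pytest

@pytest.mark.limit_memory("500 MB")  # placeholder threshold, not a measured value
def test_extra_trees_tvm_tree_trav_regressor_converter():
    ...  # existing test body; the marker fails the test if allocations exceed the limit
```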
The memory issue is happening on the TVM side. I'll work on it for a few more days to see if there's any way to solve it.
Or maybe we can temporarily disable all the TVM tests on Ubuntu. They're also tested on macOS.
I'm fine with this solution (skipping TVM tests on Ubu)! (Unless there are OS-specific TVM issues)? What do you think @interesaaat?
Note for the memory issue investigation:
- Error: : line 1: 2222 Aborted (core dumped) pytest
- Smaller n_estimators (n_estimators=5) for test_random_forest_tvm_tree_trav_classifier_converter
- TVM_MAX_FUSE_DEPTH: 0
- Running pytest under a memory limit: systemd-run --scope -p MemoryMax=8G --user pytest
Hi @mshr-h, thanks for all your work on this!! (Especially thank you for the nice summary of the memory issue.) Is the PR still a WIP? I like the is_on_github_actions :) Please let us know if you get stuck or want us to take a look!!
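For reference, a guess at what such a helper could look like; GitHub Actions sets GITHUB_ACTIONS=true on its runners, but the actual helper in the PR may differ:

```python
import os
import sys
import unittest

def is_on_github_actions() -> bool:
    # GitHub Actions always sets GITHUB_ACTIONS=true in the runner environment.
    return os.environ.get("GITHUB_ACTIONS", "").lower() == "true"

# Example of how it could gate the flaky TVM tests on Linux CI only:
skip_on_gh_linux = unittest.skipIf(
    is_on_github_actions() and sys.platform.startswith("linux"),
    "TVM tests are flaky (OOM) on Ubuntu GitHub-hosted runners",
)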
Should be good.
@ksaur @interesaaat Thank you for your support! Can you review it?
This looks good to me!! @interesaaat take a look and merge if ready
TVM doesn't provide a stable binary package for Python 3.11, so we'll build it from source in the CI pipeline. Also, we'll update TVM to the latest stable release. Related to #643
TODO