Sorry to disturb. I suppose issues may be ignored, so I'm asking @rmshin for help directly.
When working on cs149gpt, I found some undefined symbols in module_ref.so, which makes the reference implementation unusable. I believe it is related to the PyTorch version, but the README doesn't cover that. I would appreciate it if anyone could provide some info about the setup (e.g. a PyTorch version that works).
P.S. It works when I comment out all of the ms-related code.
undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
echo "_ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE" | c++filt
# output
at::_ops::zeros::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)
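For context, the demangled name is the ATen operator that torch::zeros dispatches to. Something along the lines of the following minimal snippet (purely illustrative, not the assignment's actual module code) would reference that symbol, so my guess is that module_ref.so was built against a libtorch whose zeros operator signature differs from the one installed:

```cpp
// Minimal sketch, assuming a standard libtorch / PyTorch C++ extension setup.
// torch::zeros ultimately dispatches through at::_ops::zeros::call, the symbol
// that c++filt demangles above. If the prebuilt module_ref.so was compiled
// against a different libtorch version than the one installed, the symbol may
// be missing at load time, producing exactly this undefined-symbol error.
#include <torch/torch.h>
#include <iostream>

int main() {
    torch::Tensor t = torch::zeros({4, 4});  // references the zeros operator
    std::cout << t.sizes() << std::endl;
    return 0;
}
```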
hey @Malfurionzz, I don't know if this will help, but I ran all the assignments with the cuda:12.3.1-devel-ubuntu22.04 docker image and simply installed the relevant libraries without pinning versions:
apt install python3-pip
pip3 install torch ninja tiktoken
git clone https://github.com/rmshin/cs149gpt.git && cd cs149gpt
Hope that helps, happy to pair on this at some point if you can't get it working on your end :)
Very nice of you. I tried the default torch version with the docker image, but it didn't help. However, I luckily found that torch==2.1.2 works (for both the docker and physical environments). Hope it helps anyone who needs it.
# ubuntu 22.04
# pip install torch==2.1.2
Your method solved my problem, thank you very much for your help!
Hello, after completing the various parts of the cs149gpt assignment, I observed some interesting performance numbers for my solutions vs. the reference implementations that I'd like to better understand. If anybody would be able to help provide more insight into why I'm seeing the metrics that I'm seeing given my code, that would be highly appreciated!
For the NAIVE ATTENTION implementation expected in part1, I wrote fairly standard nested-loop code, with the only exception being that I also applied loop reordering to ensure sequential access of the matrices manipulated within the inner-most loops. That code produces metrics that are significantly faster (~2x) than the reference, and the speed-up carries through for larger values of N (e.g. -N 4096).
My understanding is that a speedup is expected here, given that reordering the inner loops makes better use of the CPU cache: the inner-most loop sweeps across contiguous elements of a row (varying the column index fastest) instead of striding down columns, and therefore gets far more cache hits.
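For concreteness, here is a simplified standalone matmul sketch of the kind of loop reordering I mean (my own illustration, not my actual attention code from the assignment):

```cpp
#include <algorithm>
#include <vector>

// Naive i-j-k order: the inner-most loop walks B down a column, i.e. with
// stride N, so for large N almost every access to B misses the cache.
void matmul_ijk(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int N) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

// Reordered i-k-j order: the inner-most loop walks B and C along contiguous
// rows, so accesses are sequential and cache/prefetcher friendly.
void matmul_ikj(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int N) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            float a = A[i * N + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
}
```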
What's interesting, however, is when I then run the blocked version of the same code for part2: looking at the resulting performance metrics, both for the default N and for larger values (e.g. -N 4096), there are a couple of things that I find confusing.
First, the reference solution itself shows a much smaller speedup between part 1 & 2 in my execution environment than indicated in the README.md (12~15% vs. >30%). I assume this has something to do with the difference in underlying hardware, as I ran the test scripts on a rented cloud machine with an AMD EPYC 7302P 16-core processor, though I'm not really sure this is true. Would there be any other explanations for why the reference solution doesn't show performance improvements similar to those indicated in the repo README?

Second, the blocked version of my solution (with loop re-ordering) shows even less speedup going from part 1 to part 2 than the reference does - small enough that it seems almost negligible for both smaller and larger values of N (6~8%). This is in contrast to the loop re-ordering itself which, while extremely simple to implement, led to a >2x speedup against the reference. Implementing cache-aware blocked matmul seems to provide almost no additional benefit when loop re-ordering is already in place. Is this to be expected?
Even though loop re-ordering addresses the issue of high cache-miss rates in the naive row-by-column matmul, I'd have expected that, given the small size of L1 caches, blocking would still cut the overall number of accesses to main memory and therefore give measurably faster processing times. From my experiments & observations thus far, it seems matrix tiling is not worth the extra effort if the inner loops are already re-ordered to be CPU cache-friendly.
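For reference, this is roughly the shape of the blocked version I'm describing (again a simplified standalone sketch with an assumed tile size, not my actual part 2 code):

```cpp
#include <algorithm>
#include <vector>

// Cache-blocked (tiled) matmul sketch keeping the same i-k-j ordering inside
// each tile. TILE is an assumed illustrative value, not tuned for any
// particular cache; the idea is that the working set of the three sub-blocks
// stays resident in cache while it is reused.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int N) {
    constexpr int TILE = 64;  // assumed tile size for illustration
    std::fill(C.begin(), C.end(), 0.0f);
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                // Per-tile loops, clamped at the matrix edge.
                for (int i = ii; i < std::min(ii + TILE, N); i++)
                    for (int k = kk; k < std::min(kk + TILE, N); k++) {
                        float a = A[i * N + k];
                        for (int j = jj; j < std::min(jj + TILE, N); j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```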
Any further guidance on this topic would be very helpful!