Closed: zkbitcoin closed this issue 2 months ago
By overriding c right after https://github.com/ingonyama-zk/icicle/blob/main/icicle/src/msm/msm.cu#L837, setting c = 8;
the bench was able to run a bit further (a bit slower, but not by a huge amount),
but then it ran out of memory on a larger size, at:
bls12_377 MSM / 131072 x 16 with precomp = 1
The card is not that old. I am now trying c = 4, but this is hacky because it is not done through an env variable or similar.
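To see why changing c trades one kind of memory for another, here is a rough back-of-the-envelope sketch of the textbook bucket (Pippenger) method. It is not ICICLE's actual allocation logic: it assumes 253-bit bls12_377 scalars, ignores precompute_factor, large_bucket_factor and any temporary sort storage, and every function name in it is hypothetical.

```rust
/// Number of c-bit windows needed to cover a scalar of `bitsize` bits.
fn num_windows(bitsize: u32, c: u32) -> u32 {
    (bitsize + c - 1) / c
}

/// Buckets for one MSM in the textbook method: 2^c per window.
fn total_buckets(bitsize: u32, c: u32) -> u64 {
    u64::from(num_windows(bitsize, c)) << c
}

/// (scalar, window) digits that must be indexed across the whole batch.
fn digit_entries(msm_size: u64, batch_size: u64, bitsize: u32, c: u32) -> u64 {
    msm_size * batch_size * u64::from(num_windows(bitsize, c))
}

fn main() {
    // Assumption: bls12_377 scalars are 253 bits wide.
    const BITSIZE: u32 = 253;
    for c in [4, 8, 16] {
        println!(
            "c = {}: windows = {}, buckets = {}, digits for 65536 x 128 = {}",
            c,
            num_windows(BITSIZE, c),
            total_buckets(BITSIZE, c),
            digit_entries(65536, 128, BITSIZE, c)
        );
    }
}
```

In this simplified model, going from c = 8 to c = 4 shrinks the bucket count from 8192 to 1024 but doubles the number of windows from 32 to 64, so the per-scalar index data grows. Whether a smaller c helps on a given card therefore depends on which allocation dominates.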
I will let others comment more to the point about your idea with ICICLE_BATCH_SIZE. Just wanted to say that the card is from 2017, which is old.
I know, I feel bad; technology moves so fast. Thank you so much. I am trying c = 4. As with many other projects, I am doing this for education, research, and learning. I will let you know if 4 succeeds.
So far (lol, but expected):
bls12_377 MSM / 32768 x 1 with precomp = 1
time: [605.63 ms 605.70 ms 605.77 ms]
change: [+507.24% +507.82% +508.40%] (p = 0.00 < 0.05)
Performance has regressed.
@omershlo
One could say there is a limit beyond which only more modern cards should be used.
@zkbitcoin you can limit the benchmark size by setting MAX_LOG2 and MIN_LOG2 env vars in your runs
I did exactly that at the very moment you made this comment.
export MAX_LOG2=15 succeeds. It is flexible, or already was. I can close this ticket as "functions as designed" if you want.
Functions as designed: use the MAX_LOG2 and MIN_LOG2 environment variables to set lower limits for any given card.
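As an illustration of how a benchmark can honor such limits, here is a minimal sketch that reads MIN_LOG2 and MAX_LOG2 and turns them into a list of MSM sizes. Only the two variable names come from this thread; the helper names and the default values are assumptions, and the real ICICLE bench code may be organized differently.

```rust
use std::env;

/// Parse an optional env value as a log2 size, falling back to `default`
/// when the variable is unset or not a valid number.
fn parse_log2(raw: Option<&str>, default: u32) -> u32 {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

/// Read `name` from the process environment (hypothetical helper).
fn log2_from_env(name: &str, default: u32) -> u32 {
    parse_log2(env::var(name).ok().as_deref(), default)
}

/// MSM sizes 2^min_log2 through 2^max_log2, inclusive.
/// Returns an empty list when min_log2 > max_log2.
fn bench_sizes(min_log2: u32, max_log2: u32) -> Vec<usize> {
    (min_log2..=max_log2).map(|l| 1usize << l).collect()
}

fn main() {
    // The defaults here are placeholders, not the values ICICLE uses.
    let min = log2_from_env("MIN_LOG2", 10);
    let max = log2_from_env("MAX_LOG2", 15);
    println!("benchmark sizes: {:?}", bench_sizes(min, max));
}
```

With a scheme like this, running MIN_LOG2=13 MAX_LOG2=15 cargo bench would cap the MSM size at 2^15 = 32768 points, which stays below the 65536 and 131072 sizes that ran out of memory above.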
Description
In /git/icicle/wrappers/rust/icicle-curves/icicle-bls12-377, run:
cargo build ; cargo bench
The run fails with this out-of-memory message:
Benchmarking bls12_377 MSM / 65536 x 16 with precomp = 1: Collecting 10 samples in estimated 12.647 s (10 ite
bls12_377 MSM / 65536 x 16 with precomp = 1
time: [1.4704 s 1.4706 s 1.4708 s]
change: [-0.0163% +0.0110% +0.0335%] (p = 0.43 > 0.05)
No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
Benchmarking bls12_377 MSM / 65536 x 128 with precomp = 1: Warming up for 3.0000 s
CUDA Runtime Error by: cudaMallocAsync(&sort_indices_temp_storage, sort_indices_temp_storage_bytes, stream) at: /home/amilkowski/git/icicle/icicle/src/msm/msm.cu:450 out of memory
CUDA Runtime Error by: bucket_method_msm( bitsize, c, scalars, points, config.batch_size, msm_size, (config.points_size == 0) ? msm_size : config.points_size, results, config.are_scalars_on_device, config.are_scalars_montgomery_form, config.are_points_on_device, config.are_points_montgomery_form, config.are_results_on_device, config.is_big_triangle, config.large_bucket_factor, config.precompute_factor, config.is_async, stream) at: /home/amilkowski/git/icicle/icicle/src/msm/msm.cu:845 out of memory
thread 'main' panicked at icicle-curves/icicle-bls12-377/benches/msm.rs:5:1:
called `Result::unwrap()` on an `Err` value: IcicleError { icicle_error_code: InternalCudaError, cuda_error: Some(cudaErrorMemoryAllocation), reason: Some("Runtime CUDA error.") }
Reproduce
Using the same GPU card and environment, repeat the steps above.
Expected Behavior
The software should make it possible to work within the memory limit of a card, for instance by lowering batch_size or other variables, e.g. via export ICICLE_BATCH_SIZE=8.
Environment
OS + Version: Ubuntu 22.04.4 LTS (Linux)
Cargo Version: 1.78.0
GPU type: GeForce GTX 1080 Ti
Additional context
I think just the ability to override the batch size (and perhaps other parameters) would be helpful to keep testing and benchmarking going.