h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

Mac M1 is crashing #1359

Closed. kamrancr7 closed this issue 8 months ago.

kamrancr7 commented 9 months ago

While running the following command, my Mac M1 crashes:

kamranali@Kamrans-MacBook-Pro h2o % python3 generate.py --base_model=TheBloke/zephyr-7B-beta-GGUF --prompt_type=zephyr --max_seq_len=4096
soundfile, librosa, and wavio not installed, disabling STT
soundfile, librosa, and wavio not installed, disabling TTS
Using Model llama
Must install DocTR and LangChain installed if enabled DocTR, disabling
Starting get_model: llama 
Already have llamacpp_path/zephyr-7b-beta.Q5_K_M.gguf from url https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q5_K_M.gguf?download=true, delete file if invalid
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from llamacpp_path/zephyr-7b-beta.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = huggingfaceh4_zephyr-7b-beta
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW) 
llm_load_print_meta: general.name     = huggingfaceh4_zephyr-7b-beta
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  4096.00 MiB, offs =            0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   900.27 MiB, offs =   4187422720, ( 4996.33 /  5461.34)
warning: failed to mlock 5131419648-byte buffer (after previously locking 0 bytes): Resource temporarily unavailable
llm_load_tensors: system memory used  = 4893.10 MiB
.............................................................................................warning: failed to mlock 107528192-byte buffer (after previously locking 4817338368 bytes): Resource temporarily unavailable
......
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  5726.63 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   512.00 MiB, ( 5508.95 /  5461.34)
ggml_backend_metal_buffer_type_alloc_buffer: warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 5508.97 /  5461.34)
ggml_backend_metal_buffer_type_alloc_buffer: warning: current allocated size is greater than the recommended max working set size
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 579.20 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   576.02 MiB, ( 6084.97 /  5461.34)
ggml_backend_metal_buffer_type_alloc_buffer: warning: current allocated size is greater than the recommended max working set size
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
ggml_metal_graph_compute: command buffer 3 failed with status 5
GGML_ASSERT: /private/var/folders/04/4cp_4dx92nzd02j3kmgjhhhm0000gn/T/pip-install-_sop_2s8/llama-cpp-python_5d55959272714b1697d8e57d22b5373b/vendor/llama.cpp/ggml-metal.m:2385: false
Fatal Python error: Aborted

Current thread 0x00000001eb6a2100 (most recent call first):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_cpp/llama_cpp.py", line 1491 in llama_decode
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_cpp/llama.py", line 469 in decode
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_cpp/llama.py", line 1069 in eval
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_cpp/llama.py", line 1248 in generate
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_cpp/llama.py", line 1473 in _create_completion
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_cpp/llama.py", line 1947 in create_completion
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/llama_cpp/llama.py", line 2014 in __call__
  File "/Users/kamranali/Development/ML_Project/h2o/src/gpt4all_llm.py", line 195 in get_llm_gpt4all
  File "/Users/kamranali/Development/ML_Project/h2o/src/gpt4all_llm.py", line 30 in get_model_tokenizer_gpt4all
  File "/Users/kamranali/Development/ML_Project/h2o/src/gen.py", line 2763 in get_model
  File "/Users/kamranali/Development/ML_Project/h2o/src/gen.py", line 2351 in get_model_retry
  File "/Users/kamranali/Development/ML_Project/h2o/src/gen.py", line 2023 in main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fire/core.py", line 691 in _CallAndUpdateTrace
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fire/core.py", line 475 in _Fire
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/fire/core.py", line 141 in Fire
  File "/Users/kamranali/Development/ML_Project/h2o/src/utils.py", line 65 in H2O_Fire
  File "/Users/kamranali/Development/ML_Project/h2o/generate.py", line 12 in entrypoint_main
  File "/Users/kamranali/Development/ML_Project/h2o/generate.py", line 16 in <module>

Extension modules: simplejson._speedups, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, lz4._version, lz4.frame._frame, psutil._psutil_osx, psutil._psutil_posix, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, matplotlib._c_internal_utils, PIL._imaging, matplotlib._path, kiwisolver._cext, matplotlib._image, yaml._yaml, sentencepiece._sentencepiece, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, google._upb._message, grpc._cython.cygrpc, _cffi_backend, greenlet._greenlet, sqlalchemy.cyextension.collections, sqlalchemy.cyextension.immutabledict, sqlalchemy.cyextension.processors, sqlalchemy.cyextension.resultproxy, sqlalchemy.cyextension.util, PIL._webp, scipy._lib._ccallback_c, numpy.linalg.lapack_lite, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, 
scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, regex._regex, sklearn.__check_build._check_build, sklearn.utils.murmurhash, sklearn.utils._isfinite, sklearn.utils._openmp_helpers, sklearn.utils._vector_sentinel, sklearn.feature_extraction._hashing_fast, sklearn.utils._logistic_sigmoid, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.utils._cython_blas, sklearn.svm._libsvm, sklearn.svm._liblinear, sklearn.svm._libsvm_sparse, sklearn.utils._random, sklearn.utils._seq_dataset, sklearn.utils.arrayfuncs, sklearn.utils._typedefs, sklearn.utils._readonly_array_wrapper, sklearn.metrics._dist_metrics, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_fast, sklearn.linear_model._cd_fast, sklearn._loss._loss, sklearn.utils._weight_vector, sklearn.linear_model._sgd_fast, sklearn.linear_model._sag_fast, sklearn.datasets._svmlight_format_fast, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils (total: 233)
zsh: abort      python3 generate.py --base_model=TheBloke/zephyr-7B-beta-GGUF  
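
What the trace shows: Metal reports a recommendedMaxWorkingSetSize of 5726.63 MB (5461.34 MiB) on this 8 GB M1, but the weights (~4996 MiB) plus the 512 MiB KV cache and ~576 MiB compute buffer push total Metal allocations to ~6085 MiB. Status 5 is MTLCommandBufferStatusError, and on memory-constrained machines the usual trigger is exactly this overshoot of the working set; the mlock warnings are a separate symptom of use_mlock=True exceeding the locked-memory limit. A lower-footprint relaunch, as a sketch only (the --llamacpp_dict keys mirror the ones h2oGPT prints in its Model dict later in this thread; the specific values are illustrative guesses, not tuned settings):

# Sketch, assuming an 8 GB M1: shrink the context, skip GPU offload, and
# disable mlock so allocations stay under the ~5.4 GiB working-set limit.
python3 generate.py --base_model=TheBloke/zephyr-7B-beta-GGUF --prompt_type=zephyr \
    --max_seq_len=2048 \
    --llamacpp_dict="{'n_gpu_layers': 0, 'use_mlock': False, 'n_batch': 128}"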

kamranali@Kamrans-MacBook-Pro h2o % python3 generate.py --base_model=TheBloke/zephyr-7B-beta-GGUF --prompt_type=zephyr --max_seq_len=4096 --load_8bit=True
soundfile, librosa, and wavio not installed, disabling STT
soundfile, librosa, and wavio not installed, disabling TTS
Using Model llama
Must install DocTR and LangChain installed if enabled DocTR, disabling
Starting get_model: llama 
Already have llamacpp_path/zephyr-7b-beta.Q5_K_M.gguf from url https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q5_K_M.gguf?download=true, delete file if invalid
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from llamacpp_path/zephyr-7b-beta.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = huggingfaceh4_zephyr-7b-beta
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW) 
llm_load_print_meta: general.name     = huggingfaceh4_zephyr-7b-beta
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  4096.00 MiB, offs =            0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   900.27 MiB, offs =   4187422720, ( 4996.33 /  5461.34)
warning: failed to mlock 5131419648-byte buffer (after previously locking 0 bytes): Resource temporarily unavailable
llm_load_tensors: system memory used  = 4893.10 MiB
.............................................................................................
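
A note on the --load_8bit=True retry: that flag targets Hugging Face transformer checkpoints (bitsandbytes quantization) and, as far as I can tell, is ignored for llama.cpp GGUF models, so this run allocates the same buffers as the first. An alternative that actually shrinks the footprint is a smaller quant of the same model: TheBloke's repo also publishes zephyr-7b-beta.Q4_K_M.gguf at roughly 4.4 GiB versus 4.78 GiB for Q5_K_M. A hedged sketch (model_path_llama is the same llamacpp_dict key shown in the Model dict later in this thread; the Q4_K_M URL is assumed by analogy with the Q5_K_M link above):

# Sketch only: point llamacpp_dict at the smaller quant so weights plus KV
# cache fit the M1's recommended working set with some headroom.
python3 generate.py --base_model=TheBloke/zephyr-7B-beta-GGUF --prompt_type=zephyr \
    --max_seq_len=2048 \
    --llamacpp_dict="{'model_path_llama': 'https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf?download=true', 'use_mlock': False}"
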
pseudotensor commented 9 months ago

ggml_metal_graph_compute: command buffer 3 failed with status 5
GGML_ASSERT: /private/var/folders/04/4cp_4dx92nzd02j3kmgjhhhm0000gn/T/pip-install-_sop_2s8/llama-cpp-python_5d55959272714b1697d8e57d22b5373b/vendor/llama.cpp/ggml-metal.m:2385: false
Fatal Python error: Aborted

@Mathanraj-Sharma Can you check? Maybe we need to update llama_cpp_python or something.
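
For reference, one way to do that update on Apple Silicon, as a sketch (this mirrors the common from-source install pattern for llama-cpp-python at the time; the exact CMake flag and any version pin h2oGPT expects may differ):

# Rebuild llama-cpp-python with Metal enabled, bypassing any cached wheel.
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 \
    pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python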

Mathanraj-Sharma commented 8 months ago

@kamrancr7 I checked with the latest main branch following instructions at https://github.com/h2oai/h2ogpt/blob/main/docs/README_MACOS.md

I did not see any errors. Please re-open this issue if you still have problems.

cc: @pseudotensor

~/Desktop/h2ogpt on main ❯ python generate.py --base_model=TheBloke/zephyr-7B-beta-GGUF --prompt_type=zephyr --max_seq_len=4096
soundfile, librosa, and wavio not installed, disabling STT
soundfile, librosa, and wavio not installed, disabling TTS
Using Model llama
Starting get_model: llama 
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from llamacpp_path/zephyr-7b-beta.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = huggingfaceh4_zephyr-7b-beta
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW) 
llm_load_print_meta: general.name     = huggingfaceh4_zephyr-7b-beta
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  4807.08 MiB, ( 4807.14 / 10922.67)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    85.94 MiB
llm_load_tensors:      Metal buffer size =  4807.07 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/mathanraj/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   512.00 MiB, ( 5320.95 / 10922.67)
llama_kv_cache_init:      Metal KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    36.04 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   592.02 MiB, ( 5912.97 / 10922.67)
llama_new_context_with_model:      Metal compute buffer size =   592.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.00 MiB
llama_new_context_with_model: graph splits (measure): 2
AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'general.quantization_version': '2', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.model': 'llama', 'llama.attention.head_count_kv': '8', 'llama.context_length': '32768', 'llama.attention.head_count': '32', 'llama.rope.freq_base': '10000.000000', 'llama.rope.dimension_count': '128', 'general.file_type': '17', 'llama.feed_forward_length': '14336', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'general.architecture': 'llama', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'general.name': 'huggingfaceh4_zephyr-7b-beta'}
Model {'base_model': 'llama', 'base_model0': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'zephyr', 'prompt_dict': {'promptA': '<|system|>\nYou are an AI that follows instructions extremely well and as helpful as possible.</s>\n', 'promptB': '<|system|>\nYou are an AI that follows instructions extremely well and as helpful as possible.</s>\n', 'PreInstruct': '<|user|>\n', 'PreInput': None, 'PreResponse': '</s>\n<|assistant|>\n', 'terminate_response': ['<|assistant|>', '</s>'], 'chat_sep': '', 'chat_turn_sep': '</s>\n', 'humanstr': '<|user|>', 'botstr': '<|assistant|>', 'generates_leading_space': False, 'system_prompt': 'You are an AI that follows instructions extremely well and as helpful as possible.', 'can_handle_system_prompt': True}, 'visible_models': None, 'h2ogpt_key': None, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': False, 'use_flash_attention_2': False, 'load_gptq': '', 'load_awq': '', 'load_exllama': False, 'use_safetensors': False, 'revision': None, 'use_gpu_id': False, 'gpu_id': None, 'compile_model': None, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': 'https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q5_K_M.gguf?download=true', 'model_name_gptj': '', 'model_name_gpt4all_llama': '', 'model_name_exllama_if_no_config': ''}, 'rope_scaling': {}, 'max_seq_len': 4096, 'max_output_seq_len': None, 'exllama_dict': {}, 'gptq_dict': {}, 'attention_sinks': False, 'sink_dict': {}, 'truncation_generation': False, 'hf_model_dict': {}}
Begin auto-detect HF cache text generation models
/Users/mathanraj/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/1a1d410c70591fcc1a46486a254cd0e600e7b1b4/configuration_mpt.py:114: UserWarning: alibi or rope is turned on, setting `learned_pos_emb` to `False.`
  warnings.warn(f'alibi or rope is turned on, setting `learned_pos_emb` to `False.`')
/Users/mathanraj/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/1a1d410c70591fcc1a46486a254cd0e600e7b1b4/configuration_mpt.py:141: UserWarning: If not using a Prefix Language Model, we recommend setting "attn_impl" to "flash" instead of "triton".
  warnings.warn(UserWarning('If not using a Prefix Language Model, we recommend setting "attn_impl" to "flash" instead of "triton".'))
End auto-detect HF cache text generation models
Begin auto-detect llama.cpp models
End auto-detect llama.cpp models
/Users/mathanraj/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/gradio/components/dropdown.py:173: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: None or set allow_custom_value=True.
  warnings.warn(
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Started Gradio Server and/or GUI: server_name: localhost port: None
Use local URL: http://localhost:7860/
/Users/mathanraj/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/pydantic/_internal/_fields.py:151: UserWarning: Field "model_name" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
/Users/mathanraj/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/pydantic/_internal/_fields.py:151: UserWarning: Field "model_names" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
OpenAI API URL: http://0.0.0.0:5001
INFO:__name__:OpenAI API URL: http://0.0.0.0:5001
OpenAI API key: EMPTY
INFO:__name__:OpenAI API key: EMPTY
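
Since the log above reports an OpenAI-compatible server on port 5001, a quick smoke test can go through the standard chat completions route. A hedged example (the endpoint path follows the usual OpenAI convention rather than anything confirmed in this thread; "llama" is a placeholder model name, and any bearer token works because the key is reported as EMPTY):

# Hypothetical request against the OpenAI-compatible proxy reported above.
curl http://localhost:5001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer EMPTY" \
    -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'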

[Screenshot: 2024-02-29 at 3:59:53 PM]