Unable to run this morning, yesterday it ran fine

mmr-crexi commented 1 year ago

Morning! I need help getting the models to run a second time, on a new instance.

Yesterday, I registered for and downloaded the models onto an AWS sagemaker instance. Everything worked fine and I was able to run

pip install -e .

And from there experiment with the models. I shut down the instance and this morning started it again. I reran the pip installation, but now, everything hangs at this step:

sh-4.2$ torchrun --nproc_per_node 4 example_instructions.py     --ckpt_dir CodeLlama-34b-Instruct/     --tokenizer_path CodeLlama-34b-Instruct/tokenizer.model     --max_seq_len 512 --max_batch_size 4
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 4
> initializing ddp with size 1
> initializing pipeline with size 1

This same code would finish loading the model after 8 seconds or so and be good to go. I've tried this with the 7b instruct model, the 13b instruct, and the 34b instruct; all worked fine yesterday, none work today.

How can I make this work? Did I forget some crucial step?

For the rest of this bug report, it's basically how I arrived at the conclusion that

checkpoint = torch.load(ckpt_path, map_location="cpu")

Is not working, and I'm not sure why. Once I get to that point, the RAM usage rises from 1.8GB to 28.9GB, so it looks like it's at least found the first file in the checkpoint. This instance g5.12xlarge has 196GB and 4 24GB GPUs (and everything worked yesterday).

To figure this all out, I went into generation.py in the llama directory, and I added in some line number inspections. I added in lines like:

from inspect import getframeinfo, currentframe

print(f"Got to {getframeinfo(currentframe()).lineno}")

The code in generation now looks like:

        print(f"Got to {getframeinfo(currentframe()).lineno}")
        assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
        assert model_parallel_size == len(
            checkpoints
        ), f"Loading a checkpoint for MP={len(checkpoints)} but world size is {model_parallel_size}"

        print(f"Got to {getframeinfo(currentframe()).lineno}")
        ckpt_path = checkpoints[get_model_parallel_rank()]
        print(f"Got to {getframeinfo(currentframe()).lineno}")
        checkpoint = torch.load(ckpt_path, map_location="cpu")
        print(f"Got to {getframeinfo(currentframe()).lineno}")
        with open(Path(ckpt_dir) / "params.json", "r") as f:
            params = json.loads(f.read())
        print(f"Got to {getframeinfo(currentframe()).lineno}")
        model_args: ModelArgs = ModelArgs(
            max_seq_len=max_seq_len,
            max_batch_size=max_batch_size,
            **params,
        )
        tokenizer = Tokenizer(model_path=tokenizer_path)
        model_args.vocab_size = tokenizer.n_words
        print(f"Got to {getframeinfo(currentframe()).lineno}")

and the run output looks like:

sh-4.2$ torchrun --nproc_per_node 4 example_instructions.py     --ckpt_dir CodeLlama-34b-Instruct/     --tokenizer_path CodeLlama-34b-Instruct/tokenizer.model     --max_seq_len 512 --max_batch_size 4
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 4
> initializing ddp with size 1
> initializing pipeline with size 1
Got to 86
Got to 92
Got to 94

which is checkpoint = torch.load(ckpt_path, map_location="cpu")

My pip freeze:

sh-4.2$ pip freeze
aiobotocore @ file:///home/conda/feedstock_root/build_artifacts/aiobotocore_1691451276487/work
aiofiles==22.1.0
aiohttp @ file:///home/conda/feedstock_root/build_artifacts/aiohttp_1689804989077/work
aioitertools @ file:///home/conda/feedstock_root/build_artifacts/aioitertools_1663521246073/work
aiosignal @ file:///home/conda/feedstock_root/build_artifacts/aiosignal_1667935791922/work
aiosqlite==0.19.0
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1688651106312/work/dist
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1640817743617/work
argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1666850768662/work
arrow @ file:///home/conda/feedstock_root/build_artifacts/arrow_1662382474514/work
astroid==2.15.6
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1670263926556/work
async-timeout @ file:///home/conda/feedstock_root/build_artifacts/async-timeout_1691763562544/work
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1683424013410/work
autopep8==2.0.2
autovizwidget @ file:///home/conda/feedstock_root/build_artifacts/autovizwidget_1680800327357/work
awscli==1.29.28
Babel==2.12.1
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1687772187254/work
beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1680888073205/work
bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1674535352125/work
boto3==1.28.28
botocore==1.31.28
brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1666764671472/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
certifi==2023.7.22
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1671179353105/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1688813409104/work
cloudpickle==2.2.1
cmake==3.27.2
-e git+ssh://git@github.com/facebookresearch/codellama.git@cb51c14ec761370ba2e2bc351374a79265d0465e#egg=codellama
colorama==0.4.4
comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1691044910542/work
contextlib2==21.6.0
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography-split_1672672382195/work
debugpy @ file:///home/conda/feedstock_root/build_artifacts/debugpy_1691021228385/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
defusedxml @ file:///home/conda/feedstock_root/build_artifacts/defusedxml_1615232257335/work
dill==0.3.7
docker==6.1.3
docstring-to-markdown==0.12
docutils==0.16
entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work
environment-kernels==1.2.0
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1692026125334/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1667317341051/work
fairscale==0.4.13
fastjsonschema @ file:///home/conda/feedstock_root/build_artifacts/python-fastjsonschema_1690055433477/work/dist
filelock==3.12.3
fire==0.5.0
flit_core @ file:///home/conda/feedstock_root/build_artifacts/flit-core_1684084314667/work/source/flit_core
fqdn @ file:///home/conda/feedstock_root/build_artifacts/fqdn_1638810296540/work/dist
frozenlist @ file:///home/conda/feedstock_root/build_artifacts/frozenlist_1689244399117/work
fsspec @ file:///home/conda/feedstock_root/build_artifacts/fsspec_1626188337504/work
gitdb==4.0.10
GitPython==3.1.32
google-pasta==0.2.0
hdijupyterutils @ file:///home/conda/feedstock_root/build_artifacts/hdijupyterutils_1680800332182/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1663625384323/work
importlib-metadata==6.8.0
importlib-resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1691408075105/work
ipykernel==5.5.6
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1685727741709/work
ipython-genutils==0.2.0
ipywidgets @ file:///home/conda/feedstock_root/build_artifacts/ipywidgets_1690877070294/work
isoduration @ file:///home/conda/feedstock_root/build_artifacts/isoduration_1638811571363/work/dist
isort==5.12.0
jedi==0.18.2
Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1654302431367/work
jmespath @ file:///home/conda/feedstock_root/build_artifacts/jmespath_1655568249366/work
json5==0.9.14
jsonpointer==2.0
jsonschema @ file:///home/conda/feedstock_root/build_artifacts/jsonschema-meta_1691761378595/work
jsonschema-specifications @ file:///home/conda/feedstock_root/build_artifacts/jsonschema-specifications_1689701150890/work
jupyter @ file:///home/conda/feedstock_root/build_artifacts/jupyter_1670249595582/work
jupyter-console @ file:///home/conda/feedstock_root/build_artifacts/jupyter_console_1678118109161/work
jupyter-events @ file:///home/conda/feedstock_root/build_artifacts/jupyter_events_1691505939576/work
jupyter-lsp==2.2.0
jupyter-server-mathjax==0.2.6
jupyter-server-proxy @ git+https://github.com/jupyterhub/jupyter-server-proxy@2d7dd346bb595106b417476de870a348943f3c70
jupyter-ydoc==0.2.5
jupyter_client==7.4.9
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1686775611663/work
jupyter_server @ file:///home/conda/feedstock_root/build_artifacts/jupyter_server_1692108700252/work
jupyter_server_fileid==0.9.0
jupyter_server_terminals @ file:///home/conda/feedstock_root/build_artifacts/jupyter_server_terminals_1673491454549/work
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.5
jupyterlab-git==0.41.0
jupyterlab-lsp==4.2.0
jupyterlab-pygments @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_pygments_1649936611996/work
jupyterlab-widgets @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_widgets_1688489450369/work
jupyterlab_server==2.24.0
lazy-object-proxy==1.9.0
lit==16.0.6
MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1685769049201/work
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work
mccabe==0.7.0
mistune @ file:///home/conda/feedstock_root/build_artifacts/mistune_1675771498296/work
mock @ file:///home/conda/feedstock_root/build_artifacts/mock_1689092066756/work
mpmath==1.3.0
multidict @ file:///home/conda/feedstock_root/build_artifacts/multidict_1672339403932/work
multiprocess==0.70.15
nb-conda @ file:///home/conda/feedstock_root/build_artifacts/nb_conda_1654442778977/work
nb-conda-kernels @ file:///home/conda/feedstock_root/build_artifacts/nb_conda_kernels_1667060632461/work
nbclassic @ file:///home/conda/feedstock_root/build_artifacts/nbclassic_1675369808718/work
nbclient @ file:///home/conda/feedstock_root/build_artifacts/nbclient_1684790896106/work
nbconvert @ file:///home/conda/feedstock_root/build_artifacts/nbconvert-meta_1674590374792/work
nbdime==3.2.1
nbexamples @ file:///opt/nbexamples
nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1690814868471/work
nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1664684991461/work
networkx==3.1
nose @ file:///home/conda/feedstock_root/build_artifacts/nose_1602434998960/work
notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1691436218243/work
notebook_shim @ file:///home/conda/feedstock_root/build_artifacts/notebook-shim_1682360583588/work
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1691056231492/work
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
overrides @ file:///home/conda/feedstock_root/build_artifacts/overrides_1691338815398/work
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1681337016113/work
pandas @ file:///home/conda/feedstock_root/build_artifacts/pandas_1688740542634/work
pandocfilters @ file:///home/conda/feedstock_root/build_artifacts/pandocfilters_1631603243851/work
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pathos==0.3.1
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1667297516076/work
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
pid==3.0.4
pkgutil_resolve_name @ file:///home/conda/feedstock_root/build_artifacts/pkgutil-resolve-name_1633981968097/work
platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1690813113769/work
plotly @ file:///home/conda/feedstock_root/build_artifacts/plotly_1692220561510/work
pluggy==1.2.0
pox==0.3.3
ppft==1.7.6.7
prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1689032443210/work
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1688565951714/work
protobuf==4.23.4
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1681775027942/work
psycopg2 @ file:///home/conda/feedstock_root/build_artifacts/psycopg2-split_1667025517155/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
py4j==0.10.9.5
pyasn1==0.5.0
pycodestyle==2.10.0
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pydocstyle==6.3.0
pyflakes==3.0.1
pygal==3.0.0
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1691408637400/work
pykerberos @ file:///home/conda/feedstock_root/build_artifacts/pykerberos_1671204518513/work
pylint==2.17.5
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1685514481738/work
PyQt5==5.12.3
PyQt5_sip==4.19.18
PyQtChart==5.12
PyQtWebEngine==5.12.1
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
pyspark==3.3.0
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
python-json-logger @ file:///home/conda/feedstock_root/build_artifacts/python-json-logger_1677079630776/work
python-lsp-jsonrpc==1.0.0
python-lsp-server==1.7.4
pytoolconfig==1.2.5
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1680088766131/work
PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1666772395347/work
pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1666828497229/work
qtconsole @ file:///home/conda/feedstock_root/build_artifacts/qtconsole-base_1683329453903/work
QtPy @ file:///home/conda/feedstock_root/build_artifacts/qtpy_1680148448366/work
referencing @ file:///home/conda/feedstock_root/build_artifacts/referencing_1691337268233/work
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1684774241324/work
requests-kerberos @ file:///home/conda/feedstock_root/build_artifacts/requests-kerberos_1667464887610/work
rfc3339-validator @ file:///home/conda/feedstock_root/build_artifacts/rfc3339-validator_1638811747357/work
rfc3986-validator @ file:///home/conda/feedstock_root/build_artifacts/rfc3986-validator_1598024191506/work
rope==1.9.0
rpds-py @ file:///home/conda/feedstock_root/build_artifacts/rpds-py_1689705060450/work
rsa==4.7.2
s3fs @ file:///home/conda/feedstock_root/build_artifacts/s3fs_1626193591467/work
s3transfer @ file:///home/conda/feedstock_root/build_artifacts/s3transfer_1692149178344/work
sagemaker==2.177.1
sagemaker-experiments==0.1.45
sagemaker-nbi-agent @ file:///opt/sagemaker_nbi_agent
sagemaker-pyspark==1.4.5
schema==0.7.5
Send2Trash @ file:///home/conda/feedstock_root/build_artifacts/send2trash_1682601222253/work
sentencepiece==0.1.99
simpervisor==1.0.0
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
smdebug-rulesconfig==1.0.1
smmap==5.0.0
sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1662051266223/work
snowballstemmer==2.2.0
soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1658207591808/work
sparkmagic @ file:///home/conda/feedstock_root/build_artifacts/sparkmagic_1680849855330/work/sparkmagic
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
sympy==1.12
tblib==1.7.0
tenacity @ file:///home/conda/feedstock_root/build_artifacts/tenacity_1692026804430/work
termcolor==2.3.0
terminado @ file:///home/conda/feedstock_root/build_artifacts/terminado_1670253674810/work
tinycss2 @ file:///home/conda/feedstock_root/build_artifacts/tinycss2_1666100256010/work
tomli==2.0.1
tomlkit==0.12.1
torch==2.0.1
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1684150054582/work
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1675110562325/work
triton==2.0.0
typing-utils @ file:///home/conda/feedstock_root/build_artifacts/typing_utils_1622899189314/work
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1688315532570/work
tzdata @ file:///home/conda/feedstock_root/build_artifacts/python-tzdata_1680081134351/work
ujson==5.8.0
uri-template @ file:///home/conda/feedstock_root/build_artifacts/uri-template_1688655812972/work/dist
urllib3==1.26.14
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1673864653149/work
webcolors @ file:///home/conda/feedstock_root/build_artifacts/webcolors_1679900785843/work
webencodings==0.5.1
websocket-client @ file:///home/conda/feedstock_root/build_artifacts/websocket-client_1687789148259/work
widgetsnbextension @ file:///home/conda/feedstock_root/build_artifacts/widgetsnbextension_1688504439014/work
wrapt @ file:///home/conda/feedstock_root/build_artifacts/wrapt_1677485519705/work
y-py==0.6.0
yarl @ file:///home/conda/feedstock_root/build_artifacts/yarl_1685191749966/work
ypy-websocket==0.8.4
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1689374466814/work

GaganHonor commented 1 year ago

╰(°▽°)╯ohhhhhhhhhhhhh The error message you are getting is:

Got to 86
Got to 92
Got to 94

This means that the code is reaching line 86, 92, and 94 of the example_instructions.py file. These lines are responsible for loading the model checkpoint, initializing the model parallel, and initializing the pipeline.

The reason why the code is hanging at this step is because the model checkpoint is too large to fit on the CPU. The CodeLlama-34b-Instruct model has a size of 1.6GB, which is larger than the 1.8GB of RAM that is available on the CPU.

To fix this error, you need to move the model checkpoint to the GPU. You can do this by running the following command:

cp CodeLlama-34b-Instruct.ckpt /tmp/CodeLlama-34b-Instruct.ckpt

Once you have moved the model checkpoint to the GPU, you need to update the example_instructions.py file to load the model checkpoint from the GPU. You can do this by changing the map_location argument to "cuda".

The updated code should look like this:

checkpoint = torch.load("/tmp/CodeLlama-34b-Instruct.ckpt", map_location="cuda")

Once you have made these changes, you should be able to run the example_instructions.py file without any errors.

Here are some additional things you can try:

Make sure that you are using the correct version of PyTorch. The torchrun command requires PyTorch version 1.8 or higher.
Make sure that the nccl library is installed in the same location as your PyTorch distribution.
Try running the torchrun command with the --use_gloo flag. This will use the Gloo backend instead of NCCL.

If you are still having trouble, you can ask for help here ❇️❇️ I hope this helps!

mmr-crexi commented 1 year ago

Admit it @GaganHonor, you just took my bug report and plugged it into a generative AI for an answer.

What makes me think that?

You got the name of the file wrong where I inserted line numbers, and reused the explanation of where the line numbers were but for that wrong file. The line numbers were in generation.py, not example_instructions.py
The RAM amounts you listed are ludicrously low. g5.12xlarges have more than enough RAM to load the 34b parameter model into RAM, and in fact did so on the preceding day, not "1.8GB of RAM". Looking up the instance type would tell you that, if the amount that I put in the report seemed ludicrous to you.
The very first thing I did when resetting up the environment is to run pip install -e ., which you would know if you had read the report. That would make sure that all dependencies are installed properly, and indeed, those dependencies were installed the first time. Perhaps they do not install a second time?

It very well may be that the torch load should not be to CPU first, but that's what the generation.py script provided in this repo does, and it's worked at least once. I strongly suspect that there's either some kind of unnamed dependency not being installed, or some implicit assumption that the machine running the models will be persisted from run to run, rather than shut down and restarted with a blank slate.

Were you able to reproduce the issue? It may also just be that, for whatever reason, the Sagemaker instance I was using changed in some fundamental way, or that issues like pytorch #99625 were somehow manifesting one day and not the other. If so, I would really like to know what happened and how you were able to actually solve things.

GaganHonor commented 1 year ago

why will I not admit ? I used codellama 34B model , Along with some HF Plugins 💀 My intention was to help you @mmr-crexi

still sorry CallOfDutyGhostGIF

mmr-crexi commented 1 year ago

Well, you've done a great job demonstrating its limitations :)

In all seriousness, I may be dealing with just some heisenbug in my setup, and that would be unfortunate, but I would not be devastated if no one had an answer for my corner case.

GaganHonor commented 1 year ago

well one thing i found is my model is answering far better since first build , It's far better thn Claude or GPT 3.5 turbo currently , I am fixing

mmr-crexi commented 1 year ago

Maybe? But it's still not adding much to the conversation.

These models work well when you can understand what they're saying and change their output into whatever's appropriate for the situation. The response you gave to this bug, for instance, was so off the mark that it seemed like you just cut and pasted the response without thinking about what was actually being said, whether or not it was actually a help. That's just noise, that's not insight.

GaganHonor commented 1 year ago

I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏

meta-llama / codellama

Unable to run this morning, yesterday it ran fine #63