* guard downloads of the `HIGGS.csv` file with `if` statements so you don't have to wait to re-download the data if it already exists locally
  * helpful, because downloading the data can take 10+ minutes
* ~put a floor of `sagemaker>=2.198` in the lifecycle configuration script used to create the `rapids` JupyterLab kernel for the SageMaker notebook instance~
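A minimal sketch of the download guard from the first item, not the exact code in the docs. The helper name `download_if_missing`, the `fetch` parameter, and the `HIGGS_URL` value are assumptions made for illustration.

```python
import os
import urllib.request

# Assumed source URL for the dataset; substitute whatever the notebook actually uses.
HIGGS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"


def download_if_missing(url, local_path, fetch=urllib.request.urlretrieve):
    """Download `url` to `local_path` unless that file already exists.

    Returns True if a download happened, False if it was skipped.
    `fetch` is a hypothetical hook so the download step can be swapped out.
    """
    if os.path.exists(local_path):
        print(f"{local_path} already exists, skipping download")
        return False
    fetch(url, local_path)
    return True
```

In a notebook cell this could be called as `download_if_missing(HIGGS_URL, "HIGGS.csv.gz")`, so re-running the cell skips the 10+ minute download once the file is on disk.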
Details about why I'd added that `sagemaker` floor at first:
Without that floor on `sagemaker`, even just running `import sagemaker` at the top of a notebook fails like this:
```text
TypeError Traceback (most recent call last)
Cell In[1], line 4
1 import time
3 import boto3
----> 4 import sagemaker
File ~/anaconda3/envs/rapids/lib/python3.11/site-packages/sagemaker/__init__.py:18
14 from __future__ import absolute_import
16 import importlib_metadata
---> 18 from sagemaker import estimator, parameter, tuner # noqa: F401
19 from sagemaker.amazon.kmeans import KMeans, KMeansModel, KMeansPredictor # noqa: F401
20 from sagemaker.amazon.pca import PCA, PCAModel, PCAPredictor # noqa: F401
...
File ~/anaconda3/envs/rapids/lib/python3.11/site-packages/google/protobuf/descriptor.py:621, in FieldDescriptor.__new__(cls, name, full_name, index, number, type, cpp_type, label, default_value, message_type, enum_type, containing_type, is_extension, extension_scope, options, serialized_options, has_default_value, containing_oneof, json_name, file, create_key)
615 def __new__(cls, name, full_name, index, number, type, cpp_type, label,
616 default_value, message_type, enum_type, containing_type,
617 is_extension, extension_scope, options=None,
618 serialized_options=None,
619 has_default_value=True, containing_oneof=None, json_name=None,
620 file=None, create_key=None): # pylint: disable=redefined-builtin
--> 621 _message.Message._CheckCalledFromGeneratedFile()
622 if is_extension:
623 return _message.default_pool.FindExtensionByName(full_name)
TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
```
Conda was choosing a mix of a very old `sagemaker` and a very new `protobuf`:
```text
- protobuf=5.27.5=py311hfdbb021_0
- protobuf3-to-dict=0.1.5=py311h38be061_9
...
- sagemaker=2.75.1=pyhd8ed1ab_0
```
Forcing installation of a newer `sagemaker` (that controls its `protobuf` dependency better) seems to help.
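To confirm which versions ended up in the kernel without triggering the failing `import sagemaker`, something like the following sketch may help. It reads package metadata only; the helper names are hypothetical, and the simple release parser assumes plain dotted numeric versions such as `2.75.1`.

```python
from importlib import metadata


def parse_release(version):
    """Turn '2.75.1' into (2, 75, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_floor(installed, floor):
    """True if the installed release is at least the floor release."""
    a, b = parse_release(installed), parse_release(floor)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so '2.198' and '2.198.0' compare equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what the environment resolved to, without importing either package.
for name in ("sagemaker", "protobuf"):
    print(name, installed_version(name))
```

With the versions from the listing above, `meets_floor("2.75.1", "2.198")` is `False`, which matches the environment where the import failed.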
Some related discussion of this issue:
* https://github.com/aws/amazon-sagemaker-examples/issues/4387
* https://stackoverflow.com/questions/72441758/typeerror-descriptors-cannot-not-be-created-directly
Proposes the updates listed above to https://docs.rapids.ai/deployment/nightly/cloud/aws/sagemaker/

Notes for Reviewers

I'd originally also added a floor on the `sagemaker` Python package here, but reverted that per https://github.com/rapidsai/deployment/pull/446#discussion_r1790589446.