Closed godber closed 5 months ago
This is pretty open-ended, and I think a review of the code and perhaps a brief discussion with @jsnoble would help eliminate or consider the second option, "In master memory asset cache". If that doesn't turn out to be an obvious fix, then the "S3 based asset storage model" option is the next choice. The extra details there are notional and all up for discussion.
In terafoundation, `connector` seems to refer to the service and `connection` refers to the name of an endpoint for a service. I think it will be clearer to use `asset-storage-connection` instead of `asset-storage-connector`.
While setting up error handling for the new `S3Store` class I came across this possible issue: #3565
@kstaken suggested that we make this feature more generic, allowing for assets to be stored in other locations in the future. With that in mind, it would make sense to specify both the connector and the connection. We were assuming S3 as the connector if there is an `asset-storage-connection` field. `asset-storage-connector` will default to ES, will now work for S3, and could be expanded to use some other storage.
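To make the connector/connection split concrete, here is a minimal sketch of how the two settings could be resolved. It is not the actual Teraslice implementation: the interface, the function name `resolveAssetStorage`, the literal values `'elasticsearch'` and `'s3'`, and the `'default'` connection name are all assumptions for illustration. Only the "defaults to ES" behavior comes from the discussion above.

```typescript
// Hypothetical shape of the relevant terafoundation settings.
interface AssetStorageConfig {
    asset_storage_connector?: string;
    asset_storage_connection?: string;
}

interface ResolvedAssetStorage {
    connector: string;
    connection: string;
}

// Assumed set of supported connectors; other storage could be added later.
const SUPPORTED_CONNECTORS = ['elasticsearch', 's3'];

function resolveAssetStorage(config: AssetStorageConfig): ResolvedAssetStorage {
    // The connector defaults to ES when it is not specified.
    const connector = config.asset_storage_connector ?? 'elasticsearch';
    if (SUPPORTED_CONNECTORS.indexOf(connector) === -1) {
        throw new Error(`Unsupported asset storage connector: ${connector}`);
    }
    // 'default' as the fallback connection name is an assumption.
    const connection = config.asset_storage_connection ?? 'default';
    return { connector, connection };
}
```

With this shape, an empty config keeps today's ES behavior, and S3 is opted into explicitly rather than inferred from the presence of a connection field.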
When doing a GET request to the `txt/assets` API endpoint, we are thinking about adding an extra column called `external_storage` that indicates whether there is an S3 object associated with the ES asset metadata.
Before:
name          version id                                       _created                 description                    node_version platform arch
------------- ------- ---------------------------------------- ------------------------ ------------------------------ ------------ -------- ----
standard      0.22.3  85e2f713c615c74b70f4ebe12a1735e619832e52 2024-03-11T16:11:29.153Z Teraslice standard processor a 18
kafka         3.5.2   4fac0e2bfefd18cb1915fc3aae8cc729b16c4533 2024-03-11T16:11:29.003Z Kafka reader and writer suppor 18
elasticsearch 3.5.4   bf127b4c744c9d34284b026962c2b7b81f5f8e9d 2024-03-11T16:11:28.841Z                                18
After:
name          version id                                       _created                 description                    node_version external_storage platform arch
------------- ------- ---------------------------------------- ------------------------ ------------------------------ ------------ ---------------- -------- ----
standard      0.22.3  85e2f713c615c74b70f4ebe12a1735e619832e52 2024-03-11T16:59:11.625Z Teraslice standard processor a 18           available
kafka         3.5.2   4fac0e2bfefd18cb1915fc3aae8cc729b16c4533 2024-03-11T16:59:11.503Z Kafka reader and writer suppor 18           available
elasticsearch 3.5.4   bf127b4c744c9d34284b026962c2b7b81f5f8e9d 2024-03-11T16:59:11.438Z                                18           available
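As a rough illustration of how the `external_storage` column could be populated, the sketch below marks each ES asset record according to whether a matching `<id>.zip` key exists in a listing of the bucket. The record shape, the function name `withExternalStorage`, and the `'missing'` value are assumptions. The thread only shows the `available` value and the `<id>.zip` object naming.

```typescript
// Hypothetical subset of an ES asset metadata record.
interface AssetRecord {
    id: string;
    name: string;
    version: string;
}

// 'available' appears in the table above; 'missing' is an assumed counterpart.
type ExternalStorage = 'available' | 'missing';

type AssetRow = AssetRecord & { external_storage: ExternalStorage };

function withExternalStorage(records: AssetRecord[], s3Keys: string[]): AssetRow[] {
    // A Set gives constant-time lookups per record.
    const keys = new Set(s3Keys);
    return records.map((record) => {
        // Objects are assumed to be stored as "<asset id>.zip".
        const status: ExternalStorage = keys.has(`${record.id}.zip`) ? 'available' : 'missing';
        return { ...record, external_storage: status };
    });
}
```

Listing the bucket once and checking membership locally avoids one S3 request per asset row.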
ref: #3563
After some discussion we have decided it is OK to proceed without dealing with asset components in S3 that are not present in ES. We will file an issue on this. This gets tricky when the number of assets gets large, because reconciling the two lists becomes messier.
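For reference when that follow-up issue is picked up, the reconciliation amounts to a set difference between the two lists. The sketch below is only an illustration: the function name `findOrphanedObjects` is hypothetical, and it assumes both lists fit in memory and that objects are keyed as `<id>.zip`.

```typescript
// Return S3 keys that have no matching asset record in ES.
function findOrphanedObjects(esAssetIds: string[], s3Keys: string[]): string[] {
    // Build the set of keys that ES knows about (assumed "<asset id>.zip" naming).
    const known = new Set(esAssetIds.map((id) => `${id}.zip`));
    return s3Keys.filter((key) => !known.has(key));
}
```

With very large asset counts both listings would likely need to be paged, which is part of what makes the real problem messier than this sketch.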
We also discussed the possibility of renaming the objects that get stored in S3 to include the source zipfile name as a convenience for the S3 operators. So the object `s3://bucket/7c308569d43dd642ef41106c355d713136657534.zip` would become `s3://bucket/7c308569d43dd642ef41106c355d713136657534/standard-v0.22.3-node-16-bundle.zip`. But we ultimately decided that this might complicate things further, and that it could actually be undesirable to "leak" information to the storage layer like that.
If `asset_storage_bucket` is not specified in terafoundation it will default to `tera-assets`. Defaulting to `ts-assets-<TERASLICE NAME>` would throw an error if `<TERASLICE NAME>` had invalid bucket name characters like underscores. We could add a function to strip out invalid characters, but that could get complicated.
EDIT: We are using `ts-assets-<TERASLICE NAME>` by default and changing any underscores in the name to dashes.
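The default described in the EDIT can be sketched as follows. The function names are hypothetical and this is not the code that shipped. It handles only underscores, as described above, so other invalid bucket characters (uppercase letters, for example) are deliberately left alone.

```typescript
// Hypothetical helper: derive the default bucket from the Teraslice cluster name.
function defaultAssetBucket(terasliceName: string): string {
    // Underscores are not valid in S3 bucket names, so swap them for dashes.
    return `ts-assets-${terasliceName.replace(/_/g, '-')}`;
}

// Hypothetical helper: an explicitly configured asset_storage_bucket wins over the default.
function resolveAssetBucket(terasliceName: string, configuredBucket?: string): string {
    return configuredBucket ?? defaultAssetBucket(terasliceName);
}
```

For example, a cluster named `my_ts_cluster` would get the bucket `ts-assets-my-ts-cluster` unless `asset_storage_bucket` is set.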
I'll close this once all of the pieces are in place and we've used it a bit more. So far it has looked good.
This has been working well and v1.4.0 has been rolled out in many places with S3 assets enabled.
Right now, workers rely on the master to have unzipped an asset into a shared network drive. Without this shared disk cache, each worker must retrieve and unzip its own asset to local disk. This causes excessive load on the ES state cluster used by the teraslice master.
There may be multiple ways to achieve this. I will list the options below.
- `asset_storage_connector` - optional, but required if using S3 asset storage, where the operators specify which S3 connector stores the assets
- `asset_storage_bucket` - optional, default `ts-assets-<TERASLICE NAME>/`
Note: I use "workers" above to mean both execution controllers and workers.
@kstaken @jsnoble