terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0

Provide alternative to shared file system cache for Teraslice asset loading - S3 #3561

Closed godber closed 3 months ago

godber commented 6 months ago

Right now, workers rely on the master to have unzipped an asset into a shared network drive. Without this shared disk cache, each worker must retrieve and unzip its own asset to local disk. This causes excessive load on the ES state cluster used by the teraslice master.

There might be multiple ways to achieve this. I will list options below.

Note: I use "workers" above to mean both execution controllers and workers.

@kstaken @jsnoble

godber commented 6 months ago

This is pretty open-ended, and I think a review of the code and perhaps a brief discussion with @jsnoble would help eliminate or consider the second option, "In master memory asset cache". If that doesn't turn out to be an obvious fix, then the "S3 based asset storage model" option is the next choice. The extra details there are notional and all up for discussion.

busma13 commented 6 months ago

In terafoundation, "connector" seems to refer to the service and "connection" refers to the name of an endpoint for a service. I think it will be clearer to use asset-storage-connection instead of asset-storage-connector.

busma13 commented 6 months ago

While setting up error handling for the new S3Store class I came across this possible issue: #3565

busma13 commented 6 months ago

@kstaken suggested that we make this feature more generic, allowing for assets to be stored in other locations in the future. With that in mind, it would make sense to specify both the connector and the connection. Until now we were assuming S3 as the connector whenever there was an asset-storage-connection field. asset-storage-connector will default to ES, will now work for S3, and could be expanded to use some other storage.
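
As a rough sketch of the connector/connection split described above: the helper below picks a storage backend from a config object. The function name `resolveAssetStorage`, the literal values `'elasticsearch'` and `'s3'`, and the fallback connection name `'default'` are assumptions for illustration, not the actual terafoundation schema. Only the two field names come from this thread.

```javascript
// Hypothetical helper, not Teraslice's real implementation.
// Field names follow this thread; the shipped config keys may differ.
const SUPPORTED_CONNECTORS = ['elasticsearch', 's3'];

function resolveAssetStorage(config) {
    // Default to ES when no connector is given, as described in the comment above.
    const connector = config['asset-storage-connector'] ?? 'elasticsearch';
    // 'default' as the fallback connection name is an assumption.
    const connection = config['asset-storage-connection'] ?? 'default';

    if (!SUPPORTED_CONNECTORS.includes(connector)) {
        throw new Error(`Unsupported asset storage connector: ${connector}`);
    }
    return { connector, connection };
}
```

Keeping the supported connectors in a list leaves room for "some other storage" later without changing the lookup logic.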

sotojn commented 6 months ago

When handling a GET request to the txt/assets API endpoint, we are thinking about adding an extra column called external_storage that indicates whether an S3 object is associated with the asset's ES metadata.

Before:

name           version  id                                        _created                  description                     node_version  platform  arch
-------------  -------  ----------------------------------------  ------------------------  ------------------------------  ------------  --------  ----
standard       0.22.3   85e2f713c615c74b70f4ebe12a1735e619832e52  2024-03-11T16:11:29.153Z  Teraslice standard processor a  18                          
kafka          3.5.2    4fac0e2bfefd18cb1915fc3aae8cc729b16c4533  2024-03-11T16:11:29.003Z  Kafka reader and writer suppor  18                          
elasticsearch  3.5.4    bf127b4c744c9d34284b026962c2b7b81f5f8e9d  2024-03-11T16:11:28.841Z                                  18                          

After:

name           version  id                                        _created                  description                     node_version  external_storage  platform  arch
-------------  -------  ----------------------------------------  ------------------------  ------------------------------  ------------  ----------------  --------  ----
standard       0.22.3   85e2f713c615c74b70f4ebe12a1735e619832e52  2024-03-11T16:59:11.625Z  Teraslice standard processor a  18            available                       
kafka          3.5.2    4fac0e2bfefd18cb1915fc3aae8cc729b16c4533  2024-03-11T16:59:11.503Z  Kafka reader and writer suppor  18            available                       
elasticsearch  3.5.4    bf127b4c744c9d34284b026962c2b7b81f5f8e9d  2024-03-11T16:59:11.438Z                                  18            available                       
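
One way the external_storage column could be derived, as a sketch only: given the asset records from ES and the set of asset IDs that have an object in S3, tag each record. The function name `addExternalStorageColumn` and the input shapes are hypothetical. The thread only shows the value `available`, so the empty string used for the other case is a placeholder assumption.

```javascript
// Hypothetical sketch: annotate ES asset records with an external_storage field.
// esRecords: array of objects that each have an `id` (the asset hash).
// s3AssetIds: Set of asset IDs known to have an object in S3.
function addExternalStorageColumn(esRecords, s3AssetIds) {
    return esRecords.map((record) => ({
        ...record,
        // 'available' comes from the thread; '' for a missing object is an assumption.
        external_storage: s3AssetIds.has(record.id) ? 'available' : ''
    }));
}
```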

busma13 commented 6 months ago

ref: #3563

godber commented 6 months ago

After some discussion we have decided it is OK to proceed without dealing with asset components in S3 that are not present in ES. We will file an issue on this. This gets tricky when the number of assets gets large, because reconciling the two lists becomes messier.
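
For illustration only, since this work was deferred: reconciling the two lists amounts to a set difference. The sketch below assumes S3 keys have the form `<asset id>.zip` and that both lists fit in memory, which is exactly what stops holding as the number of assets grows. `findOrphanedS3Keys` is a hypothetical name.

```javascript
// Hypothetical sketch: find S3 object keys with no matching asset record in ES.
// s3Keys: array of object keys, assumed to look like '<asset id>.zip'.
// esAssetIds: array of asset IDs present in ES.
function findOrphanedS3Keys(s3Keys, esAssetIds) {
    const known = new Set(esAssetIds);
    return s3Keys.filter((key) => {
        // Strip the '.zip' suffix to recover the asset ID.
        const id = key.endsWith('.zip') ? key.slice(0, -4) : key;
        return !known.has(id);
    });
}
```

With large buckets, both listings would need to be paginated, and that is where the reconciliation gets messy.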

godber commented 6 months ago

We also discussed the possibility of renaming the objects that get stored in S3 to include the source zipfile name as a convenience for the S3 operators. So the object s3://bucket/7c308569d43dd642ef41106c355d713136657534.zip would become s3://bucket/7c308569d43dd642ef41106c355d713136657534/standard-v0.22.3-node-16-bundle.zip. But we ultimately decided that this might complicate things further, and that it could actually be undesirable to "leak" information to the storage layer like that.

busma13 commented 5 months ago

If asset_storage_bucket is not specified in terafoundation, it will default to tera-assets. Defaulting to ts-assets-<TERASLICE NAME> would throw an error if <TERASLICE NAME> contained characters that are invalid in bucket names, such as underscores. We could add a function to strip out invalid characters, but that could get complicated.

EDIT: We are using ts-assets-<TERASLICE NAME> by default and changing any underscores in the name to dashes.
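
A minimal sketch of the default described in the EDIT above, assuming the cluster name is passed in as a string. The function name `resolveAssetBucket` is hypothetical. Only underscores are rewritten, so a name containing other characters that S3 rejects would still fail.

```javascript
// Hypothetical sketch of the bucket-name default described above.
// configuredBucket: value of asset_storage_bucket, if any.
// teraslicename: the Teraslice cluster name.
function resolveAssetBucket(configuredBucket, teraslicename) {
    if (configuredBucket) {
        return configuredBucket;
    }
    // Underscores are not valid in S3 bucket names, so swap them for dashes.
    return `ts-assets-${teraslicename}`.replace(/_/g, '-');
}
```

For example, a cluster named `my_cluster` with no bucket configured would resolve to `ts-assets-my-cluster`.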

godber commented 4 months ago

I'll close this once all of the pieces are in place and we've used it a bit more. So far it has looked good.

godber commented 3 months ago

This has been working well and v1.4.0 has been rolled out in many places with S3 assets enabled.