The has_objects implementation uses list_objects to retrieve all existing objects and compares them against the keys whose existence needs to be checked. The problem is that listing objects is typically a very expensive operation for object stores, let alone listing every key present in the storage.
The has_objects method is called by the AbstractRepositoryBackend.delete_objects method and, indirectly, by the AbstractRepositoryBackend.delete_object method. We should investigate whether we can avoid using list_objects in has_objects. One approach would be to issue a HEAD request for each object, which returns the metadata of an object without retrieving the object itself. This would probably be more efficient if there are few keys to check. But there should be a cross-over point: if the number of keys passed to has_objects is large enough, the cost of the requests that have to be made (one per key) would exceed the cost of list_objects.
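A minimal sketch of the HEAD-per-key idea, assuming a client object that exposes a boto3-style head_object(Bucket=..., Key=...) call and raises an exception carrying a response dict with an error code when the key is missing. FakeClient, FakeNotFound and has_objects_via_head are hypothetical names used only for illustration; they are not part of the repository backend.

```python
from typing import Iterable, List


class FakeNotFound(Exception):
    """Stand-in for a client error that carries a ``response`` dict."""

    def __init__(self) -> None:
        super().__init__("Not Found")
        self.response = {"Error": {"Code": "404", "Message": "Not Found"}}


class FakeClient:
    """Hypothetical in-memory client that mimics ``head_object`` for the demo."""

    def __init__(self, existing: Iterable[str]) -> None:
        self._existing = set(existing)
        self.requests = 0  # count requests to make the per-key cost visible

    def head_object(self, Bucket: str, Key: str) -> dict:
        self.requests += 1
        if Key not in self._existing:
            raise FakeNotFound()
        return {"ContentLength": 0}


def has_objects_via_head(client, bucket: str, keys: Iterable[str]) -> List[bool]:
    """Return one boolean per key, issuing one HEAD request per key."""
    result = []
    for key in keys:
        try:
            client.head_object(Bucket=bucket, Key=key)
        except Exception as exc:
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            if code not in ("404", "NoSuchKey", "NotFound"):
                raise  # permission or network errors must not read as "missing"
            result.append(False)
        else:
            result.append(True)
    return result


client = FakeClient(existing=["a", "b"])
flags = has_objects_via_head(client, "bucket", ["a", "missing", "b"])
```

The request counter shows the trade-off directly: the number of round trips grows linearly with the number of keys, independent of how many objects the storage holds.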
The trouble is that the best solution therefore probably depends not just on the number of objects in the repository, but also on the number of keys whose existence needs to be checked.
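One way to act on this could be to switch strategies based on the number of keys. The sketch below is only an illustration: HEAD_THRESHOLD is a made-up value, and head_one and list_all are hypothetical callables standing in for a per-key HEAD check and a full listing. It only looks at the number of keys, because the number of objects in the repository is not known without listing them.

```python
from typing import Callable, Iterable, List, Sequence

HEAD_THRESHOLD = 100  # hypothetical cut-off; the real cross-over has to be measured


def has_objects(
    keys: Sequence[str],
    head_one: Callable[[str], bool],
    list_all: Callable[[], Iterable[str]],
    threshold: int = HEAD_THRESHOLD,
) -> List[bool]:
    """Check key existence with per-key requests or with one full listing."""
    if len(keys) <= threshold:
        # Few keys: one request per key is assumed cheaper than a full listing.
        return [head_one(key) for key in keys]
    # Many keys: list once and answer every lookup from the in-memory set.
    existing = set(list_all())
    return [key in existing for key in keys]


# Demo with an in-memory set standing in for the object store.
stored = {"a", "b", "c"}
calls = {"head": 0, "list": 0}


def head_one(key: str) -> bool:
    calls["head"] += 1
    return key in stored


def list_all() -> Iterable[str]:
    calls["list"] += 1
    return iter(stored)


few = has_objects(["a", "x"], head_one, list_all, threshold=2)
many = has_objects(["a", "b", "x"], head_one, list_all, threshold=2)
```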
Alternatively, since the method is now only used directly by the delete object methods, maybe these can change their implementation to not explicitly check before deleting, but simply delete and catch errors for non-existing keys. The boto3.delete_objects method accepts a batch of keys and reports per-key results, which would support this as long as non-existing keys are actually reported as errors. This needs to be verified, since Amazon S3 documents that a key that is not found is returned as deleted. The other problem is that the AbstractRepositoryBackend.delete_objects method is currently implemented such that no files are deleted if any one of the provided keys does not exist. It is not clear whether this behavior can be changed to simply delete the keys that exist and log a message or raise for the keys that do not.
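The delete-without-checking variant could look roughly like the sketch below. It uses a plain dict as a hypothetical stand-in for the storage and does not reflect the actual AbstractRepositoryBackend API or how any particular object store reports missing keys; delete_existing is an illustrative name.

```python
from typing import Dict, Iterable, List


def delete_existing(store: Dict[str, bytes], keys: Iterable[str]) -> List[str]:
    """Delete every key that exists and return the keys that were missing."""
    missing = []
    for key in keys:
        try:
            del store[key]
        except KeyError:
            # No up-front existence check: the failed delete itself tells us
            # that the key was not there.
            missing.append(key)
    return missing


store = {"a": b"1", "b": b"2"}
missing = delete_existing(store, ["a", "nope"])
```

The caller can then decide whether to log the returned keys or raise, which is exactly the behavioral question left open above.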