justindujardin / pathy

simple, flexible, offline capable, cloud storage with a Python path-like interface
Apache License 2.0

Feature/ls #48

Closed justindujardin closed 3 years ago

justindujardin commented 3 years ago

Problem

Listing a large number of blobs together with their modified time and size attributes is slow. On remote systems like GCS, the stat operation requires extra requests, so calling it once per blob adds up quickly. Consider the following code, which enumerates large numbers of files and their stats quickly on local systems but is slow for remote GCS buckets.

from pathy import Pathy

files = Pathy("gs://my_bucket/images")
for file in files.iterdir():
    stat = file.stat()
    print(f"{file.name}, {stat.size}, {stat.last_modified}")

The trouble is that we have to make requests to get the file listings, and then extra requests for each blob to get the stat information. If you have hundreds or thousands of blobs this gets quite slow.

Solution

Add a helper method, ls, that is not part of the standard pathlib.Path interface. It yields blobs along with their size and modified-time stats in a single pass, and ends up being much quicker than the example above when dealing with remote storage. The example above is better written as:

from pathy import Pathy

files = Pathy("gs://my_bucket/images")
for file in files.ls():
    print(f"{file.name}, {file.size}, {file.last_modified}")
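To make the cost difference concrete, here is a minimal sketch that counts simulated requests for both approaches. FakeBucket, FakeStat, list_names, stat, and list_with_stats are hypothetical stand-ins invented for this illustration; they are not pathy or GCS client APIs. The only assumption is that a listing response can carry size and modified time alongside each blob name.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class FakeStat:
    """Hypothetical stat record: size in bytes and a modified timestamp."""

    size: int
    last_modified: int


class FakeBucket:
    """Hypothetical in-memory bucket that counts simulated network requests."""

    def __init__(self, blobs: Dict[str, FakeStat]) -> None:
        self._blobs = blobs
        self.requests = 0

    def list_names(self) -> List[str]:
        # One request returns only the blob names.
        self.requests += 1
        return sorted(self._blobs)

    def stat(self, name: str) -> FakeStat:
        # One additional request per blob.
        self.requests += 1
        return self._blobs[name]

    def list_with_stats(self) -> Iterator[Tuple[str, FakeStat]]:
        # One request returns names and stats together.
        self.requests += 1
        for name in sorted(self._blobs):
            yield name, self._blobs[name]


def count_requests(blob_count: int) -> Tuple[int, int]:
    """Return (requests for iterdir + stat, requests for a single-pass listing)."""
    blobs = {
        f"img_{i}.png": FakeStat(size=i, last_modified=0) for i in range(blob_count)
    }

    per_blob = FakeBucket(blobs)
    for name in per_blob.list_names():
        per_blob.stat(name)

    single_pass = FakeBucket(blobs)
    for _name, _stat in single_pass.list_with_stats():
        pass

    return per_blob.requests, single_pass.requests
```

In this model, the first pattern costs one request plus one per blob, while the single-pass listing costs one request regardless of blob count. A real listing is typically paginated, so the request count would grow with the number of pages rather than stay at one.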

Changes

Add an ls method to Pathy objects. To provide a consistent API, Pathy.fluid now returns a pathy.BasePath object when dealing with local files. This class is a light wrapper around pathlib.Path that adds pathy-specific methods like ls.
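As a rough illustration of the wrapper idea, the sketch below subclasses the platform's concrete pathlib.Path type and adds an ls method for local directories. LocalPath and ListedEntry are hypothetical names chosen for this example, and the body is not pathy's actual BasePath implementation. It relies on os.scandir, which reads directory entries and their stat data in one pass.

```python
import os
import pathlib
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ListedEntry:
    """Hypothetical record for one listed entry (not pathy's own type)."""

    name: str
    size: int
    last_modified: int


class LocalPath(type(pathlib.Path())):
    """Hypothetical thin wrapper over the concrete pathlib.Path class."""

    def ls(self) -> Iterator[ListedEntry]:
        # Yield every entry in this directory with its size and modified time.
        with os.scandir(self) as entries:
            for entry in entries:
                info = entry.stat()
                yield ListedEntry(entry.name, info.st_size, int(info.st_mtime))
```

Because LocalPath is still a pathlib.Path, existing calls such as exists() or read_text() keep working, and ls is available on both local and remote paths under the same name.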

Add a long-format -l flag to the CLI's ls command that prints each blob's size and updated time next to its name.
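The long-format output could be sketched as follows. This is an illustration only and does not describe pathy's CLI implementation: argparse, the ls-sketch program name, the column layout, and the format_listing and parse_args helpers are all assumptions made for this example. Only the -l flag itself comes from the change described above.

```python
import argparse
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple

# (name, size in bytes, modified time as Unix seconds); a hypothetical shape.
Entry = Tuple[str, int, float]


def format_listing(entries: Iterable[Entry], long: bool = False) -> List[str]:
    """Return one output line per entry; add size and time when long is set."""
    lines = []
    for name, size, modified in entries:
        if long:
            stamp = datetime.fromtimestamp(modified, tz=timezone.utc)
            text = stamp.strftime("%Y-%m-%d %H:%M:%S")
            lines.append(f"{size:>10} {text} {name}")
        else:
            lines.append(name)
    return lines


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a path and an optional -l flag (hypothetical command layout)."""
    parser = argparse.ArgumentParser(prog="ls-sketch")
    parser.add_argument("path")
    parser.add_argument(
        "-l", dest="long", action="store_true", help="print size and updated time"
    )
    return parser.parse_args(argv)
```

Keeping the formatting in a plain function makes the long and short outputs easy to test without invoking the command line.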

codecov[bot] commented 3 years ago

Codecov Report

Merging #48 (7dd0be4) into develop (d6ad724) will increase coverage by 0.27%. The diff coverage is 100.00%.


@@             Coverage Diff             @@
##           develop      #48      +/-   ##
===========================================
+ Coverage    93.52%   93.80%   +0.27%     
===========================================
  Files           12       12              
  Lines         1777     1856      +79     
===========================================
+ Hits          1662     1741      +79     
  Misses         115      115              
Impacted Files              Coverage Δ
pathy/__init__.py           100.00% <ø> (ø)
pathy/base.py               91.61% <100.00%> (+0.36%) :arrow_up:
pathy/cli.py                91.30% <100.00%> (+1.43%) :arrow_up:
pathy/file.py               88.38% <100.00%> (ø)
pathy/tests/test_base.py    98.39% <100.00%> (+0.03%) :arrow_up:
pathy/tests/test_cli.py     100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d6ad724...7dd0be4.

justindujardin commented 3 years ago

:tada: This PR is included in version 0.3.6 :tada:

The release is available as a GitHub release.

Your semantic-release bot :package::rocket: