treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

Batch DBIO markers in listObjects #7863

Open arielshaqed opened 5 months ago

arielshaqed commented 5 months ago

DBIO (Databricks) performs many getFileStatus calls over lakeFSFS. Each of these calls looks for an object or directory marker named _started_* or _committed_*. Looking for a marker starts with a getObject; if the object is not found, we then call listObjects(..., 1) on the same path to discover whether it is a directory marker.[^1]

We already optionally batch the getObject calls for these markers. Proposal: optionally (controlled by the same config parameters) also batch the listObjects calls for these markers when amount==1.

[^1]: Short story: Hadoop FileSystems need getFileStatus to return a lot of information, including whether that file is a directory. That's actually hard to do correctly, and lakeFSFS performs multiple operations for each getFileStatus.
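
For illustration only, here is a minimal Java sketch of one way such batching could work, assuming "batch" means that concurrent identical requests share a single backend call. The issue does not say how batching is implemented, so `RequestCoalescer`, `isDbioMarker`, and the string key format are hypothetical names, not lakeFS APIs. Only the `_started_` and `_committed_` prefixes come from the description above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of lakeFS): callers that ask for the same key
 * while a request is in flight wait for that request instead of issuing their own.
 */
final class RequestCoalescer<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    /** True if the last path component looks like a DBIO marker name. */
    static boolean isDbioMarker(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.startsWith("_started_") || name.startsWith("_committed_");
    }

    /** Runs backendCall at most once per key among concurrent callers. */
    V get(K key, Supplier<V> backendCall) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Follower: reuse the leader's result. A leader failure surfaces
            // here as a CompletionException wrapping the original exception.
            return existing.join();
        }
        try {
            // Leader: perform the real call and publish the result to followers.
            V value = backendCall.get();
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);
            throw e;
        } finally {
            // Results are not cached: the next caller after completion
            // triggers a fresh backend call.
            inFlight.remove(key, mine);
        }
    }
}
```

Under these assumptions, a caller would route only marker lookups with amount==1 through the coalescer, for example with a key such as `"list:" + path + ":1"`, and send every other listObjects call straight to the backend. Removing the entry once the call completes keeps the sketch free of caching, so results never go stale after the shared request finishes.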

offirc2 commented 4 months ago

@itaiad200 @arielshaqed Any update on this?