iterative / scmrepo

SCM wrapper and fsspec filesystem for Git for use in DVC.
https://dvc.org
Apache License 2.0

lfs: optimize path filtering #355

Closed sisp closed 6 months ago

sisp commented 6 months ago

I've optimized LFS path filtering as we discussed in #338.

The first optimization implements the suggestion in https://github.com/iterative/scmrepo/issues/338#issuecomment-2081256097 to short-circuit a single-path include filter. The tests for the regex that extracts the path prefix are here: https://regex101.com/r/wBjHf0/1 Note that the extra \n on regex101.com is only needed to allow one test case per line; it isn't needed in the actual regex.

The second optimization combines the regex patterns derived from the Unix filename patterns into a single pre-compiled regex and matches each path against it, which is faster than matching against each Unix filename pattern individually. It also replaces intermediate list materialization with a streaming filter.
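For readers skimming the description, here is a minimal sketch of the idea behind the second optimization. It is not the code in this PR. The names `compile_patterns` and `filter_paths` are hypothetical, and it assumes the Unix filename patterns are translated with the standard library's `fnmatch.translate`:

```python
from __future__ import annotations

import fnmatch
import re
from typing import Iterable, Iterator


def compile_patterns(patterns: Iterable[str]) -> re.Pattern[str]:
    """Combine Unix filename patterns into one compiled regex (hypothetical helper)."""
    # fnmatch.translate converts each pattern into a regex string; wrapping
    # each one in a non-capturing group keeps the alternation unambiguous.
    return re.compile("|".join(f"(?:{fnmatch.translate(p)})" for p in patterns))


def filter_paths(paths: Iterable[str], include: Iterable[str]) -> Iterator[str]:
    """Lazily yield the paths matching any include pattern (hypothetical helper)."""
    matcher = compile_patterns(include)
    for path in paths:
        # One regex match per path instead of one fnmatch call per pattern.
        if matcher.match(path):
            yield path


# Usage: the result is a generator, so no intermediate list is built.
matched = filter_paths(["data/a.bin", "src/main.py", "img/logo.png"], ["*.bin", "img/*"])
print(list(matched))
```

Because `filter_paths` is a generator function, callers can chain it with other iterators and only pay for the paths they actually consume.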

As the two commits implement independent optimizations, I intend this PR to be rebase-merged without commit squashing.

Partially fixes #338.