skshetry / webdav4

WebDAV client library with a fsspec based filesystem and a CLI.
https://skshetry.github.io/webdav4
MIT License

when file is large, seek is very slow #155

Open observerss opened 1 year ago

observerss commented 1 year ago

In stream.py, the seek function is:

    def seek(self, offset: int, whence: int = 0) -> int:  # noqa: C901
        """Seek the file object."""
        if whence == 0:
            loc = offset
        elif whence == 1:
            if offset >= 0:
                self.read(offset)
                return self.loc
            loc = self.loc + offset
        elif whence == 2:
            if not self.size:
                raise ValueError("cannot seek to the end of file")
            loc = self.size + offset
        else:
            raise ValueError(f"invalid whence ({whence}, should be 0, 1 or 2)")
        if loc < 0:
            raise ValueError("Seek before start of file")
        if loc and not self.supports_ranges:
            raise ValueError("server does not support ranges")

        self.close()
        self._cm = iter_url(self.client, self.url, pos=loc, chunk_size=self.chunk_size)
        #  pylint: disable=no-member
        _, self._iterator = self._cm.__enter__()
        self.loc = loc
        return loc

When whence == 1 and offset >= 0, seek reads forward through the stream until it reaches the offset:

            if offset >= 0:
                self.read(offset)
                return self.loc
            loc = self.loc + offset

So seeking 1 GB forward reads 1 GB of content first, which is very inefficient. If I comment out the if statement, seek still works: it creates a new iterator and uses the Range header to jump straight to the position.

skshetry commented 1 year ago

I think it was added on the assumption that with SEEK_CUR the offsets are small and might already be cached in our buffer, and because I wanted to avoid resetting the iterator as much as possible (not all WebDAV servers support ranges).

Feel free to propose a PR. 🙂
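
One way to reconcile both points in this thread might be a cutoff: read and discard for small forward hops, and reopen with a ranged request only when the hop is large and the server supports ranges. The sketch below is a toy model, not webdav4's code. `ForwardSeeker`, `seek_forward`, `max_skip`, `reopens`, and `make_opener` are hypothetical names, and the in-memory opener only simulates a ranged GET.

```python
from typing import Callable, Iterator


class ForwardSeeker:
    """Toy reader illustrating a threshold-based relative seek.

    Hypothetical stand-in, not webdav4's actual stream class.
    """

    def __init__(
        self,
        open_at: Callable[[int], Iterator[bytes]],
        supports_ranges: bool = True,
        max_skip: int = 1024 * 1024,
    ) -> None:
        self._open_at = open_at  # opens a chunk iterator at a byte position
        self.supports_ranges = supports_ranges
        self.max_skip = max_skip  # assumed cutoff for read-and-discard
        self.loc = 0
        self.reopens = 0  # counts simulated ranged requests
        self._buffer = b""
        self._iterator = open_at(0)

    def read(self, n: int) -> bytes:
        """Read up to n bytes from the current position."""
        while len(self._buffer) < n:
            chunk = next(self._iterator, None)
            if chunk is None:
                break
            self._buffer += chunk
        data, self._buffer = self._buffer[:n], self._buffer[n:]
        self.loc += len(data)
        return data

    def seek_forward(self, offset: int) -> int:
        """Relative seek (whence == 1) for a non-negative offset."""
        if offset < 0:
            raise ValueError("this sketch only handles forward seeks")
        if offset <= self.max_skip or not self.supports_ranges:
            # Small hop, or no range support: consume bytes from the open stream.
            self.read(offset)
            return self.loc
        # Large hop: drop the current stream and reopen at the target position.
        self.loc += offset
        self._buffer = b""
        self._iterator = self._open_at(self.loc)
        self.reopens += 1
        return self.loc


def make_opener(data: bytes, chunk_size: int = 4) -> Callable[[int], Iterator[bytes]]:
    """Simulate a ranged GET over an in-memory payload."""

    def open_at(pos: int) -> Iterator[bytes]:
        return (data[i : i + chunk_size] for i in range(pos, len(data), chunk_size))

    return open_at
```

With a cutoff like this, servers without range support keep the current read-through behaviour, while a 1 GB hop on a range-capable server costs one new request instead of a 1 GB download. The right value for the cutoff is a tuning question this sketch does not answer.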