rom1504 / embedding-reader

Efficiently read embedding in streaming from any filesystem
MIT License

Slow and incorrect exploration of embedding files with fs.glob() #11

Closed victor-paltz closed 2 years ago

victor-paltz commented 2 years ago

When listing the files that have the requested file_format, this code is inefficient: the glob pattern has no "/" between the folder path and "**", so fsspec explores all the files in the parent folder. It can even match unwanted files outside the requested folder.

glob_pattern = path.rstrip("/") + f"**/*.{file_format}"

Example: to find all the files ending with .npy in /tmp/tmpeejv3hoh, fs.glob("/tmp/tmpeejv3hoh**/*.npy") will explore all the files in /tmp and can even match wrong files such as /tmp/tmpeejv3hoh_2/toto.npy

https://github.com/rom1504/embedding-reader/blob/11d237d2b0ac95423b0477dac438e17d3e05b689/embedding_reader/get_file_list.py#L45
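The sibling-folder match described above can be reproduced with a small sketch. It uses Python's standard-library glob rather than fsspec, so the exact wildcard semantics differ from fs.glob (in the standard library, "**" only recurses when it is a whole path component); it only illustrates why a wildcard glued to the folder name escapes that folder. The file names a.npy and sub/b.npy are made up for the illustration, and the anchored pattern is the standard-library idiom, not necessarily the pattern used in the fix.

```python
import glob
import os
import tempfile


def make_tree(root):
    """Create the wanted folder plus a sibling folder sharing its name prefix."""
    wanted = os.path.join(root, "tmpeejv3hoh")
    os.makedirs(os.path.join(wanted, "sub"))
    os.makedirs(os.path.join(root, "tmpeejv3hoh_2"))
    for rel in ("tmpeejv3hoh/a.npy", "tmpeejv3hoh/sub/b.npy", "tmpeejv3hoh_2/toto.npy"):
        open(os.path.join(root, *rel.split("/")), "w").close()
    return wanted


def list_relative(pattern, root):
    """Run a recursive glob and return sorted paths relative to root."""
    paths = glob.glob(pattern, recursive=True)
    return sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in paths)


file_format = "npy"

with tempfile.TemporaryDirectory() as root:
    wanted = make_tree(root)

    # Same construction as in the issue: no "/" between the folder and "**",
    # so the wildcard extends the folder name itself.
    buggy = glob.escape(wanted.rstrip("/")) + f"**/*.{file_format}"

    # Anchored variant (stdlib idiom): "**" is its own path component,
    # so the search cannot leave the wanted folder.
    anchored = os.path.join(glob.escape(wanted), "**", f"*.{file_format}")

    buggy_matches = list_relative(buggy, root)
    anchored_matches = list_relative(anchored, root)

print("unanchored:", buggy_matches)
print("anchored:  ", anchored_matches)
```

With the unanchored pattern the sibling folder tmpeejv3hoh_2 is matched as well, while the anchored pattern only returns files under tmpeejv3hoh.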

victor-paltz commented 2 years ago

Replacing **/* with /** fixes the issue; a PR is coming.

rom1504 commented 2 years ago

thanks