Closed cgjosephlee closed 5 years ago
Hi @cgjosephlee -- good catch; we'll get in symbolic link support.
Regarding os.walk
, what version of Python are you using? I see some effort was put into improving this in 3.5, though I don't know how it compares to find
:
Fixed symbolic links in 1.1.1.
Hi @fbrennen,
Thank you for clarifying the improvement in os.walk()
. I'm using 3.6.5 and I notice that the bottleneck was IO of our filesystem.
I have another coding style which is about 40% faster with 4.5 millions files, since os.walk()
already returns filenames.
import sys, os
from glob import glob
import cProfile
def func_walk(path):
file_list = [y for x in os.walk(path, followlinks=True) for y in glob(os.path.join(x[0], '*.fast5'))]
print(len(file_list))
def func_walk2(path):
file_list = [os.path.join(x[0], y) for x in os.walk(path, followlinks=True) for y in x[2] if y.endswith('.fast5')]
print(len(file_list))
d = sys.argv[1]
# func_walk(d)
# func_walk2(d)
cProfile.run('func_walk("{}")'.format(d))
cProfile.run('func_walk2("{}")'.format(d))
and results
68094986 function calls (68093840 primitive calls) in 27.707 seconds
vs.
45384922 function calls (45383786 primitive calls) in 15.445 seconds
Hope this helps :D
Ah, ok -- I can see why that would be faster. Thanks for the tip. =)
Symbolic links are commonly used for managing directory structure. Also
guppy
does search into symbolic links, I think this should be consistent with other tools.https://github.com/nanoporetech/ont_fast5_api/blob/fab28c4c04c9d0db20c0e45f6dfc6c19b4e49ee6/ont_fast5_api/conversion_tools/conversion_utils.py#L16 simply a
os.walk(..., followlinks=True)
will do well.And I would suggest using
find
on GNU system since it is >10x faster thanos.walk()
, especially when there are millions of fast5 files.