nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software
Other
149 stars 29 forks source link

`--recursive` option do not search into symbolic links #5

Closed cgjosephlee closed 5 years ago

cgjosephlee commented 5 years ago

Symbolic links are commonly used for managing directory structure. Also guppy does search into symbolic links, I think this should be consistent with other tools.

https://github.com/nanoporetech/ont_fast5_api/blob/fab28c4c04c9d0db20c0e45f6dfc6c19b4e49ee6/ont_fast5_api/conversion_tools/conversion_utils.py#L16 simply a os.walk(..., followlinks=True) will do well.

And I would suggest using find on GNU system since it is >10x faster than os.walk(), especially when there are millions of fast5 files.

fbrennen commented 5 years ago

Hi @cgjosephlee -- good catch; we'll get in symbolic link support.

Regarding os.walk, what version of Python are you using? I see some effort was put into improving this in 3.5, though I don't know how it compares to find:

https://www.python.org/dev/peps/pep-0471/

fbrennen commented 5 years ago

Fixed symbolic links in 1.1.1.

cgjosephlee commented 5 years ago

Hi @fbrennen, Thank you for clarifying the improvement in os.walk(). I'm using 3.6.5 and I notice that the bottleneck was IO of our filesystem.

I have another coding style which is about 40% faster with 4.5 millions files, since os.walk() already returns filenames.

import sys, os
from glob import glob
import cProfile

def func_walk(path):
    file_list = [y for x in os.walk(path, followlinks=True) for y in glob(os.path.join(x[0], '*.fast5'))]
    print(len(file_list))

def func_walk2(path):
    file_list = [os.path.join(x[0], y) for x in os.walk(path, followlinks=True) for y in x[2] if y.endswith('.fast5')]
    print(len(file_list))

d = sys.argv[1]

# func_walk(d)
# func_walk2(d)
cProfile.run('func_walk("{}")'.format(d))
cProfile.run('func_walk2("{}")'.format(d))

and results

68094986 function calls (68093840 primitive calls) in 27.707 seconds
vs.
45384922 function calls (45383786 primitive calls) in 15.445 seconds

Hope this helps :D

fbrennen commented 5 years ago

Ah, ok -- I can see why that would be faster. Thanks for the tip. =)