ggonnella / gfapy

Gfapy: a flexible and extensible software library for handling sequence graphs in Python

Taking very long to load GFA1 from file #23

Closed fawaz-dabbaghieh closed 2 years ago

fawaz-dabbaghieh commented 2 years ago

I was trying to load a GFA1 file with gfapy, but I had to kill the process because it had been running for over 15 minutes without finishing. I am not sure what could be wrong.

The GFA is a de Bruijn graph and is the output of convertToGFA.py, a script that converts the contigs from bcalm2 output into a valid GFA1 file. This graph has 944785 nodes and 2419232 edges.

Minimal example here:

import gfapy
import time

input_file = "sk1_y12_yeast_k43.gfa"

start = time.perf_counter()
graph = gfapy.Gfa.from_file(input_file)
print(f"it took {time.perf_counter() - start} seconds to load the file")

Is it supposed to take this long?
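For reference, the node and edge counts quoted above can be checked without building a gfapy object. The following is a minimal sketch, not part of gfapy: it assumes a standard tab-separated GFA1 file in which the first field of each line is the record type (S for segments, L for links), and the helper names count_gfa1_records and count_gfa1_file are hypothetical.

```python
from collections import Counter


def count_gfa1_records(lines):
    """Count GFA1 records by record type (first tab-separated field)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        record_type = line.split("\t", 1)[0]
        counts[record_type] += 1
    return counts


def count_gfa1_file(path):
    """Stream a GFA1 file line by line and count its record types."""
    with open(path) as handle:
        return count_gfa1_records(handle)
```

Calling count_gfa1_file("sk1_y12_yeast_k43.gfa") would return a Counter whose "S" and "L" entries give the number of segment and link lines. Since the file is streamed and nothing is validated or stored, this only gives a rough baseline for how long a plain read of the file takes.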

ggonnella commented 2 years ago

Difficult to tell, it depends on the graph and on the system, but indeed the graph is relatively large and Gfapy is currently written entirely in Python, so it has its limits...

Maybe you could try to set vlevel=0 in the from_file call? This disables validations, but should then be faster.
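Applied to the minimal example from the question, the suggestion might look like the sketch below. The vlevel keyword of gfapy.Gfa.from_file is the one mentioned above; the helper names time_call and load_without_validation are hypothetical and only serve to keep the timing code reusable. gfapy is imported inside the function, so it is only required when the graph is actually loaded.

```python
import time


def time_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def load_without_validation(path):
    """Load a GFA file with validations disabled (vlevel=0) and time it."""
    import gfapy  # requires gfapy to be installed

    # vlevel=0 disables validations, as suggested above
    return time_call(gfapy.Gfa.from_file, path, vlevel=0)
```

Usage would be graph, seconds = load_without_validation("sk1_y12_yeast_k43.gfa"). How much time this saves depends on the graph and on the system, and with validations off a malformed file may not be reported at load time.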

fawaz-dabbaghieh commented 2 years ago

I see! Thank you for the very quick response!

ggonnella commented 2 years ago

Alternatively, you could consider using my library textformats (not yet published, but publicly available), which also has a Python interface and a GFA1 specification (file https://github.com/ggonnella/textformats/spec/gfa/gfa1.yaml). It is written in Nim and is much faster for large files.

ggonnella commented 2 years ago

However, it does not offer all the graph operations that gfapy offers, since it is a generic library.