soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Python bindings for mmseqs2 #450

Open styczynski opened 3 years ago

styczynski commented 3 years ago

Suggestion of a feature

It would be extremely useful if mmseqs2 had native Python bindings. Search results could be returned as a pandas DataFrame (the standard tabular format in Python). This would simplify using mmseqs2 as a building block for bioinformatics applications. Such bindings could be implemented using pybind11 (https://github.com/pybind/pybind11).
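To illustrate the kind of result handling meant here, below is a minimal sketch (not part of the PoC) that parses tab-separated search output into typed rows that can be handed to pandas. It assumes the default 12-column BLAST-tab-like layout documented for MMseqs2 search output; `parse_m8` and `to_dataframe` are hypothetical helper names, not an existing mmseqs2 API.

```python
# Assumed column order: the default 12-column tabular output of MMseqs2.
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]
INT_COLUMNS = {"alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend"}
FLOAT_COLUMNS = {"fident", "evalue", "bits"}


def parse_m8(text):
    """Parse tab-separated search hits into a list of typed dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != len(M8_COLUMNS):
            raise ValueError(
                f"expected {len(M8_COLUMNS)} columns, got {len(fields)}"
            )
        row = {}
        for name, value in zip(M8_COLUMNS, fields):
            if name in INT_COLUMNS:
                row[name] = int(value)
            elif name in FLOAT_COLUMNS:
                row[name] = float(value)
            else:
                row[name] = value  # query and target identifiers stay strings
        rows.append(row)
    return rows


def to_dataframe(rows):
    """Convert parsed rows to a pandas DataFrame (requires pandas to be installed)."""
    import pandas as pd  # imported lazily so parsing works without pandas

    return pd.DataFrame(rows, columns=M8_COLUMNS)
```

Native bindings would make this intermediate text parsing unnecessary; the sketch only shows the target shape of the data, one row per hit with typed columns.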

Our whole organisation is currently interested in this feature, but we don't have enough resources to create and maintain the bindings by ourselves. With little to no change, mmseqs2 could also be used as a Python framework for manipulating biological sequences (similar to Biopython or Biotite, but faster and dedicated to large volumes of sequences).

PoC

We created a very rough PoC with pybind11 to examine how easy it is to create API wrappers. The bindings are far from production-ready; the project only served as proof that this is possible.

Possible collaboration

We would like to know whether you are interested in helping us develop and maintain the bindings. If so, we would welcome close collaboration to make mmseqs2 more accessible. Bindings would make it much more flexible and easier to use in standard data science pipelines.

Covid Genomics contact email: contact@covidgenomics.com

milot-mirdita commented 3 years ago

That already looks like an impressive amount of work for a PoC.

A few (disjointed) thoughts: