Python bindings for mmseqs2

soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

MIT License

Suggestion of a feature

It would be extremely useful if MMseqs2 had native Python bindings. Search results could be returned as a pandas DataFrame (the standard tabular format in Python). This would simplify using MMseqs2 as a building block for bioinformatics applications. Such bindings could be implemented using pybind11 (https://github.com/pybind/pybind11).
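To illustrate the kind of tabular result meant here, below is a minimal sketch that parses tab-separated search output into typed records. It assumes the 12-column BLAST-like layout that MMseqs2 writes by default (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits). The function name `parse_m8` and the column constants are hypothetical and not part of MMseqs2.

```python
# Assumption: the default 12-column BLAST-like tabular layout.
# A custom output format would need a different column list.
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]
INT_COLUMNS = ("alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend")
FLOAT_COLUMNS = ("fident", "evalue", "bits")


def parse_m8(text):
    """Parse tab-separated alignment lines into a list of typed dicts.

    Hypothetical helper for illustration only.
    """
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != len(M8_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(M8_COLUMNS)} columns, "
                f"got {len(fields)}"
            )
        row = dict(zip(M8_COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        rows.append(row)
    return rows
```

The returned list of dictionaries can be passed directly to `pandas.DataFrame(...)` to obtain a DataFrame with one row per alignment.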

Our whole organisation is currently interested in this feature; however, we don't have enough resources to create and maintain the bindings by ourselves. With little to no change, MMseqs2 could also be used as a Python framework for manipulating biological sequences (similar to Biopython or Biotite, but faster and designed for large volumes of sequences).

Possible collaboration

We would like to know whether you are interested in helping us develop and maintain the bindings. If so, we would like to collaborate closely to make MMseqs2 more accessible. Bindings would make it much more flexible and easier to use in standard data science pipelines.

That already looks like an impressive amount of work for a PoC.

A few (disjointed) thoughts:

How is error management handled in your project? As you have probably noticed, we rely quite heavily on exit(), and in those cases we no longer free memory or close file handles either. Refactoring that would be a major undertaking.
We would prefer not to give API/ABI stability guarantees; ideally, we would only have to worry about keeping the command-line interface as stable as possible. Personally, I would have tried auto-generating bindings by extracting information from MMseqsBase.cpp and Parameters.h/cpp, and kept MMseqs2 around as a separate binary.
Another thing we only realized a few years after starting MMseqs2 is that we have essentially built a database management system for sequence data. If we had the chance to reimagine MMseqs2 as something new and directly consumable through APIs, I would emphasize this aspect.
Do you have any experience with Python and Rust? There is a good chance we will add the first Rust dependency within the next half year or so, and we are interested in investigating a way for new modules to be written in Rust instead of C++ (while continuing to use the existing C++ classes). This might make the build system much more complicated in the near future.
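As a rough illustration of the "separate binary" approach mentioned above, here is a minimal sketch of a Python wrapper that invokes the command-line tool as a child process. Because the tool runs in its own process, a call to exit() there only ends the child, and the wrapper can turn a non-zero exit status into a Python exception. The names `MMseqsError`, `build_command` and `run_mmseqs` are hypothetical, and the sketch assumes an `mmseqs` executable is available on the PATH. It does not auto-generate anything from MMseqsBase.cpp or Parameters.h/cpp.

```python
import subprocess


class MMseqsError(RuntimeError):
    """Raised when the child process exits with a non-zero status."""


def build_command(binary, module, positional, options=None):
    """Assemble the argument list: binary, module, positional args, then flags."""
    cmd = [binary, module, *(str(arg) for arg in positional)]
    for flag, value in (options or {}).items():
        cmd += [flag, str(value)]
    return cmd


def run_mmseqs(module, positional, options=None, binary="mmseqs"):
    """Run one module of the command-line tool and return its standard output.

    Assumes `binary` can be found on the PATH; subprocess.run raises
    FileNotFoundError otherwise.
    """
    cmd = build_command(binary, module, positional, options)
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # The child has already terminated; the Python process keeps running.
        raise MMseqsError(
            f"{' '.join(cmd)} exited with status {proc.returncode}: "
            f"{proc.stderr.strip()}"
        )
    return proc.stdout


# Example call (file names are placeholders):
# run_mmseqs("easy-search", ["query.fasta", "target.fasta", "result.m8", "tmp"])
```

With this design, only the command-line interface needs to stay stable, and no API/ABI guarantees are required from the C++ code. The trade-off is that all data passes through files or pipes rather than shared memory.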

Python bindings for mmseqs2 #450
