Open styczynski opened 3 years ago
That already looks like an impressive amount of work for a PoC.
A few (disjointed) thoughts:
How is error management handled in your project? We rely on exit() quite heavily, as you have probably noticed already, and in these cases we don't free memory or file handles either. Refactoring that would be a major undertaking.
We would prefer not to give API/ABI stability guarantees; ideally we would only have to worry about keeping the command line interface as stable as possible. Personally, I would have tried auto-generating bindings by extracting information from MMseqsBase.cpp and Parameters.h/cpp, and kept MMseqs2 around as a separate binary.
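To illustrate the separate-binary approach, here is a minimal sketch of a Python wrapper that shells out to a command line tool. Because the tool runs in a child process, an exit() call inside it only ends that child and surfaces in Python as an exception carrying the exit status. The names ToolError, build_command and run are hypothetical helpers, not part of MMseqs2 or of the PoC, and the sketch assumes the tool reports failure through a non-zero exit status.

```python
import subprocess
import sys


class ToolError(RuntimeError):
    """Raised when the wrapped command exits with a non-zero status."""

    def __init__(self, argv, returncode, stderr):
        super().__init__(f"{argv[0]} exited with status {returncode}")
        self.argv = argv
        self.returncode = returncode
        self.stderr = stderr


def build_command(binary, module, positional, options=None):
    """Assemble an argv list: binary, module name, positional args, then flags."""
    argv = [binary, module, *positional]
    for flag, value in (options or {}).items():
        argv += [flag, str(value)]
    return argv


def run(argv):
    """Run the command in a child process and return its stdout as text.

    An exit() inside the tool terminates only the child process; here it
    is turned into a ToolError instead of killing the Python interpreter.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        raise ToolError(argv, proc.returncode, proc.stderr)
    return proc.stdout


if __name__ == "__main__":
    # Stand-in for a failing tool: a child Python process that calls exit(3).
    try:
        run([sys.executable, "-c", "import sys; sys.exit(3)"])
    except ToolError as err:
        print("caught:", err.returncode)
```

With MMseqs2 installed, a call could look like run(build_command("mmseqs", "easy-search", ["query.fasta", "target.fasta", "result.m8", "tmp"], {"--threads": 4})), where the file names are placeholders. The list of available modules and flags is what one would generate from MMseqsBase.cpp and Parameters.h/cpp rather than write by hand.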
Another thing that we only realized a few years after having started MMseqs2 is that we have essentially built a database management system for sequence data. I think if we had the chance to reimagine MMseqs2 as something new and consumable directly through APIs, I would emphasize this aspect.
Do you have any experience with Python and Rust? There is a good chance we will add the first Rust dependency within the next half year or so, and we are interested in investigating a setup in which new modules could be written in Rust instead of C++ (but continue to use the existing C++ classes). This might make the build system much more complicated in the near future.
Suggestion of a feature
It would be extremely useful if mmseqs had native Python bindings. Results from the search could be returned as a pandas DataFrame (the standard tabular format in Python). It would simplify the usage of mmseqs2 as a building block for bioinformatics applications. Such bindings could be implemented using pybind11 (https://github.com/pybind/pybind11).
Currently we are interested in this feature as a whole organisation; however, we don't have enough resources to create and maintain bindings by ourselves. With little to no change, mmseqs2 could also be used as a Python framework for biological sequence manipulation (similar to Biopython or Biotite, but faster and dedicated to large volumes of sequences).
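As a rough sketch of what "results as a table" could look like from the Python side, the snippet below parses tab-separated search hits into a list of records using only the standard library. The column names assume the default BLAST-tab-style layout described in the MMseqs2 documentation and should be checked against the installed version. parse_m8 is a hypothetical helper, and the sample row is made up for illustration.

```python
import csv
import io

# Assumed default column layout of the tab-separated search output.
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]
INT_COLUMNS = {"alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend"}
FLOAT_COLUMNS = {"fident", "evalue", "bits"}


def parse_m8(text):
    """Parse tab-separated hits into a list of dicts with typed values."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        row = dict(zip(M8_COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Made-up hit, for illustration only.
    sample = "q1\tt1\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-30\t180\n"
    for hit in parse_m8(sample):
        print(hit["query"], hit["target"], hit["evalue"])
```

The returned list of dicts can be passed directly to pandas.DataFrame(rows) to get the DataFrame mentioned above, which keeps pandas an optional dependency of the parsing step.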
PoC
We created a very rough PoC with pybind11 to examine how easy it is to create API wrappers. The bindings are far from production-ready; the project only served as proof that this is possible.
Possible collaboration
We want to know if you are interested in helping us develop and maintain the bindings. If so, we would like to collaborate closely to make mmseqs2 more accessible. Bindings would make it easier to use in standard data science pipelines and much more flexible.
Covid Genomics contact email: contact@covidgenomics.com