marbl / Mash

Fast genome and metagenome distance estimation using MinHash
mash.readthedocs.org

CAPI and Python Bindings #49

Open positiveblue opened 7 years ago

positiveblue commented 7 years ago

First of all, congratulations on Mash.

After reading the Mash paper I decided to try it, and I am really impressed with the results I have obtained.

Right now, sketching and comparing samples with Mash is fast and easy. However, it can only be used as a command-line tool. Although a CLI is an excellent solution in many cases, there are others that could benefit from using Mash as a library.

I would like to know what you think about having an external API (in C) that exposes the main functionality and primitives of the library. A C API would be useful because I could then write bindings for other languages. The first language that comes to mind in bioinformatics is Python, so it would be really nice to have a Python library for it.

I have already forked the project and started to expose Mash through a C API. This is going to take me some time, so in the meantime I have used Python's subprocess module to wrap the command-line tool and return the results as Python dictionaries. The branches are python and mashpy respectively.

Do you think it is a good idea? Would you be interested in having this in Mash?

ondovb commented 7 years ago

Yes, this would certainly be good to have. I'm not familiar with the binding process, but is it necessary to expose the lower-level functions (e.g. Hash data structures) for the main interface to function? These seem like implementation details that wouldn't matter to the calling code. I guess it depends on the use case you have in mind, but I would envision Sketch being the central object of an API. We would probably need to make some changes on our end to make that viable, though. If we can draw out an idea of how Mash would be used as a library and what the ideal interface would look like, that would be a good start.

positiveblue commented 7 years ago

Hello @ondovb,

It's good to hear that you are interested in the bindings. One of the best things about Mash is that it does its job, and it does it fast, so the Python version has to meet the same expectations.

There are two ways of calling Mash from Python: writing a Python extension module, or using ctypes. A Python extension module would be overkill, so I suggest using ctypes.

On the one hand, it is true that there is some overhead when calling C functions from Python via ctypes, but for CPU-intensive functions such as Sketch we won't notice it. On the other hand, ctypes is almost non-intrusive to the existing code. If we compile Mash as a shared library (.so/.dylib) that exposes a C API, people could write bindings not only for Python but for any other language.
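As a rough illustration of how little glue ctypes needs, the sketch below loads the C standard library and calls `strlen`. It does not touch Mash: no shared Mash library exists at this point in the thread, so libc stands in for it, on the assumption that a hypothetical `libmash.so` would be loaded and declared in the same way. The helper name `c_strlen` is made up for the example.

```python
import ctypes
import ctypes.util

# Locate the C standard library by name; if that fails, fall back to the
# symbols already loaded into the current process (works on POSIX systems).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature so ctypes converts arguments and results correctly:
#   size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C strlen function on a bytes object."""
    return libc.strlen(data)


print(c_strlen(b"ACGT"))
```

The only per-function work is declaring `argtypes` and `restype`, which is why the per-call overhead matters little for long-running functions.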

You are right about exposing the internal implementation. It is not necessary at all. I started doing it for one main reason: to get used to the Mash code.

As you said, some changes to the Mash code would be necessary. Let me think about it a bit and I will send you a brief outline of what the API should look like and how the Python library would be used.

The good news is that the bindings and the CLI tools would share the same backend, so much of the functionality can be shared. I will write a specification as soon as possible.

Thanks!

positiveblue commented 7 years ago

I am going to implement the first iteration using subprocess in Python.

It is going to look like the final version that uses the C API, but it will call the Mash command-line tool and parse the results. If we do it this way, we can profile the functions/objects and come up with a good API. For now, what I have in mind is something like this:

import mash

input_files = ['fileOne.fna', 'fileTwo.fna']

params = mash.sketchParams()
params.kmer_size(16)

mash_files = mash.sketch(input_files, params)

# mash_files[0] => fileOne.fna.msh
# mash_files[1] => fileTwo.fna.msh

res = mash.dist(mash_files[0], mash_files[1])

# res => { p_value = 0.022, mash_distance = 0.27, matching_hashes = 475, total_hashes = 100 }

...
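A minimal sketch of how the subprocess-based `dist` step could produce such a dictionary, assuming the tab-separated output of `mash dist` (reference, query, distance, p-value, matching/total hashes). The function names, the dictionary keys, and the sample line are illustrative only; they are not taken from the mashpy branch or from real results.

```python
import subprocess


def parse_dist_line(line):
    """Turn one tab-separated line of `mash dist` output into a dictionary."""
    reference, query, distance, p_value, shared = line.rstrip("\n").split("\t")
    matching, total = shared.split("/")
    return {
        "reference": reference,
        "query": query,
        "mash_distance": float(distance),
        "p_value": float(p_value),
        "matching_hashes": int(matching),
        "total_hashes": int(total),
    }


def dist(reference, query, mash_bin="mash"):
    """Run `mash dist` on two inputs and parse every line of its output."""
    completed = subprocess.run(
        [mash_bin, "dist", reference, query],
        check=True,           # raise CalledProcessError if mash fails
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output to str
    )
    return [parse_dist_line(l) for l in completed.stdout.splitlines() if l]


if __name__ == "__main__":
    # Made-up line in the same layout, so the parser can be tried without mash.
    sample = "fileOne.fna\tfileTwo.fna\t0.27\t0.022\t47/1000"
    print(parse_dist_line(sample))
```

Keeping the parser separate from the process call means the same dictionaries could later be filled from a C API without changing calling code.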

I hope to find some time for this project before next week. If you have any constructive feedback, I would really appreciate it :)

ondovb commented 7 years ago

ctypes seems reasonable to me. My one comment about using a wrapper as a prototype is that you may ultimately want slightly lower-level exposure than is available from the mash command line, for example passing in raw sequence instead of files. This could always be adjusted in the API spec later, though.
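To make the idea of a Sketch-centred interface that accepts raw sequence more concrete, here is a toy sketch in pure Python. It is an assumption about what such an interface might look like, not Mash's implementation: the class and method names are hypothetical, it hashes with BLAKE2 from the standard library where Mash uses MurmurHash3, and it does not canonicalize k-mers. The distance uses the formula from the Mash paper, D = -1/k * ln(2j / (1 + j)).

```python
import hashlib
import math


def kmers(sequence, k):
    """Yield every k-mer of a sequence as an uppercase string."""
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2 stands in for MurmurHash3)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class Sketch:
    """Toy bottom-s MinHash sketch built directly from a raw sequence."""

    def __init__(self, sequence, k=16, size=1000):
        self.k = k
        self.size = size
        # Keep only the `size` smallest distinct hash values.
        self.hashes = sorted({hash_kmer(km) for km in kmers(sequence, k)})[:size]

    def jaccard(self, other):
        """Estimate the Jaccard index from the smallest hashes of the union."""
        if (self.k, self.size) != (other.k, other.size):
            raise ValueError("sketches must share k and sketch size")
        union = sorted(set(self.hashes) | set(other.hashes))[:self.size]
        if not union:
            return 0.0
        shared = set(self.hashes) & set(other.hashes)
        return sum(1 for h in union if h in shared) / len(union)

    def distance(self, other):
        """Mash distance D = -1/k * ln(2j / (1 + j)); 1.0 when j is 0."""
        j = self.jaccard(other)
        if j == 0:
            return 1.0
        return max(0.0, -math.log(2 * j / (1 + j)) / self.k)


if __name__ == "__main__":
    a = Sketch("ACGTTGCAAGGCTTAACCGGTTAA", k=4, size=10)
    b = Sketch("ACGTTGCAAGGCTTAACCGGTTAA", k=4, size=10)
    print(a.jaccard(b), a.distance(b))
```

With the sequence passed in as a string, callers never touch files or hash internals, which matches the earlier point that hash data structures need not be exposed.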

positiveblue commented 7 years ago

Hello @ondovb,

I have already implemented a "mock-up" of mashpy that uses subprocess under the hood and parses the output of the command-line tool in Python. I have been using it to get the distances between some fastq files and draw some plots. It was a "dirty" way to implement it, but it has been useful for finding out what is needed. I know other people who have used it, and one of them had a problem with the Python version (it only works with Python 2.7 because of the string incompatibility between Python 2.7 and 3.x). To solve those problems, it is better to write the real first version directly.

I am going to start the implementation, and I would like to discuss some details with you to be sure that I am doing it in a way that aligns with your plans. The roadmap that I have in mind is:

I can do it all at once, but I guess it is better to do it in chunks of features (for example, first write a C++ API only for sketching the genomes and then expose it to C/Python).
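The thread does not say exactly where the string incompatibility came from. One common cause with subprocess-based wrappers, assumed here, is that `subprocess.check_output` returns a byte string, which Python 2.7 treats as ordinary `str` but Python 3 does not. The hypothetical helper below decodes explicitly so that parsing code sees text on both versions; the Python interpreter stands in for the mash binary so the example runs without Mash installed.

```python
import subprocess
import sys


def run_text(command):
    """Run a command and return its standard output as text.

    check_output returns a byte string on both Python 2.7 and 3.x;
    decoding it explicitly gives the parsing code text on either version.
    """
    raw = subprocess.check_output(command)
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw


if __name__ == "__main__":
    # Stand-in command: prints one tab-separated line, as a CLI tool might.
    output = run_text([sys.executable, "-c", "print('0.27\\t0.022')"])
    print(output.strip().split("\t"))
```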

My questions are related to:

I am aware that there is a lot of information here, but I would like to know your thoughts on these points so that I can proceed accordingly.

As always, I am open to discussing or detailing anything I have described here, so do not hesitate to ask.

Thanks,

Jordi.