Question: Thoughts about a standalone Python package/utility?

armish commented 9 years ago

As currently distributed, PhyloWGS is a few C++/Python/HTML utilities bound together (which is completely fine), but this structure makes strong assumptions about the environment the tool runs in, e.g. input/output file names, availability of text, etc. This also makes it harder to make it part of a workflow, as one has to watch for logs/output files and make sure that there are no overlapping outputs from multiple jobs running at the same time.

Two quick questions on this: 1) Is there a specific reason why the mh part of the software was written in C++ instead of Python? 2) Are there any plans to make this utility available as a Python package/script where all jobs are fully customizable (either programmatically or via the command line), so that the state of each run can be observed more easily and each run can be integrated into a common workflow?

armish commented 9 years ago

One more thing:

3) Would it make sense to separate the visualization part as a web service where people can either submit their JSON file to you or run their own visualization server as needed? It is a little cumbersome to move the files over to the witness directory and spin up an HTTP server via Python.

quaidmorris commented 9 years ago

Thanks for the feedback!

jwintersinger commented 9 years ago

Hi Armish,

Thanks for your feedback!

1) mh is in C++ instead of Python to make it faster, as I understand it's a bottleneck. (Amit or Shankar could better comment on this.)
2) I don't quite follow your question about making a standalone utility. What do you mean by "fully customizable"? Can you give me a concrete example of what you'd like, and what you would need to permit integration into a common workflow?
3) I hadn't thought about hosting a visualization web service. This is something we could definitely do, but we've avoided such things until now as mutation names, which Witness will eventually render, indicate SSM chromosome and locus, which is often treated as controlled data. Our rationale in distributing the standalone visualizer was that this would keep the data entirely local on users' systems, and that running the web server wouldn't be unduly difficult, given that it's already included in stock Python installations. We should perhaps revisit this, though, to see if we can simplify the workflow so moving the files into the proper locations is less painful.
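The local-server approach described above can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3.7 or newer; `serve_directory` is a hypothetical helper and not part of PhyloWGS or Witness, and the directory passed in is whichever folder holds the visualizer files. Binding to 127.0.0.1 is one way to keep the data on the user's own machine.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=0):
    """Serve `directory` over HTTP on the loopback interface only.

    Hypothetical helper for illustration. Binding to 127.0.0.1 means other
    machines cannot reach the files; port 0 asks the OS for any free port.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Run the request loop in a background thread so the caller keeps control.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The chosen port is available as `server.server_address[1]`, and `server.shutdown()` followed by `server.server_close()` stops the server. Pointing the helper at an arbitrary directory avoids having to copy files into a fixed location first.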

armish commented 9 years ago

Hi Jeff,

Thanks for your responses. Sorry if my questions were not that clear. Here are some more details:

Our Cycledash tool (https://github.com/hammerlab/cycledash) depends on many Python/JavaScript modules and even on some sysadmin helper utilities, but we are trying to avoid manual installation steps as much as possible so that other people can easily use our software. Cycledash is a tool to investigate VCF files, and I am trying to annotate these files and the variants contained within them with useful information for researchers/clinicians (e.g. variant -> effect, variant -> gene name, etc.).

I am interested in wrapping PhyloWGS as a Cycledash annotator where the input will be a VCF file and the output will be a simple summary of the clonality inferred from that file. This will require me to start a job (worker) whenever a new VCF file is submitted, collect the results, and show them in the user interface when the job is finished.

What I had in mind was to have PhyloWGS as a Python module, so I can programmatically control it without calls to external binaries. This would allow me to load the VCF in memory, evolve the model, and retrieve the results as objects (instead of files), which I could then convert to other formats or save in the database as needed. The current setup makes this hard, as I have to run evolve.py, which depends on mh, and monitor the files and the state of this script to be able to handle job states. Moreover, the results are converted into different file formats (write_results) with the use of other helper scripts. Don't get me wrong: these are all doable programmatically, but it would not be my preferred method (hence my asking for a standalone library).
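The job-state handling mentioned above, as it has to be done today around an external script, might look roughly like this. It is a sketch under assumptions: `Job` and `JobState` are hypothetical names, and the command line is supplied by the caller rather than taken from PhyloWGS. The state comes from the process exit code, not from watching output files.

```python
import enum
import subprocess
import sys


class JobState(enum.Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class Job:
    """Track one external process (hypothetical helper, not a PhyloWGS API)."""

    def __init__(self, command, cwd=None):
        # Start the process immediately; `command` is an argument list.
        self._process = subprocess.Popen(command, cwd=cwd)

    def state(self):
        # poll() returns None while the process is alive, else its exit code.
        code = self._process.poll()
        if code is None:
            return JobState.RUNNING
        return JobState.SUCCEEDED if code == 0 else JobState.FAILED

    def wait(self, timeout=None):
        # Block until the process exits, then report the final state.
        self._process.wait(timeout=timeout)
        return self.state()


if __name__ == "__main__":
    # Stand-in command; a real worker would pass the pipeline's command line.
    job = Job([sys.executable, "-c", "pass"])
    print(job.wait(timeout=30))
```

A worker can call `state()` periodically to update the user interface. An in-process library would make this bookkeeping unnecessary, which is the motivation for the request.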

Ideally, we would have the mh part in Python (although that would mean a compromise in performance) and all script functionality available as methods of a PhyloWGS module (where these methods can be combined to re-create the current scripts). That is also why I asked about isolating the visualization part, as it would be hard to package that within a standalone tool.

Of course, it is completely OK if you guys don't want to go down the standalone utility route (as I am guessing you all have many other things on your plates); I just wanted to explore our options here and get your feedback on alternatives.

Best, -- Arman

AmitDeshwar commented 9 years ago

A pure Python implementation is at least an order of magnitude slower than the current setup.

We could write a wrapper module that takes in a VCF and returns the trees, either as objects or in JSON format, which might accomplish what you want.
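One possible shape for such a wrapper is sketched below. Everything specific here is an assumption for illustration: `infer_trees`, the `build_command` callback, and the `trees.json` file name are hypothetical, and the actual PhyloWGS command line and output layout are not reproduced. The sketch runs the external pipeline in a private temporary directory, so concurrent jobs cannot overwrite each other's files, and hands back parsed JSON instead of file paths.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def infer_trees(vcf_path, build_command, result_name="trees.json"):
    """Run a pipeline on one VCF and return its JSON result as Python objects.

    `build_command(vcf, out)` is a caller-supplied (hypothetical) function
    returning the argument list to execute; `result_name` is also made up.
    """
    vcf = Path(vcf_path).resolve()
    # Each call gets its own directory, removed again when the block exits.
    with tempfile.TemporaryDirectory(prefix="trees-") as workdir:
        out = Path(workdir) / result_name
        # check=True raises CalledProcessError if the pipeline fails.
        subprocess.run(build_command(vcf, out), cwd=workdir, check=True)
        with open(out) as handle:
            return json.load(handle)


if __name__ == "__main__":
    # Stand-in pipeline that writes a small JSON document to the output path.
    def fake_command(vcf, out):
        code = (
            "import json, sys; "
            "json.dump({'input': sys.argv[1], 'trees': []}, open(sys.argv[2], 'w'))"
        )
        return [sys.executable, "-c", code, str(vcf), str(out)]

    print(infer_trees("sample.vcf", fake_command))
```

Because the working directory changes, any script or input path in the command should be absolute, which is why the VCF path is resolved up front.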

armish commented 9 years ago

Aha, I see. Maybe it makes sense to wrap that C++ code for Python, then. I have never tried it before, but since it is just a single file that we are talking about, it shouldn't be that hard. I can give it a shot next week if that sounds OK.
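For readers unfamiliar with the mechanics, calling native code from Python can be illustrated with the standard `ctypes` module. This sketch does not touch mh: it calls `strlen` from the C standard library as a stand-in and assumes a POSIX system. Wrapping C++ this way would additionally require exposing the relevant functions with `extern "C"` linkage and building them into a shared library.

```python
import ctypes
import ctypes.util

# Load the C standard library. On POSIX, if find_library returns None,
# CDLL(None) falls back to symbols already loaded into the interpreter.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes converts arguments and results correctly:
#   size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the UTF-8 byte length of `text`, computed by the C library."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("mh"))
```

Declaring `argtypes` and `restype` matters: without them ctypes assumes `int` return values, which can silently truncate results on 64-bit systems.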

jwintersinger commented 8 years ago

Hi @armish, I'm going to close this because it looks resolved. Open another issue if it isn't.