anderspitman opened this issue 6 years ago
We could use Cython to create Python modules for the C++ libraries that don't have Python modules already available. The process for creating a Python module from a C++ library with Cython is pretty straightforward:
1) Set up a Python project that uses the distutils and cython modules and specifies the C++ files
2) Create a Python file which basically declares the interface of the C++ class being used. The external C++ class exposes whatever functions it needs to perform the computations in Python. For instance, it might have a function called getSingleEndMapping which would take in file arguments and return a data buffer.
3) Create a Python file that ties the modules together and would most likely parse the command line options, I think.
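Step 1 above could look something like this build-file sketch. This assumes Cython is installed and a hypothetical ngm_wrapper.pyx exists that declares the C++ class via `cdef extern from`; the module and file names are placeholders, not the real NGM sources.

```python
# setup.py -- minimal sketch; "ngm_wrapper" and the .pyx/.cpp file names are
# hypothetical placeholders, not the actual NGM source layout.
from distutils.core import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "ngm_wrapper",
        sources=["ngm_wrapper.pyx", "NGM.cpp"],  # wrapper plus the C++ implementation
        language="c++",
    )
]

setup(name="ngm_wrapper", ext_modules=cythonize(extensions))
```

The extension would then be built with `python setup.py build_ext --inplace` and imported from Python like any other module.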
There are two issues that stand out to me though: 1) passing data between C++ and Python can be tricky and error-prone, and 2) I'm not sure whether some of the tools (e.g. NextGenMap) build any libraries to link against, so we might need to compile them ourselves.
I think it'd be interesting to try, as it would help organize all the different dependencies and allow for a more OOP design, which would help #36
I am skeptical that cython would help us out here. The best candidate to be ported to Python is the Bash-based front end, and we don't need cython for that. Unless I missed something.
On Tue, Oct 10, 2017 at 3:55 PM, Zach notifications@github.com wrote:
This is what I had in mind. It's still a work in progress (still trying to figure out the NGM API), but it would allow us to use NGM as a Python module, and that's what I think might help in the process of porting the Bash code to Python. As far as I know there isn't an NGM Python module (which is also the case for Bowtie and some of the assemblers, I think).
@rachelss, what is your opinion about migrating the main bash script to python?
With Whitezed mostly wrapping up, I'm in a good place to start working on this.
Looking long term I like @zmertens idea of potentially creating Python wrappers for some of the C++ assemblers and other projects we're using. The advantage is that it lets users automatically install these dependencies. The problem is we would then have to maintain packages (probably conda packages) for these tools. It's something to look into though.
For now it's fine to just invoke them through subprocesses.
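The subprocess route could be wrapped in a small helper along these lines. This is a minimal sketch; the ngm flags shown are placeholders for illustration, not the verified NGM command line.

```python
import shlex
import subprocess

def run_tool(cmd):
    """Run an external tool, raising on a non-zero exit and returning its stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout

# Hypothetical aligner invocation, built as an argument list to avoid shell-quoting issues.
ngm_cmd = ["ngm", "-r", "reference.fa", "-q", "reads.fq", "-o", "out.sam"]
print("would run:", shlex.join(ngm_cmd))
```

Passing an argument list (rather than a shell string) sidesteps quoting bugs when file names contain spaces, and `check=True` surfaces tool failures immediately instead of silently continuing the pipeline.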
Anyone ever used Luigi? Looks interesting. I'm checking it out now.
Converting dependent programs to Cython is not going to provide enough of a performance boost to merit the effort involved. The majority of time is spent inside the external programs, and reducing process invocations is not going to yield any noticeable improvement.
@reedacartwright, @zmertens, and I discussed wrapping offline and decided that the programs in question are common enough in the community that we can require separate installation, rather than trying to maintain Python wrappers ourselves.
@rachelss would it be ok to target Python 3 with the port?
Yes we need a complete upgrade to python3. Python 2 is now officially on its way out.
Excellent. Do you happen to know which version of Python 3 it's safe to require of the intended users?
I'm not sure. Can we use 3.6.3? What can we do to ensure backwards compatibility? Or to throw errors if someone is running an old version and there's a compatibility issue? Can we use Docker containers to minimize issues?
I agree that the programs we utilize are common and likely to be installed already by most users of SISRS.
3.6 would be fantastic. That includes every feature I might want to use. It really depends on what you want to support. It's a balancing act between using modern language features and supporting users with old systems. We can certainly aim for 3.6 and issue warnings for anyone using an older python. However, the only real solution for those users would be to install something like Miniconda. In my opinion everyone should be using it anyway, but your users might not agree.
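A runtime guard along these lines could issue the error for older interpreters; the minimum version and message text here are illustrative, not a decided interface.

```python
import sys

MIN_VERSION = (3, 6)

def check_python_version(minimum=MIN_VERSION):
    """Exit with a clear message if the running interpreter is too old."""
    if sys.version_info < minimum:
        sys.exit(
            "This tool requires Python %d.%d or newer; you are running %s."
            % (minimum[0], minimum[1], sys.version.split()[0])
        )

check_python_version()  # no-op on a new enough interpreter
```

Comparing `sys.version_info` against a tuple works in both Python 2 and 3, so the check itself would still fire a readable error even for users launching the tool under a legacy interpreter.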
@rachelss @BobLiterman just wanted to give an update on this. I've been steadily working on the port, and so far I've basically implemented alignContigs, identifyFixedSites, and outputAlignment. I should be on track to pretty much wrap up the sites functionality fairly soon. I'm also adding high-level tests as I go to make sure each command behaves basically the same as bin/sisrs (based on the excellent small dataset @BobLiterman provided). I decided to stick with Python 2 for now to avoid changing the current scripts too much. Also, the library I'm using for argument parsing (Click) doesn't seem to like Python 3 yet. I'm actually thinking about moving away from Click though. Let me know if you have any questions or requests.
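If Click gets dropped, the stdlib argparse could cover the same subcommand shape. The command and option names below are assumed from the thread for illustration, not the actual interface.

```python
import argparse

def build_parser():
    # Hypothetical subcommand layout mirroring the bin/sisrs steps.
    parser = argparse.ArgumentParser(prog="sisrs")
    subparsers = parser.add_subparsers(dest="command")
    align = subparsers.add_parser("alignContigs")
    align.add_argument("-p", "--processors", type=int, default=1)
    subparsers.add_parser("identifyFixedSites")
    return parser

args = build_parser().parse_args(["alignContigs", "-p", "4"])
```

argparse has no external dependency and identical behavior on Python 2.7 and 3.x, which would remove the Click/Python 3 friction mentioned above.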
Oh, I've also created a Dockerfile for development and running the tests. This solves the CI issues I was working on with @BobLiterman a while back (see #37). It also should make it much easier for users to get SISRS up and running with all the dependencies. All they need is Docker installed. You can see all my changes on this branch: https://github.com/anderspitman/SISRS/tree/python-port
I have been tinkering with some memory reduction adjustments (mainly to getPrunedDict and outputAlignment) and the Python port will really help streamline things.
Don't worry about incorporating these into port v1, as they are easy to plug in down the road. I just wanted to show you what was new on our side. https://github.com/BobLiterman/SISRS/tree/MemorySaver?files=1
@rachelss @BobLiterman I've completed porting of alignContigs, identifyFixedSites, outputAlignment, and changeMissing. The code isn't super clean (and still shells out to unix tools a lot more than it needs to), but I do have integration-level tests in place so refactoring shouldn't be too dangerous.
At this point I think I'm about ready to officially merge in the python port in some capacity, and it's probably time to start discussing what route forward we want to take. A few thoughts I've had:
We could simply merge in my branch and have the bash/python versions live side-by-side (currently I have it set up so pip installs both sisrs and sisrs-python scripts). My concern with this, though, is that if a user runs into a missing feature in the python version, they'll simply switch to the bash version. Or just never use the python version in the first place.
One alternative would be to put the python code in its own repo. We could call this py-sisrs or something like that. Bumping it to SISRS2 might be a better option, and would probably help with user adoption.
In order to really get this port off the ground, we're going to need at least one new killer feature. No one (including you guys, I would guess) is going to use this thing if it doesn't bring anything new to the table, because right now the only real advantages have to do with an improved development experience and more maintainable code. Plus there are still going to be a lot of missing features and other warts in the short term. So can you guys think of anything I could implement that would convince you to start using the python port today, in spite of the drawbacks? Maybe running on multiple nodes?
My goal at this point is that any new features get added to the python port, rather than the bash script. So really my question is what do I need to do to get the port to a point where that's feasible in your eyes?
What about something like a 'scheduler mode' where if the user was running SISRS on a multi-node cluster with a scheduler (Torque, SLURM, etc), the program could auto-generate scripts to be submitted (based on a template script supplied)? Then, steps like Bowtie could be run in parallel through automated script submission? Just spitballing things I've thought of.
Hey @BobLiterman, I've been reading up on distributed computing, since that does seem like an obvious big feature to add. For your proposed scheduler mode, would we need to implement compatibility with multiple schedulers, or would SLURM be sufficient? Also, SISRS uses pretty large data files. How is the data normally distributed for computations like this?
Theoretically, one could provide a dummy script for whatever scheduler they have, and the SISRS script could use that to generate multiple HPC scripts, independent of scheduler. The user supplies the dummy script and the submission command [sbatch, qsub, etc.] as arguments.
In terms of data, we may be able to generate some guidelines based on the data to estimate final data sizes?
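The dummy-script idea could be as simple as template substitution. In this sketch, the SLURM directives stand in for a user-supplied template; nothing here is what SISRS generates today.

```python
from string import Template

# A user-supplied scheduler template; ${...} fields are filled in per job.
SLURM_TEMPLATE = Template("""#!/bin/bash
#SBATCH --job-name=${name}
#SBATCH --ntasks=${ntasks}
${command}
""")

def render_job(name, ntasks, command, template=SLURM_TEMPLATE):
    """Fill one job script; submission would then shell out to e.g. sbatch."""
    return template.substitute(name=name, ntasks=ntasks, command=command)

script = render_job("bowtie_sp1", 4, "bowtie2 -x ref -U sp1.fq -S sp1.sam")
```

Because SISRS only substitutes fields and never parses the scheduler directives, the same code would work for Torque, SLURM, or anything else the user's template targets.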
That scheduler script conversion would be slick, but likely a lot of work. You're essentially transpiling between multiple languages. Getting something working would probably be pretty straightforward, but you could get killed by the corner cases. Not saying it isn't worth doing.
For data, my question is more general. If I do a run of SISRS on my local machine, all the input data is there on my hard drive. If I'm trying to distribute that work across nodes, some or all of that data is going to have to be made available to those nodes. How does that usually work? I have very little cluster programming exposure.
A) Scheduler scripts just make system calls, so it really wouldn't be hard to implement. I actually do it myself manually now. I have a script generator I can share to give you a sense of it.
B) Data on a cluster is also stored such that it's accessible by all nodes typically. No worries there.
A) ok maybe I'm misunderstanding the nature of the problem. I'd like to take a look at that generator for sure
B) excellent
Emailed. Couldn't attach here from phone
Thanks!
@BobLiterman interacting with job schedulers seems doable, but I'm concerned it might not be the biggest bang for the buck. The risk we run is the entire python port never being used. If SISRS is going to go away in the not-too-distant future, then this isn't too big of a deal. But I think we have bigger aspirations for the project, and I really think SISRS will be much more maintainable and extendable if it's written in Python. What I'm trying to do right now is get the port over that last hump into viability. But since I don't use SISRS on a daily basis I can't say what the best way to do that would be. Do you have any other ideas? @rachelss any input?
Again, I think the short term goal should be to get to a point where when you reach for SISRS, you reach for the Python version instead of the old version. What would it take to get to that point?
I think we need to find a time to pause work with the bash version, make the jump to python, and start working with it and fixing bugs. @BobLiterman is the only one using it regularly so it's up to him when he can stop. I'd say sooner.
@BobLiterman when do you want to make the jump to the all-python version and start troubleshooting?
Functionally, I'm out of SISRS right now as I'm onto the phylogenetics analysis, so I have no immediate plans to change base-SISRS.
If everyone is on board, I can start to incorporate the new features (streaming pileups, etc) into Anders' port so we can get the new all-Python official version.
That sounds good to me, and I'm happy to help with the process. Biggest question in my mind is whether we merge the python code here or start a new repo and call it SISRS2 or something. If we merge it here we really should get rid of the bin/sisrs bash script, but that doesn't make sense to me since there's still a lot of missing features (i.e. loci), even if they aren't used as much.
I think you should merge the python code here so users will find it from the paper.
On Feb 21, 2018 15:21, "Anders Pitman" notifications@github.com wrote:
Let's mark the bash version as a release. That will make it easy to checkout if needed. We could also put it on its own dead-end branch.
Alright folks. Here's the update.
1) I created two SISRS releases of the Bash version, one pre- and one post-memory-streamlining implementation (1.6.1 and 1.6.2).
2) I created a new branch of Rachel's SISRS (Python_Port) to bring in Anders' Python port, which will become SISRS 2.0 once everything is troubleshot and appears to be in working order.
Best, Bob
@rachelss I'm assuming the assembler functionality is still desirable to have built in to SISRS? That's one of the biggest features still missing. I'm currently starting work on porting that over with velvet as the default, but with a plugin system to make it easy for other developers to add additional assemblers in the future.
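One way the plugin system could look is a simple registry keyed by assembler name. This is a sketch only; the class names are invented and the velveth arguments in particular are illustrative and should be checked against the Velvet docs.

```python
ASSEMBLERS = {}

def register_assembler(name):
    """Class decorator that adds an assembler plugin to the registry."""
    def decorator(cls):
        ASSEMBLERS[name] = cls
        return cls
    return decorator

@register_assembler("velvet")
class VelvetAssembler:
    def command(self, reads, outdir, kmer=31):
        # Argument list handed to subprocess; flags are illustrative placeholders.
        return ["velveth", outdir, str(kmer), "-fastq", reads]

def get_assembler(name="velvet"):
    """Look up and instantiate a registered assembler plugin."""
    return ASSEMBLERS[name]()
```

Other developers would then add an assembler by defining one decorated class, with no changes to the pipeline code that calls `get_assembler`.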
@BobLiterman do you by chance have a pre-assembled version of the test data I could use to test the assemblers as I tackle this?
The test data is already assembled as 'premade' (contigs.fa in premadeoutput).
The data in the species folder could be used to assemble a de novo genome, but that process (to my knowledge) is NOT deterministic, especially if using the read subsampling script.
The assembly step is a great addition to the pipeline, but it adds a lot of overhead with respect to resources/run time. That's the whole reason I allowed for the side-stepping of it through the premade option. But from the user's perspective, the ability to go from reads to alignment in one step is attractive.
Also, it may be beneficial to replace the read subsampling module with a series of BBMap calls (reformat.sh). When I was running SISRS on whole-genome data, the read subsampling step would take MUCH longer than a reformat.sh call (which can subset reads from a .fq OR .fq.gz). Given a user-supplied genome size argument and a number of species, figuring out the number of bases required for ~10X is straightforward, and this would allow gzipped files to be used as opposed to just uncompressed FASTQ files.
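The coverage arithmetic is straightforward; a hedged sketch is below. samplebasestarget is a real reformat.sh option, but the surrounding command and the helper function are illustrative, not SISRS code.

```python
def bases_per_sample(genome_size, coverage=10, n_samples=1):
    """Bases each dataset should contribute so the pooled data reaches ~coverage X."""
    return genome_size * coverage // n_samples

# e.g. a 3 Gb genome, 10X target coverage, 6 species contributing reads
target = bases_per_sample(3_000_000_000, coverage=10, n_samples=6)

# Illustrative BBMap call; reformat.sh accepts .fq or .fq.gz input directly.
cmd = ["reformat.sh", "in=reads.fq.gz", "out=subset.fq.gz", f"samplebasestarget={target}"]
```

For the example numbers, each of the 6 species would contribute 5 Gb of sequence, so the pooled composite reaches roughly 10X over a 3 Gb genome.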
@anderspitman @reedacartwright @rachelss
What do you think about setting up a virtual meeting to chat about current progress of the port and future directions? With lots of parallel collaboration happening now, which is a first for this former lone-wolf, it would help me out a lot to lay down some concrete plans about what needs to be done and by whom.
Monday perhaps?
Bob
Next week is not a good week for me. I have more openings the week after.
I'm pretty flexible. Tue, Thu, Fri are best in general.
How about the 3rd? 3:00 ET = 12:00 AZ?
Can we push it back to 1:00pm AZ? My class ends at noon that day, and I would like to eat before I participate in a potentially long conference call.
Availability: http://links.asu.edu/CartwrightCalendar
On Thu, Mar 22, 2018 at 12:59 PM, rachelss notifications@github.com wrote:
Works for me 4:00 ET
Unfortunately I have to pick my son up from daycare by 5pm ET, so if it's possible to split the difference and call it 3:30 ET/12:30 AZ, I could at least be in for the beginning.
Monday is probably not going to work for me after all. I can do something in mid april after my tenure packet is submitted.
WARNING: Given the same input data, the bash and Python versions have different final outputs.
RAL_Memory
11724 total variable sites (alignment.nex)
5111 variable sites are singletons
6428 total biallelic sites excluding singletons (alignment_bi.nex)
6613 total variable sites excluding singletons (alignment_pi.nex)
With 6 taxa allowed to be missing, 11724 sites from alignment.nex (6 allowed missing) are reduced to 11724 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6613 sites from alignment_pi.nex (6 allowed missing) are reduced to 6613 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6428 sites from alignment_bi.nex (6 allowed missing) are reduced to 6428 sites (0 sites or 0.00% lost)
RAL_SE
11724 total variable sites (alignment.nex)
5111 variable sites are singletons
6428 total biallelic sites excluding singletons (alignment_bi.nex)
6613 total variable sites excluding singletons (alignment_pi.nex)
With 6 taxa allowed to be missing, 11724 sites from alignment.nex (6 allowed missing) are reduced to 11724 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6613 sites from alignment_pi.nex (6 allowed missing) are reduced to 6613 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6428 sites from alignment_bi.nex (6 allowed missing) are reduced to 6428 sites (0 sites or 0.00% lost)
Current_Python
11728 total variable sites (alignment.nex)
5153 variable sites are singletons
6391 total biallelic sites excluding singletons (alignment_bi.nex)
6575 total variable sites excluding singletons (alignment_pi.nex)
With 6 taxa allowed to be missing, 11728 sites from alignment.nex (6 allowed missing) are reduced to 11728 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6575 sites from alignment_pi.nex (6 allowed missing) are reduced to 6575 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6391 sites from alignment_bi.nex (6 allowed missing) are reduced to 6391 sites (0 sites or 0.00% lost)
ADDITION: All scripts were run twice and yielded the same results, so it's not randomness somewhere in the pipeline. It's a deterministic mismatch.
Weird. What data are you using? I'll look into it
@reedacartwright and I wanted to start a conversation about possibly porting the bash portions of SISRS to Python. We both feel this would make it easier to maintain in the long run, but the work to do it may certainly be nontrivial. This is something I could possibly do myself or at least come up with a process whereby we all do it incrementally. Initial thoughts @rachelss?