rachelss / SISRS

Site Identification from Short Read Sequences.

Port to Python #36

Open anderspitman opened 6 years ago

anderspitman commented 6 years ago

@reedacartwright and I wanted to start a conversation about possibly porting the bash portions of SISRS to Python. We both feel this would make it easier to maintain in the long run, but the work to do it may certainly be nontrivial. This is something I could possibly do myself or at least come up with a process whereby we all do it incrementally. Initial thoughts @rachelss?

anderspitman commented 6 years ago

Advantages

Hurdles

zmertens commented 6 years ago

We could use Cython to create Python modules for the C++ libraries that don't have Python modules already available. The process for creating a Python module from a CPP library with Cython is pretty straightforward:

  1. Set up a Python project that uses the distutils and cython modules and specifies the CPP files.
  2. Create a Python file which basically defines the implementation of the CPP class being used. The external CPP class exposes whatever functions it needs to perform the computations in Python. For instance, it might have a function called getSingleEndMapping which would take in file arguments and return a data buffer.
  3. Add a Python file that handles all the modules and would most likely parse the command line options, I think.

There are two issues that stand out to me though:

  1. Passing data between CPP and Python can be tricky and error-prone.
  2. I'm not sure whether some of the dependencies (NextGenMap) build any libraries to link against, so we might need to compile them ourselves.

I think it'd be interesting to try as I think it would help organize all the different dependencies and allow for more OOP design which would help #36

reedacartwright commented 6 years ago

I am skeptical that cython would help us out here. The best candidate to be ported to Python is the Bash-based front end, and we don't need cython for that. Unless I missed something.


zmertens commented 6 years ago

This is what I had in mind. It's still a work in progress (I'm still trying to figure out the NGM API), but it would allow us to use NGM as a Python module, and that's what I think might help in the process of porting the Bash code to Python. As far as I know there isn't an NGM Python module (which is also the case for Bowtie and some of the assemblers, I think).

reedacartwright commented 6 years ago

@rachelss, what is your opinion about migrating the main bash script to python?

anderspitman commented 6 years ago

With Whitezed mostly wrapping up, I'm in a good place to start working on this.

anderspitman commented 6 years ago

Looking long term, I like @zmertens' idea of potentially creating Python wrappers for some of the C++ assemblers and other projects we're using. The advantage is that it lets users automatically install these dependencies. The problem is we would then have to maintain packages (probably conda packages) for this software. It's something to look into though.

anderspitman commented 6 years ago

For now it's fine to just invoke them through subprocesses.
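As a rough illustration of the subprocess approach, here is a minimal sketch using only the standard library. The helper name `run_tool` is hypothetical, and the child command is a stand-in so the example runs anywhere; a real call would name an aligner or assembler binary instead.

```python
import subprocess
import sys


def run_tool(args):
    """Run an external program and return its standard output as text.

    Raises subprocess.CalledProcessError if the program exits non-zero.
    """
    result = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str (Python 3.5+)
        check=True,               # turn a non-zero exit code into an exception
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in child process; replace with the real tool and its arguments.
    print(run_tool([sys.executable, "-c", "print('hello from a child process')"]))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file paths, and `check=True` means a failed tool stops the pipeline instead of silently producing empty output.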

anderspitman commented 6 years ago

Anyone ever used Luigi? Looks interesting. I'm checking it out now.

reedacartwright commented 6 years ago

Converting dependent programs to Cython is not going to provide enough of a performance boost to merit the effort involved. The majority of the time is spent inside the external programs, and reducing process invocations is not going to yield any noticeable speedup.

anderspitman commented 6 years ago

@reedacartwright, @zmertens, and I discussed wrapping offline and decided that the programs in question are common enough in the community to require separate installation, rather than trying to maintain python wrappers ourselves.

anderspitman commented 6 years ago

@rachelss would it be ok to target Python 3 with the port?

rachelss commented 6 years ago

Yes we need a complete upgrade to python3. Python 2 is now officially on its way out.

anderspitman commented 6 years ago

Excellent. Do you happen to know which version of Python 3 it's safe to require of the intended users?

rachelss commented 6 years ago

I'm not sure. Can we use 3.6.3? What can we do to ensure backwards compatibility? Or to throw errors if someone is running an old version and there's a compatibility issue? Can we use Docker containers to minimize issues?

I agree that the programs we utilize are common and likely to be installed already by most users of SISRS.

anderspitman commented 6 years ago

3.6 would be fantastic. That includes every feature I might want to use. It really depends on what you want to support. It's a balancing act between using modern language features and supporting users with old systems. We can certainly aim for 3.6 and issue warnings for anyone using an older python. However, the only real solution for those users would be to install something like Miniconda. In my opinion everyone should be using it anyway, but your users might not agree.
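A version check along these lines could issue the warning. This is only a sketch: the function name and the (3, 6) minimum are assumptions taken from the discussion above, and the check deliberately avoids newer syntax such as f-strings so that older interpreters can still parse the file and reach the warning.

```python
import sys
import warnings

MIN_PYTHON = (3, 6)  # assumed minimum, per the discussion above


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum version; warn otherwise."""
    if version_info is None:
        version_info = sys.version_info
    current = tuple(version_info[:2])
    if current < minimum:
        # minimum + current is a 4-tuple: (min_major, min_minor, major, minor)
        warnings.warn(
            "Python %d.%d or newer is recommended; found %d.%d"
            % (minimum + current)
        )
        return False
    return True
```

Calling `python_is_supported()` at startup would warn users on older interpreters; whether to exit instead of warning is a policy choice left open here.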

anderspitman commented 6 years ago

@rachelss @BobLiterman just wanted to give an update on this. I've been steadily working on the port, and so far I've basically implemented alignContigs, identifyFixedSites, and outputAlignment. Should be on track to pretty much wrap up sites functionality fairly soon. I'm also adding high level tests as I go to make sure each command is basically working the same as bin/sisrs (based on the excellent small dataset @BobLiterman provided). I decided to stick with Python 2 for now to avoid changing the current scripts too much. Also the library I'm using for argument parsing (Click) doesn't seem to like python 3 yet. I'm actually thinking about changing away from Click though. Let me know if you have any questions or requests.
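If the port does move away from Click, one option is the standard library's argparse with subcommands. The sketch below uses the command names mentioned above; the `sisrs-python` program name and the `--processors` option are illustrative assumptions, not the port's actual interface.

```python
import argparse

# Subcommand names are the ones mentioned in this thread.
COMMANDS = ["alignContigs", "identifyFixedSites", "outputAlignment"]


def build_parser():
    """Build a parser with one subcommand per pipeline step."""
    parser = argparse.ArgumentParser(prog="sisrs-python")
    subparsers = parser.add_subparsers(dest="command")
    for name in COMMANDS:
        sub = subparsers.add_parser(name)
        # Hypothetical option, for illustration only.
        sub.add_argument("--processors", type=int, default=1)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["alignContigs", "--processors", "4"])
    print(args.command, args.processors)
```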

anderspitman commented 6 years ago

Oh I've also created a Dockerfile for development and running the tests. This solves the CI issues I was working with @BobLiterman on a while back (see #37). It also should make it much easier for users to get SISRS up and running with all the dependencies. All they need is docker installed. You can see all my changes on this branch: https://github.com/anderspitman/SISRS/tree/python-port

BobLiterman commented 6 years ago

I have been tinkering with some memory reduction adjustments (mainly to getPrunedDict and outputAlignment) and the Python port will really help streamline things.

Don't worry about incorporating these into port v1, as they are easy to plug in down the road, I just wanted to show you what was new on our side. https://github.com/BobLiterman/SISRS/tree/MemorySaver?files=1

anderspitman commented 6 years ago

@rachelss @BobLiterman I've completed porting of alignContigs, identifyFixedSites, outputAlignment, and changeMissing. The code isn't super clean (and still shells out to unix tools a lot more than it needs to), but I do have integration-level tests in place so refactoring shouldn't be too dangerous.

At this point I think I'm about ready to officially merge in the python port in some capacity, and it's probably time to start discussing what route forward we want to take. A few thoughts I've had:

  1. We could simply merge in my branch and have the bash/python versions live side-by-side (currently I have it set up so pip installs both sisrs and sisrs-python scripts). My concern with this though is that if a user runs into a missing feature in the python version, they'll simply switch to the bash version. Or just never use the python version in the first place.

  2. One alternative would be to put the python code in its own repo. We could call this py-sisrs or something like that. Bumping it to SISRS2 might be a better option, and would probably help with user adoption.

  3. In order to really get this port off the ground, we're going to need at least one new killer feature. No one (including you guys, I would guess) is going to use this thing if it doesn't bring anything new to the table, because right now the only real advantages have to do with an improved development experience and more maintainable code. Plus there are still going to be a lot of missing features and other warts in the short term. So can you guys think of anything I could implement that would convince you to start using the python port today, in spite of the drawbacks? Maybe running on multiple nodes?

My goal at this point is that any new features get added to the python port, rather than the bash script. So really my question is what do I need to do to get the port to a point where that's feasible in your eyes?

BobLiterman commented 6 years ago

What about something like a 'scheduler mode' where if the user was running SISRS on a multi-node cluster with a scheduler (Torque, SLURM, etc), the program could auto-generate scripts to be submitted (based on a template script supplied)? Then, steps like Bowtie could be run in parallel through automated script submission? Just spitballing things I've thought of.

anderspitman commented 6 years ago

Hey @BobLiterman, I've been reading up on distributed computing, since that does seem like an obvious big feature to add. For your proposed scheduler mode, would we need to implement compatibility with multiple schedulers, or would SLURM be sufficient? Also, SISRS uses pretty large data files. How is the data normally distributed for computations like this?

BobLiterman commented 6 years ago

Theoretically, one could provide a dummy script for whatever scheduler they have, and the SISRS script could use that to generate multiple HPC scripts, independent of scheduler. User supplies dummy script and command [sbatch, qsub, etc] to submit scripts as an argument.
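A minimal sketch of that idea, assuming the user's dummy script marks the variable parts with `@JOB_NAME@` and `@COMMAND@` placeholders (a made-up convention, chosen so it cannot collide with shell `$VARIABLES`). The function names and the example commands are hypothetical as well.

```python
def render_job_script(template_text, job_name, command):
    """Fill in one copy of the user-supplied template for a single job."""
    return (
        template_text
        .replace("@JOB_NAME@", job_name)
        .replace("@COMMAND@", command)
    )


def submit_argv(submit_command, script_path):
    """Build the argument list to submit one script (e.g. sbatch or qsub)."""
    return [submit_command, script_path]


if __name__ == "__main__":
    # Example template in SLURM style; any scheduler's header would work.
    template = "#!/bin/bash\n#SBATCH --job-name=@JOB_NAME@\n@COMMAND@\n"
    # Placeholder per-species commands, not real SISRS steps.
    jobs = {"speciesA": "echo align speciesA", "speciesB": "echo align speciesB"}
    for name, command in jobs.items():
        print(render_job_script(template, name, command))
        print(submit_argv("sbatch", name + ".sh"))
```

Because the scheduler-specific header lives entirely in the user's template and the submit command is just an argument, the generator itself stays independent of the scheduler.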

BobLiterman commented 6 years ago

In terms of data, we may be able to generate some guidelines based on the data to estimate final data sizes?

anderspitman commented 6 years ago

That scheduler script conversion would be slick, but likely a lot of work. You're essentially transpiling between multiple languages. Getting something working would probably be pretty straightforward, but you could get killed by the corner cases. Not saying it isn't worth doing.

For data, my question is more general. If I do a run of SISRS on my local machine, all the input data is there on my hard drive. If I'm trying to distribute that work across nodes, some or all of that data is going to have to be made available to those nodes. How does that usually work? I have very little cluster programming exposure.

BobLiterman commented 6 years ago

A) Scheduler scripts just make system calls, so it really wouldn't be hard to implement. I actually do it myself manually now. I have a script generator I can share to give you a sense of it.

B) Data on a cluster is also stored such that it's accessible by all nodes typically. No worries there.

anderspitman commented 6 years ago

A) ok maybe I'm misunderstanding the nature of the problem. I'd like to take a look at that generator for sure

B) excellent

BobLiterman commented 6 years ago

Emailed. Couldn't attach here from phone

anderspitman commented 6 years ago

Thanks!

anderspitman commented 6 years ago

@BobLiterman interacting with job schedulers seems doable, but I'm concerned it might not be the biggest bang for the buck. The risk we run is the entire python port never being used. If SISRS is going to go away in the not-too-distant future, then this isn't too big of a deal. But I think we have bigger aspirations for the project, and I really think SISRS will be much more maintainable and extendable if it's written in Python. What I'm trying to do right now is get the port over that last hump into viability. But since I don't use SISRS on a daily basis I can't say what the best way to do that would be. Do you have any other ideas? @rachelss any input?

anderspitman commented 6 years ago

Again, I think the short term goal should be to get to a point where when you reach for SISRS, you reach for the Python version instead of the old version. What would it take to get to that point?

rachelss commented 6 years ago

I think we need to find a time to pause work with the bash version, make the jump to python, and start working with it and fixing bugs. @BobLiterman is the only one using it regularly so it's up to him when he can stop. I'd say sooner.

rachelss commented 6 years ago

@BobLiterman when do you want to make the jump to the all-python version and start troubleshooting?

BobLiterman commented 6 years ago

Functionally, I'm out of SISRS right now as I'm onto the phylogenetics analysis, so I have no immediate plans to change base-SISRS.

If everyone is on board, I can start to incorporate the new features (streaming pileups, etc) into Anders' port so we can get the new all-Python official version.

anderspitman commented 6 years ago

That sounds good to me, and I'm happy to help with the process. Biggest question in my mind is whether we merge the python code here or start a new repo and call it SISRS2 or something. If we merge it here we really should get rid of the bin/sisrs bash script, but that doesn't make sense to me since there are still a lot of missing features (e.g. loci), even if they aren't used as much.

reedacartwright commented 6 years ago

I think you should merge the python code here so users will find it from the paper.


rachelss commented 6 years ago

Let's mark the bash version as a release. That will make it easy to checkout if needed. We could also put it on its own dead-end branch.

BobLiterman commented 6 years ago

Alright folks. Here's the update.

1) I created two SISRS releases of the Bash version (one pre- and one post-memory-streamlining implementation: 1.6.1, 1.6.2).

2) I created a new branch of Rachel's SISRS (Python_Port) to bring in Anders' Python port, which will become SISRS 2.0 once everything is troubleshot and appears to be in working order.

Best, Bob

anderspitman commented 6 years ago

@rachelss I'm assuming the assembler functionality is still desirable to have built in to SISRS? That's one of the biggest features still missing. I'm currently starting work on porting that over with velvet as the default, but with a plugin system to make it easy for other developers to add additional assemblers in the future.

@BobLiterman do you by chance have a pre-assembled version of the test data I could use to test the assemblers as I tackle this?
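One lightweight way to structure the assembler plugin system mentioned above is a name-to-class registry. This is a sketch only: every name in it is hypothetical, and the command returned is a placeholder rather than a real velvet invocation.

```python
ASSEMBLERS = {}


def register_assembler(name):
    """Class decorator that records an assembler plugin under a name."""
    def decorator(cls):
        ASSEMBLERS[name] = cls
        return cls
    return decorator


@register_assembler("velvet")
class VelvetAssembler:
    def command(self, reads_dir, out_dir):
        # Placeholder argument list; the real velvet call is not shown here.
        return ["echo", "assemble", reads_dir, out_dir]


def get_assembler(name="velvet"):
    """Look up a registered assembler by name, defaulting to velvet."""
    try:
        return ASSEMBLERS[name]()
    except KeyError:
        raise ValueError("unknown assembler: %s" % name)
```

Adding another assembler would then mean writing one class with the same `command` method and decorating it with a new name, without touching the code that runs the pipeline.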

BobLiterman commented 6 years ago

The test data is already assembled as 'premade' (contigs.fa in premadeoutput).

The data in the species folder could be used to assemble a de novo genome, but that process (to my knowledge) is NOT deterministic, especially if using the read subsampling script.

The assembly step is a great addition to the pipeline, but it adds a lot of overhead with respect to resources/run time. That's the whole reason I allowed for the side-stepping of it through the premade option. But from the user's perspective, the ability to go from reads to alignment in one step is attractive.

Also, it may be beneficial to replace the read subsampling module with a series of BBMap calls (reformat.sh). When I was running SISRS on whole-genome data, the read subsampling step would take MUCH longer than a reformat.sh call (which can subset reads from a .fq OR .fq.gz). Given a user-supplied genome size argument and a number of species, figuring out the number of bases required for ~10X is straightforward, and this would allow gzipped files to be used as opposed to just uncompressed FASTQ files.
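The arithmetic could look like the sketch below. It assumes the ~10X target applies to the pooled reads and is split evenly across species; if the target is meant per species instead, the division would be dropped. The function name and the example numbers are hypothetical.

```python
def bases_per_species(genome_size, num_species, target_coverage=10):
    """Bases to sample from each species so the pooled reads reach the target depth.

    Assumes the coverage target applies to the pooled data and is split evenly.
    """
    if genome_size <= 0 or num_species <= 0:
        raise ValueError("genome_size and num_species must be positive")
    return (target_coverage * genome_size) // num_species


if __name__ == "__main__":
    # Hypothetical example: a 1 Mb genome and 8 species.
    print(bases_per_species(1000000, 8))  # 10 * 1,000,000 / 8 = 1250000
```

The resulting number is what would be handed to the subsetting tool as its per-species base target.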

BobLiterman commented 6 years ago

@anderspitman @reedacartwright @rachelss

What do you think about setting up a virtual meeting to chat about current progress of the port and future directions? With lots of parallel collaboration happening now, which is a first for this former lone-wolf, it would help me out a lot to lay down some concrete plans about what needs to be done and by whom.

Monday perhaps?

Bob

reedacartwright commented 6 years ago

Next week is not a good week for me. I have more openings the week after.

anderspitman commented 6 years ago

I'm pretty flexible. Tue, Thu, Fri are best in general.

rachelss commented 6 years ago

How about the 3rd? 3:00 ET = 12:00 AZ?

reedacartwright commented 6 years ago

Can we push it back to 1:00pm AZ? My class ends at noon that day, and I would like to eat before I participate in a potentially long conference call.


rachelss commented 6 years ago

Works for me 4:00 ET

BobLiterman commented 6 years ago

Unfortunately I have to pick my son up from daycare by 5pm ET, so if it's possible to split the difference and call it 3:30 ET/12:30 AZ, I could at least be in for the beginning.

reedacartwright commented 6 years ago

Monday is probably not going to work for me after all. I can do something in mid april after my tenure packet is submitted.

BobLiterman commented 6 years ago

WARNING: Given the same input data, the bash and Python versions have different final outputs.

RAL_Memory

11724 total variable sites (alignment.nex)
5111 variable sites are singletons
6428 total biallelic sites excluding singletons (alignment_bi.nex)
6613 total variable sites excluding singletons (alignment_pi.nex)
With 6 taxa allowed to be missing, 11724 sites from alignment.nex (6 allowed missing) are reduced to 11724 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6613 sites from alignment_pi.nex (6 allowed missing) are reduced to 6613 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6428 sites from alignment_bi.nex (6 allowed missing) are reduced to 6428 sites (0 sites or 0.00% lost)

RAL_SE

11724 total variable sites (alignment.nex)
5111 variable sites are singletons
6428 total biallelic sites excluding singletons (alignment_bi.nex)
6613 total variable sites excluding singletons (alignment_pi.nex)
With 6 taxa allowed to be missing, 11724 sites from alignment.nex (6 allowed missing) are reduced to 11724 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6613 sites from alignment_pi.nex (6 allowed missing) are reduced to 6613 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6428 sites from alignment_bi.nex (6 allowed missing) are reduced to 6428 sites (0 sites or 0.00% lost)

Current_Python

11728 total variable sites (alignment.nex)
5153 variable sites are singletons
6391 total biallelic sites excluding singletons (alignment_bi.nex)
6575 total variable sites excluding singletons (alignment_pi.nex)
With 6 taxa allowed to be missing, 11728 sites from alignment.nex (6 allowed missing) are reduced to 11728 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6575 sites from alignment_pi.nex (6 allowed missing) are reduced to 6575 sites (0 sites or 0.00% lost)
With 6 taxa allowed to be missing, 6391 sites from alignment_bi.nex (6 allowed missing) are reduced to 6391 sites (0 sites or 0.00% lost)

ADDITION: All scripts were run twice and yielded the same response, so it's not randomness somewhere in the pipeline. It's a deterministic-type mismatch.
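To make the size of the mismatch concrete, the counts from the three summaries above can be compared directly. The sketch only restates the reported numbers; the dictionary keys and the `delta` helper are names made up for the example.

```python
# Site counts copied from the three summaries above.
counts = {
    "RAL_Memory": {"total": 11724, "singletons": 5111, "biallelic": 6428, "pi": 6613},
    "RAL_SE": {"total": 11724, "singletons": 5111, "biallelic": 6428, "pi": 6613},
    "Current_Python": {"total": 11728, "singletons": 5153, "biallelic": 6391, "pi": 6575},
}


def delta(baseline, other):
    """Per-category difference in site counts (other minus baseline)."""
    return {key: counts[other][key] - counts[baseline][key] for key in counts[baseline]}


# Each run is internally consistent: total minus singletons equals the
# count reported for alignment_pi.nex.
for run, c in counts.items():
    assert c["total"] - c["singletons"] == c["pi"], run

print(delta("RAL_SE", "Current_Python"))
```

The two bash runs agree exactly, while the Python run reports 4 more variable sites, 42 more singletons, 37 fewer biallelic sites, and 38 fewer sites in alignment_pi.nex. Since each run is internally consistent, the discrepancy appears to arise upstream of the summary step rather than in the counting itself.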

anderspitman commented 6 years ago

Weird. What data are you using? I'll look into it