rachelss / SISRS

Site Identification from Short Read Sequences.

Port to Python #36

Status: Open. anderspitman opened this issue 7 years ago

anderspitman commented 7 years ago

@reedacartwright and I wanted to start a conversation about possibly porting the bash portions of SISRS to Python. We both feel this would make it easier to maintain in the long run, but the work to do it may well be nontrivial. This is something I could possibly do myself, or at least come up with a process whereby we all do it incrementally. Initial thoughts @rachelss?

BobLiterman commented 6 years ago

It's the test data you used, run on the same system, so it can't be a version issue (save for Py2 vs. Py3).

anderspitman commented 6 years ago

Actually it's probably just Py2 vs Py3, because legacy SISRS has to use Py2 and sisrs-python has to use Py3, and there are definitely going to be differences. Theoretically the differences should not have a semantic impact, since they only involve sorting. However, at this early stage we should really confirm I'm not missing something. Can you try running legacy SISRS again but using the bash file from PR #57?

BobLiterman commented 6 years ago

I will give it a crack tomorrow. I'm adding the streaming bits now, so it will be interesting to see if the effect goes away...

BobLiterman commented 6 years ago

Relatively easy fix. There was an issue with how bases were being assigned to the species-specific genomes, and it came down to Py2/Py3 differences in how ties were handled (5 As vs. 5 Ts, for instance). I have updated the code in my Python and Bash versions; the pytest scripts all pass and the results are identical downstream. This minor hiccup underscores the importance of using mappers that can treat degenerate bases in the reference as a good hit. I have made a note to look into other mappers on an issue I self-assigned.
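For illustration only (not the actual specific_genome.py code), here is a minimal sketch of one way tie handling can be made deterministic across Python versions, assuming IUPAC ambiguity codes are acceptable in the species-specific reference:

```python
from collections import Counter

# IUPAC ambiguity codes for two-way ties (assumption: degenerate bases are
# acceptable in the species-specific reference)
IUPAC = {frozenset("AT"): "W", frozenset("AG"): "R", frozenset("AC"): "M",
         frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("GT"): "K"}

def call_base(bases):
    """Pick the consensus base for one site, breaking ties the same way under
    Py2 and Py3 (e.g. 5 As vs. 5 Ts always yields 'W', never a
    version-dependent choice)."""
    counts = Counter(bases)
    top = max(counts.values())
    winners = sorted(base for base, n in counts.items() if n == top)
    if len(winners) == 1:
        return winners[0]
    return IUPAC.get(frozenset(winners), "N")  # N for three-way-or-worse ties

print(call_base("AAAAATTTTT"))  # -> W
```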

BobLiterman commented 6 years ago

Hello all,

I have finished up the first round of code swapping into Anders' new Python 3 code.

https://github.com/BobLiterman/SISRS/tree/RAL_Python

Major things I did:

1) Implemented streaming-type pipelines, which don't require multiple large pickled files to be created or loaded (see the sketch below).
2) Simplified code by removing extraneous bits (e.g. reference genome data in outputAlignment).
3) Added some nice print messages for the user.
4) Allowed the use of zipped sequence files (.fastq/.fq/.fastq.gz/.fq.gz).
5) Deleted paired-end identification (SISRS is completely single-ended for the moment).
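
As a rough sketch of the streaming idea in item 1 (the per-site file format here is hypothetical, not the real LocList layout), records can be consumed one at a time with generators instead of pickling whole dictionaries:

```python
def sites(loc_list_path):
    """Yield one site at a time from a hypothetical tab-separated per-site
    file instead of loading a pickled dict of every site into memory."""
    with open(loc_list_path) as handle:
        for line in handle:
            contig, pos, base = line.rstrip("\n").split("\t")
            yield contig, int(pos), base

def count_informative_sites(loc_list_path):
    # Downstream steps consume the generator directly; nothing large is
    # pickled or held in memory.
    return sum(1 for _, _, base in sites(loc_list_path) if base != "N")
```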

This current version (save for some intentional folder structure changes) has identical output to my Bash version.

I tried to add comments where I could, so hopefully things are more or less clear. When the assemblers are ready to test, that will take more effort on the multi-node side of things.

Re: Docker:

I had good success loading the program through Docker and running it. For me, though, the recommended developer instructions seemed problematic, with Git and Docker co-tracking things and working in a sort of Inception-style, multi-shell, mounted framework.

I wound up just making a single alias on my system to cd into the SISRS directory, run setup.py, and cd back to where I was. That way the code was updated for instant running without lots of extra steps. But that's me.

anderspitman commented 6 years ago

This all sounds great! A couple questions:

With the A/T tie thing, I ran into the same issue. Was it in specific_genome.py by chance, or somewhere else?

For the developer instructions, I agree that it is conceptually complicated. Can you give some more detail about the steps you go through to get up and running? It might be worth sacrificing some flexibility in order to simplify the process. Then, if developers who are more familiar with Docker want to, they can always do a more complex setup like the one I currently have outlined.

anderspitman commented 6 years ago

Also, if you've changed around the directory layout for the output, we'll need to carefully consider how to update the test data to ensure we haven't introduced any regressions.

BobLiterman commented 6 years ago

The A/T issue was in specific_genome.py, yes.

All that I did to run it was:

1) Clone
2) setup.py install
3) Run sisrs-python <> until I broke something. Edit code as needed
4) Run pySISRS (cd to SISRS, run setup.py, cd back to where I was)
5) Repeat

In terms of data structure, there were a few material changes:

1) Two new files are created in the genome folder: contigs_seqLength.tsv and contigs_LocList (or something like that); a sketch of generating the length table follows this list.
2) No pickles are generated; they are replaced with LocList files for each species.
3) All the alignments just go into the main output folder (alignment.nex, alignment_pi.nex, alignment_bi.nex). Previously I had subfolders for _bi and _pi, but now the output files just have slightly more descriptive names.
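
For illustration only, a minimal sketch of producing a contig-length table along the lines of contigs_seqLength.tsv; the real column layout may differ:

```python
def write_contig_lengths(fasta_path, tsv_path):
    """Write one 'contig<TAB>length' line per record of the composite genome
    FASTA (assumed layout; the real contigs_seqLength.tsv may differ)."""
    lengths = {}
    name = None
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.rstrip("\n")
            if line.startswith(">"):
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    with open(tsv_path, "w") as tsv:
        for contig, length in lengths.items():
            tsv.write("{0}\t{1}\n".format(contig, length))
```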

The alignment files themselves are also different, as I removed the reference genome options entirely. I am going to rescue that behavior in a downstream addition.

I suspect the test data would require at least some alteration with all of these changes.

anderspitman commented 6 years ago

Ok it was probably the same place that gave me issues then. If you run into future Py2/Py3 problems you can take a look at PR #57 if you haven't already. Those were all the ones I found while moving to Py3.

I'm assuming you did all that from within the Docker container, right? Are you editing the code from within Docker as well? The only difference I see with what you're doing is you're not using pip install -e. I highly recommend doing it that way. It's a very common way to develop Python code and saves you having to explicitly run setup.py every time you make a change. You can read more about how it works here:

https://pip.pypa.io/en/latest/reference/pip_install/#editable-installs

https://stackoverflow.com/questions/30306099/pip-install-editable-vs-python-setup-py-develop

We'll definitely need to update the tests and test data to reflect those changes. Shouldn't be too big of an issue, we just need to be really careful.

BobLiterman commented 6 years ago

For the record, I didn't end up doing much in the Docker. I was just running sisrs-python on my system, as I have all the pre-reqs.

anderspitman commented 6 years ago

That makes sense. Maybe I should update the README to mention the Docker container as an option for getting the pre-reqs but remove all the directory-mapping stuff?

BobLiterman commented 6 years ago

Yeah. The Docker setup was a snap with those. I'd leave it at that.

anderspitman commented 6 years ago

Also I just thought of another question. If you're merging your code into the Python_Port branch, how is it that you ran into the same A/T problem that I did in specific_genome.py? The version of the file on that branch already accounts for the ordering problem. Was it just that you had to update your bash sisrs in order to get the same results?

BobLiterman commented 6 years ago

The sort that you had in there was actually unsorting a list, I think: sorted() was called on a list that was already sorted by read depth, which then re-sorted it by base. It showed up in the subsequent Bowtie runs (the Python version had fewer reads mapping than the Bash version).
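
A small demo of the problem (the tuples here are made up, not SISRS's data structures): re-sorting a depth-ordered list by base scrambles the depth ordering, whereas one sort with a composite key keeps both and breaks ties deterministically:

```python
# (base, read depth) tuples for one site, already ordered by depth
calls = [("T", 9), ("A", 9), ("G", 4), ("C", 1)]

# Re-sorting by base alone throws away the depth ordering:
print(sorted(calls))  # [('A', 9), ('C', 1), ('G', 4), ('T', 9)]

# A single sort with a composite key keeps depth first and breaks ties
# deterministically (and identically under Py2 and Py3):
print(sorted(calls, key=lambda bc: (-bc[1], bc[0])))
# [('A', 9), ('T', 9), ('G', 4), ('C', 1)]
```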

While munging around in there, I actually found a test case that we hadn't considered biologically and it led me to reshape that step to ensure optimal mapping.

anderspitman commented 6 years ago

@BobLiterman I'm just realizing that when I read your last message before, I didn't fully process it. If my code was unsorting something that was supposed to be sorted, that's a pretty large cause for concern. It means our current test data is invalid, because it was generated using bash SISRS from PR #57.

BobLiterman commented 6 years ago

There is one niggling issue that I'd like to address prior to generating new test data, and that is allowing both compressed and uncompressed read files. Most people do not work on raw FASTQ files, but rather gzipped files. The fix was easy up top, when detecting directories with data, but there is another FASTQ-only filter in the Bowtie script that I want to replace.

That shouldn't take me very long to splice into the code, and from that point, I feel like we can generate new test data from my newest iteration of SISRS (post-assembly).
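
A minimal sketch of the kind of check involved, assuming extension-based detection (the actual splice into the Bowtie script may look different):

```python
import gzip

READ_EXTENSIONS = (".fastq", ".fq", ".fastq.gz", ".fq.gz")

def is_read_file(filename):
    """True for any of the accepted read-file extensions."""
    return filename.endswith(READ_EXTENSIONS)

def open_reads(path):
    """Open a read file transparently, gzipped or not, in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)
```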

anderspitman commented 6 years ago

Sounds good. As long as you're confident the Python version is working, I'm fine with that plan. I think maybe I've been a bit paranoid wanting to make sure my port was 100% matching the output of the bash version, but that's mostly because I lack the background to know whether it's doing what it's supposed to.

anderspitman commented 6 years ago

@BobLiterman what's your status? I recently completed a (mostly untested) implementation of the assembler plugin system, including plugins for velvet and minia. I think that pretty much wraps up the major functionality I wanted to get into place for the initial Python release.
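
For context, a bare-bones sketch of what an assembler plugin registry could look like; the class names and flags here are hypothetical, not the actual SISRS plugin API:

```python
import subprocess

class Assembler:
    """Hypothetical plugin base class: each assembler only has to say how to
    build its command line."""
    name = None

    def command(self, read_files, out_dir):
        raise NotImplementedError

    def run(self, read_files, out_dir):
        subprocess.check_call(self.command(read_files, out_dir))

class MiniaAssembler(Assembler):
    name = "minia"

    def command(self, read_files, out_dir):
        # Flags are illustrative only; check the minia documentation for the
        # real interface.
        return ["minia", "-in", ",".join(read_files), "-out", out_dir]

# A registry lets the CLI look plugins up by name, e.g. ASSEMBLERS["minia"]().
ASSEMBLERS = {cls.name: cls for cls in (MiniaAssembler,)}
```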

As of right now my branch is still passing the tests, but it won't if I merge in the latest Python_Port branch, which includes your RAL additions. Do we want to go ahead with the merges, add your gzip code, then generate fresh test data? As I said before, if you're comfortable verifying that everything is still working, that's certainly an easy option for me. I mean, even when I was following strict practices before, I let invalid test data in with the sort stuff, due to my lack of domain knowledge.

We're getting really close!

BobLiterman commented 6 years ago

I feel good about my changes as they stand now, and will generate new benchmark runs on the same test dataset with my Python version today.

BobLiterman commented 6 years ago

Hot off the presses, @anderspitman.

Generated from my fork, commit 696df201c79e186673f1f38eea9ae7a66cd839ec

https://github.com/BobLiterman/SISRS/tree/RAL_Python/Test_Data_04192018

I have a log for each step, and a start-to-finish command list, like the previous run.

anderspitman commented 6 years ago

Awesome! I will update the tests to reflect that data, then merge our branches and get the final PR in.

anderspitman commented 6 years ago

Hey @BobLiterman, sorry for the late feedback. School has been brutal this week. I'm working on resolving merge conflicts now. It looks like the only real one is in sisrs.py. I see you made significant changes to the DirectoryLists class, specifically removing the paired and unpaired lists in order to handle the gzip stuff. I'm using those lists in some of the commands I've implemented since you merged my branch. Would it break anything for me to add them back in?
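
Not the actual DirectoryLists code, but a rough sketch of how paired and unpaired read lists might be rebuilt, assuming a hypothetical _R1/_R2 naming convention:

```python
import os

READ_EXTENSIONS = (".fastq", ".fq", ".fastq.gz", ".fq.gz")

def split_paired_unpaired(read_dir):
    """Split read files into paired and unpaired lists, assuming a
    hypothetical _R1/_R2 naming convention."""
    reads = sorted(f for f in os.listdir(read_dir) if f.endswith(READ_EXTENSIONS))
    paired = [f for f in reads if "_R1" in f or "_R2" in f]
    unpaired = [f for f in reads if f not in paired]
    return paired, unpaired
```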

BobLiterman commented 6 years ago

Nope. Not really. Have at it!

anderspitman commented 6 years ago

@BobLiterman a few issues/questions:

  1. Did you use the Docker container when you generated the test data by chance?
  2. I notice that the new test data has broken symlinks in it. If it's ok with you I can just use your branch and the commands you provided to regenerate it from the RawTestData directory. This might be best anyway since I can use the docker image which will better match the Travis test environment.
  3. Do you mind if I roll your branch back a couple of commits to remove the data from SISRS itself? I think it would be best to keep it in a separate repo since it's so huge and contains a lot of binary files, which are murder on git.

BobLiterman commented 6 years ago

  1. No
  2. Yes
  3. Yes, the test data shouldn't be part of base SISRS. Just take care not to remove commits that affect the site selection changes.
anderspitman commented 6 years ago

Sounds good!