anderspitman opened 7 years ago
It's the same test data you used, run on the same system, so it can't be a version issue (save for Py2 vs. Py3).
Actually it's probably just Py2 vs Py3, because legacy SISRS has to use Py2 and sisrs-python has to use Py3, and there are definitely going to be differences. Theoretically the differences should not have a semantic impact, since they only involve sorting. However, at this early stage we should really confirm I'm not missing something. Can you try running legacy SISRS again but using the bash file from PR #57?
I will give it a crack tomorrow. I'm adding the streaming bits now, so it will be interesting to see if the effect goes away...
Relatively easy fix. There was an issue with how bases were being assigned to the species-specific genomes, and it came down to Py2/Py3 differences in how ties were handled (5 A's vs. 5 T's, for instance). I have updated the code in my Python and Bash versions, the pytest scripts all pass, and results are identical downstream. This minor hiccup underscores the importance of using mappers that can treat degenerate bases in the reference as a good hit. I have made a note to look into other mappers on an issue I self-assigned.
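To illustrate the tie problem: `Counter` ordering for equal counts is not guaranteed to match across Python versions, so the tie has to be broken explicitly. This is only a sketch of the idea, not the actual specific_genome.py code; `call_base` is a hypothetical name:

```python
from collections import Counter

def call_base(bases):
    # Hypothetical helper illustrating deterministic tie-breaking;
    # not the actual SISRS function.
    counts = Counter(bases)
    top = max(counts.values())
    # Tied counts may come back in different orders on Py2 vs. Py3,
    # so resolve ties explicitly (alphabetically here).
    return min(base for base, n in counts.items() if n == top)

# Five A's and five T's now always call 'A', on any Python version.
call_base("AATATTATAT")
```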
Hello all,
I have finished up the first round of code swapping into Anders' new Python 3 code.
https://github.com/BobLiterman/SISRS/tree/RAL_Python
Major things I did:
1) Implemented streaming-type pipelines, which don't require multiple large pickled files to be created or loaded.
2) Simplified code by removing extraneous bits (e.g. reference genome data in outputAlignment).
3) Added some nice print messages for the user.
4) Allowed the use of zipped sequence files (.fastq/.fq/.fastq.gz/.fq.gz).
5) Deleted paired-end identification (SISRS is completely single-ended for the moment).
This current version (save for some intentional folder structure changes) has identical output to my Bash version.
I tried to comment around where I could, so hopefully things are more or less clear. When the assemblers are ready to test, that will take more effort on the multi-node side of things.
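Accepting zipped sequence files usually comes down to a small dispatch on extension. A minimal sketch, assuming a helper named `open_reads` (not the actual SISRS code):

```python
import gzip

def open_reads(path):
    # Hypothetical helper: open a FASTQ file in text mode,
    # transparently decompressing gzip input.
    if path.endswith((".fastq.gz", ".fq.gz")):
        return gzip.open(path, "rt")
    if path.endswith((".fastq", ".fq")):
        return open(path, "r")
    raise ValueError("Unrecognized read file: %s" % path)
```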
Re: Docker:
I had good success loading the program through Docker and running it. For me, though, the recommended developer instructions seemed problematic, with Git and Docker co-tracking things and working in a sort of Inception-style, multi-shell, mounted framework.
I wound up just making a single alias on my system to cd into the SISRS directory, run setup.py, and cd back to where I was. That way the code is updated and instantly runnable without lots of extra steps. But that's me.
This all sounds great! A couple questions:
With the A/T tie thing, I ran into the same issue. Was it in specific_genome.py by chance, or somewhere else?
For the developer instructions, I agree that it is conceptually complicated. Can you give some more detail about the steps you go through to get up and running? It might be worth sacrificing some flexibility in order to simplify the process. Then if developers who are more familiar with Docker want to they can always do a more complex setup like the one I currently have outlined.
Also, if you've changed around the directory layout for the output, we'll need to carefully consider how to update the test data to ensure we haven't introduced any regressions.
The A/T issue was in specific_genome.py, yes.
All that I did to run it was:
1) Clone.
2) setup.py install.
3) Run sisrs-python <> until I broke something; edit code as needed.
4) Run pySISRS (cd to SISRS, run setup.py, cd back to where I was).
5) Repeat.
In terms of data structure, there were a few material changes:
1) Two new files are created in the genome folder: contigs_seqLength.tsv and contigs_LocList (or something like that).
2) No pickles are generated; they are replaced with LocList files for each species.
3) All the alignments just go into the main output folder (alignment.nex, alignment_pi.nex, alignment_bi.nex). Previously I had subfolders for _bi and _pi, but I just made it so the output files have slightly more descriptive names.
The alignment files themselves are also different, as I removed the reference genome options entirely. I am going to rescue that behavior in a downstream addition.
I suspect the test data would require at least some alteration with all of these changes.
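For what it's worth, a contigs_seqLength.tsv-style file can be derived from a FASTA in a few lines. The format here (contig name, tab, length) is a guess; the real file may differ:

```python
def write_seq_lengths(fasta_lines, out):
    # Sketch: emit "<contig>\t<length>" per FASTA record.
    # The actual contigs_seqLength.tsv format is assumed, not confirmed.
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                out.write("%s\t%d\n" % (name, length))
            name, length = line[1:].split()[0], 0
        elif line:
            length += len(line)
    if name is not None:
        out.write("%s\t%d\n" % (name, length))
```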
Ok it was probably the same place that gave me issues then. If you run into future Py2/Py3 problems you can take a look at PR #57 if you haven't already. Those were all the ones I found while moving to Py3.
I'm assuming you did all that from within the Docker container, right? Are you editing the code from within Docker as well? The only difference I see with what you're doing is that you're not using pip install -e. I highly recommend doing it that way. It's a very common way to develop Python code and saves you having to explicitly run setup.py every time you make a change. You can read more about how it works here:
https://pip.pypa.io/en/latest/reference/pip_install/#editable-installs
https://stackoverflow.com/questions/30306099/pip-install-editable-vs-python-setup-py-develop
We'll definitely need to update the tests and test data to reflect those changes. Shouldn't be too big of an issue, we just need to be really careful.
For the record, I didn't end up doing much in the Docker. I was just running sisrs-python on my system, as I have all the pre-reqs.
That makes sense. Maybe I update the README to mention the Docker container as an option for getting the pre-reqs but remove all the directory mapping stuff?
Yeah. The Docker setup was a snap with those. I'd leave it at that.
Also I just thought of another question. If you're merging your code into the Python_Port branch, how is it that you ran into the same A/T problem that I did in specific_genome.py? The version of the file on that branch already accounts for the ordering problem. Was it just that you had to update your bash sisrs in order to get the same results?
The sort that you had in there was actually unsorting a list, I think. sorted() was called on a list that was already sorted by read-count depth, which was then being re-sorted by base. It showed up in the subsequent Bowtie runs (the Python version had fewer reads mapping than Bash).
While munging around in there, I actually found a test case that we hadn't considered biologically and it led me to reshape that step to ensure optimal mapping.
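To make the failure mode concrete: assuming the list held (base, count) pairs already ordered by depth, a bare `sorted()` call compares the tuples element-wise and quietly re-orders them by base:

```python
# Pairs already ordered by read-count depth, highest first.
by_depth = [("T", 9), ("A", 4), ("C", 1)]

# Tuples compare on their first element, so a bare sorted() call
# re-orders alphabetically by base and discards the depth ordering.
resorted = sorted(by_depth)  # [("A", 4), ("C", 1), ("T", 9)]

# The fix is to drop the extra sort, or to sort by count explicitly.
by_count = sorted(by_depth, key=lambda bc: bc[1], reverse=True)
```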
@BobLiterman I'm just realizing that when I read your last message before, I didn't fully process it. If my code was unsorting something that was supposed to be sorted, that's a pretty large cause for concern. It means our current test data is invalid, because it was generated using bash SISRS from PR #57.
There is one niggling issue that I'd like to address prior to generating new test data, and that is allowing both compressed and uncompressed read files. Most people do not work with raw FASTQ files, but rather gzipped files. The fix was easy up top when detecting directories with data, but there is another FASTQ-only filter in the Bowtie script that I want to replace.
That shouldn't take me very long to splice into the code, and from that point, I feel like we can generate new test data from my newest iteration of SISRS (post-assembly).
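The FASTQ-only filter could be replaced by an extension check that accepts both forms. A sketch (`find_read_files` is a hypothetical name, not the Bowtie-script code):

```python
import os

READ_EXTS = (".fastq", ".fq", ".fastq.gz", ".fq.gz")

def find_read_files(directory):
    # Hypothetical replacement for a FASTQ-only filter: accept
    # both plain and gzipped read files in a data directory.
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(READ_EXTS)
    )
```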
Sounds good. As long as you're confident the Python version is working, I'm fine with that plan. I think maybe I've been a bit paranoid wanting to make sure my port was 100% matching the output of the bash version, but that's mostly because I lack the background to know whether it's doing what it's supposed to.
@BobLiterman what's your status? I recently completed a (mostly untested) implementation of the assembler plugin system, including plugins for velvet and minia. I think that pretty much wraps up the major functionality I wanted to get into place for the initial Python release.
As of right now my branch is still passing the tests, but it won't if I merge in the latest Python_Port branch, which includes your RAL additions. Do we want to go ahead with the merges, add your gzip code, then generate fresh test data? As I said before, if you're comfortable verifying that everything is still working, that's certainly an easy option for me. I mean, even following strict practices, I let invalid test data in with the sort stuff, due to my lack of domain knowledge.
We're getting really close!
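I haven't seen the plugin code, so purely as an illustration of the shape such a registry often takes (every name below is invented except velvet and minia, and the command lines are placeholders, not real assembler invocations):

```python
ASSEMBLERS = {}

def register(name):
    # Decorator that adds an assembler class to the registry.
    def wrap(cls):
        ASSEMBLERS[name] = cls
        return cls
    return wrap

@register("velvet")
class VelvetAssembler:
    def command(self, reads, outdir):
        # Placeholder command line; real velvet flags differ.
        return ["velveth", outdir, "31"] + list(reads)

@register("minia")
class MiniaAssembler:
    def command(self, reads, outdir):
        # Placeholder command line; real minia flags differ.
        return ["minia", "-in"] + list(reads) + ["-out", outdir]

def get_assembler(name):
    if name not in ASSEMBLERS:
        raise ValueError("Unknown assembler: %s" % name)
    return ASSEMBLERS[name]()
```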
I feel good about my changes as they stand now, and will generate new benchmark runs on the same test dataset with my Python version today.
Hot off the presses, @anderspitman.
Generated from my fork, commit 696df201c79e186673f1f38eea9ae7a66cd839ec
https://github.com/BobLiterman/SISRS/tree/RAL_Python/Test_Data_04192018
I have a log for each step, and a start-to-finish command list, like the previous run.
Awesome! I will update the tests to reflect that data, then merge our branches and get the final PR in.
Hey @BobLiterman, sorry for the late feedback. School has been brutal this week. I'm working on resolving merge conflicts now. It looks like the only real one is in sisrs.py. I see you made significant changes to the DirectoryLists class, specifically removing the paired and unpaired lists in order to handle the gzip stuff. I'm using those lists in some of the commands I've implemented since you merged my branch. Would it break anything for me to add them back in?
Nope. Not really. Have at it!
@BobLiterman a few issues/questions:
Sounds good!
@reedacartwright and I wanted to start a conversation about possibly porting the bash portions of SISRS to Python. We both feel this would make it easier to maintain in the long run, but the work to do it may certainly be nontrivial. This is something I could possibly do myself or at least come up with a process whereby we all do it incrementally. Initial thoughts @rachelss?