sort code needs code review

jazdrv commented 6 years ago

concerns:

is this architecture a good approach? (ie: should it all be replaced by sqlite)
what features overlap with the way Jef's doing his code, and anticipates doing his code?
the way data is being collected and discerning what is a POS vs. NEG call, is this all being done right?
how should genotype in vcfcalls be handled?
is getVCFvariants.sh working right?
what arch changes should happen?
what code syntax changes should happen?
what OO model changes should happen?
what redux.py calls should change?
db1.py -- can ideas be merged into db.py?
what unit testing strategies can be used?
what SQL shifts need to happen?
what prevents this code from being ready for pull requesting and merging?
this arch approach relies on h5py saving out the numpy matrix and recalling it. are there issues with this approach? if so, should other data also be saved to h5py like this? could this syncing btw sqlite and numpy be a problem potentially workflow-wise?

jazdrv commented 6 years ago

also, naming conventions for vars, routines, sql tables
also, table/views/index ddl object usage - including temp tables
also, what commenting concerns are there?
global var usage, object attribute usage concerns?
panda usage concerns? (vs. numpy, sqlite, etc)
feature ideas?

jeftreece commented 6 years ago

It's going to take me a long time before I can much comment on the sort algorithms. I only have a few observations so far

There are no comments. I'm no model for great comments, but I can see the impact of having none, when trying to understand what the code is supposed to be doing
getVCFvariants should be part of the data load work package. It's probably not quite correct yet, but I don't see it as part of sort
genotype in vcfcalls, as we've discussed, is not desirable. Parsing VCF files and storing the data is one thing, and analysis is another thing. The interface for genotype is get_variant_array or thereabouts
there should not be a db1.py. Make the changes you need in db.py

jeftreece commented 6 years ago

I have gotten further with this review but still have more to go. Here are some things so far.

Approach to getting this merged with master is murky. I have previously suggested a strategy, and I'll reiterate it here. In slack, Feb 20, in #coding-database, Zak asked about the approving pull requests. I answered. I thought that was clear now, if unfamiliar. "this process is new to me, I'm afraid. so silly question ... the idea here is that I just review the code changes in the browser? [do I approve it?]" "yes. I don't think we've really formalized any process (and that's OK with me at least for now) but the idea is changes get reviewed by someone before going into master and ideally any issues identified in the review get discussed/resolved before it gets delivered to master"

Because there are changes to files like lib.py and db.py versus latest on master, I don't think I'm really reviewing the correct thing. Until it's in a pull request or diff off of master latest, I can't be very precise with my review comments.

March 13, I suggested a strategy: "What I think needs to happen is Zak prepare incremental branches for delivery to master that he can do a pull request on and we can review properly."

I recommend breaking this up into incremental deliveries. It can be done as one large delivery, but I don't recommend that because it's quite a bit more difficult to address review comments and make sure the code is all caught up. It's also harder for the reviewer. This is just basic, not specific to this project/branch. There's more than one way to accomplish this, but and here's one possible workflow:

pull from master, resolving any merge conflicts, so there's no regression from recent fixes on master and you're as current as possible
create a new branch
copy limited-scope changes into the new branch, starting perhaps with some low-level changes
do a pull request in master for the new branch; requesting review of the smaller change
handle review comments in the pull request
repeat until master has all of the necessary changes

So far, here are some review comments:

redux.py may not be the place to put all of these command arguments. Refer to earlier discussion with Harald, where I thought we all agreed to have few command-line args to the main driver and standalone utilities as needed.
dynamic sql binding must be used for queries. Otherwise, we're exposed on sql injection attacks. We may not care now since it's all single user sqlite, but this could easily come back to bite us as a security vulnerability when we move to mysql and maybe have a server somewhere with data on it. Must fix things like select ... where snpname = '%s' ..., using the proper construct db.execute('select ... where snpname=? ...', (var,))
environment variable needed at runtime needs to be fixed. Refer to prior slack discussion where you agreed (sort of "It's doable") to one env var, pointing to the source directory. In the source dir, you can ". ./redux.env" to set this
db1.py probably shouldn't exist. If you really need a different interface, I'd recommend sub-classing DB or else fixing whatever's needed in db.py (make sure to fix any place that is already using DB apis). Keep in mind - sqlite is not the end goal, so keeping this simple helps future portability to other DBM
preference to use the Trace class rather than print statements. Trace is not the end goal, likely logging is. See Issue #36 which may not be done immediately, but having a consistent logging call across the code will make the job easier in the future.
One thing I'm still working out is the extra columns added to vcfcalls. I'd much rather fix get_array_api than change the underlying low level storage. I know you said you don't understand this routine, but it's a very simple routine that walks through calls and returns values in a python data structure. These are not difficult concepts.

I'll probably have more comments later. Resolving these in an issue is not the way I'd prefer. I'd prefer to resolve review comments via a pull request.

moregubbins commented 6 years ago

Hi Jef, Zak, I think most of Jef's comments are already acknowledged, particularly #1, #3, #4 and #5, particularly since Zak has been working on this as a "proto-type", so the integration and quality of the code presumably isn't up to as high a quality as would otherwise be the case. Nevertheless, the outputs do show that this is roughly the logic principles that we need. I'd prefer to see what Zak's done as a measure that lets us prove what we've got there is viable, but I think we all know it will take some migration to get it properly integrated. My question you both is whether you think you can do this together, let's say over the next two weeks, with minimal input from me? Zak: are you able to create one or more intermediate branch points as Jef suggests? Jef: are you prepared to take charge of organising and performing this merger? I'll address Zak's e-mail separately, because it deserves a detailed response. Cheers, Iain.

  From: jeftreece <notifications@github.com>

To: jazdrv/dnaTools dnaTools@noreply.github.com Cc: moregubbins gubbins@talk21.com; Assign assign@noreply.github.com Sent: Thursday, 22 March 2018, 0:16 Subject: Re: [jazdrv/dnaTools] sort code needs code review (#19)

I have gotten further with this review but still have more to go. Here are some things so far.Approach to getting this merged with master is murky. I have previously suggested a strategy, and I'll reiterate it here. In slack, Feb 20, in #coding-database, Zak asked about the approving pull requests. I answered. I thought that was clear now, if unfamiliar. "this process is new to me, I'm afraid. so silly question ... the idea here is that I just review the code changes in the browser? [do I approve it?]" "yes. I don't think we've really formalized any process (and that's OK with me at least for now) but the idea is changes get reviewed by someone before going into master and ideally any issues identified in the review get discussed/resolved before it gets delivered to master"Because there are changes to files like lib.py and db.py versus latest on master, I don't think I'm really reviewing the correct thing. Until it's in a pull request or diff off of master latest, I can't be very precise with my review comments.March 13, I suggested a strategy: "What I think needs to happen is Zak prepare incremental branches for delivery to master that he can do a pull request on and we can review properly."I recommend breaking this up into incremental deliveries. It can be done as one large delivery, but I don't recommend that because it's quite a bit more difficult to address review comments and make sure the code is all caught up. It's also harder for the reviewer. This is just basic, not specific to this project/branch. There's more than one way to accomplish this, but and here's one possible workflow:

pull from master, resolving any merge conflicts, so there's no regression from recent fixes on master and you're as current as possible
create a new branch
copy limited-scope changes into the new branch, starting perhaps with some low-level changes
do a pull request in master for the new branch; requesting review of the smaller change
handle review comments in the pull request
repeat until master has all of the necessary changes So far, here are some review comments:
redux.py may not be the place to put all of these command arguments. Refer to earlier discussion with Harald, where I thought we all agreed to have few command-line args to the main driver and standalone utilities as needed.
dynamic sql binding must be used for queries. Otherwise, we're exposed on sql injection attacks. We may not care now since it's all single user sqlite, but this could easily come back to bite us as a security vulnerability when we move to mysql and maybe have a server somewhere with data on it. Must fix things like select ... where snpname = '%s' ..., using the proper construct db.execute('select ... where snpname=? ...', (var,))
environment variable needed at runtime needs to be fixed. Refer to prior slack discussion where you agreed (sort of "It's doable") to one env var, pointing to the source directory. In the source dir, you can ". ./redux.env" to set this
db1.py probably shouldn't exist. If you really need a different interface, I'd recommend sub-classing DB or else fixing whatever's needed in db.py (make sure to fix any place that is already using DB apis). Keep in mind - sqlite is not the end goal, so keeping this simple helps future portability to other DBM
preference to use the Trace class rather than print statements. Trace is not the end goal, likely logging is. See Issue #36 which may not be done immediately, but having a consistent logging call across the code will make the job easier in the future.
One thing I'm still working out is the extra columns added to vcfcalls. I'd much rather fix get_array_api than change the underlying low level storage. I know you said you don't understand this routine, but it's a very simple routine that walks through calls and returns values in a python data structure. These are not difficult concepts. I'll probably have more comments later. Resolving these in an issue is not the way I'd prefer. I'd prefer to resolve review comments via a pull request.— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

jeftreece commented 6 years ago

I'm not sure what might still be open here, but I'm closing this issue with merge #38

jazdrv / dnaTools

sort code needs code review #19