guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License
137 stars 51 forks source link

Parallel download #43

Closed agalitsyna closed 6 years ago

agalitsyna commented 6 years ago

I've parallelized sra download/parsing and revived some forgotten, but useful features (e.g. silent download). I also designed tests for parallelization and run the old ones to prove that I have not introduced any errors.

Hope you find this useful. I'll be happy to accept your suggestions on how to improve this part.

guma44 commented 6 years ago

Hi, thanks a lot for this pull request. I really like the tests you added. I am in the process of reviewing and testing the code. The tests on Travis are not passing because there is no Bio module in requirements.txt file. It should be added - in the past, it was optional because most people would not use it to download SRA files anyway. This time it has to be included because tests are based on it. I also see some minor style issues but this is to be fixed in the end.

antonkulaga commented 6 years ago

@agalitsyna looks awesome, looking forward to seeing it merged!

agalitsyna commented 6 years ago

Hi, thanks a lot for the review! Can I fix the Bio module requirement and minor style issues and do pull request once more? If so, I'd be grateful for a general description of style issues.

guma44 commented 6 years ago

Of course, you can. Concerning the style, whenever I can I try to follow pep8 so when you use any linter (sublime anaconda, atom python-ide, autopep8 or pycodestyle) should give you some hints. In the code, I live some marks too.

agalitsyna commented 6 years ago

Basically, I've completed code style review (not only for parallelization part) and added biopython to the requirements. I've also fixed some minor bugs that I've encountered. I've run a setup from scratch of GEOparser on a blank linux system, both installation and tests are ok.

I avoided following some PEP8 recommendations that were already ignored in GEOparse (e.g. no capital letters in function names).

I'm confused by two things:

  1. Mixed usage of exceptions (logger.error vs GEOTypes defined exception classes). Is there any substantial logic for it? For example, I had to remove NoSRAToolkitException, because the SRA unpacking step was moved to utils.py, where there are no exception classes defined. Hope it's not a crime.

  2. Usage of e-mail is required by my tests with SRA download, so I put my unused e-mail for that purpose in the code. Is it convenient?