haddocking / pdb-tools

A dependency-free cross-platform swiss army knife for PDB files.
https://haddocking.github.io/pdb-tools/
Apache License 2.0

Fix pdb_selres on piped input files #86

Closed JoaoRodrigues closed 3 years ago

JoaoRodrigues commented 3 years ago

Python's sys.stdin is apparently not seekable, so we cannot rewind the contents after reading them once. Running something like pdb_fetch 1ctf | pdb_selres -80: will now not return anything, because the file handle is exhausted and the seek doesn't throw an error (at least on Windows). This PR fixes the issue by populating an iterator as we read the file the first time. Same low-memory behavior, slightly slower.
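A minimal sketch of the approach described above (not the actual pdb_selres.py code; the names and the io.StringIO stand-in are illustrative):

```python
import io
from itertools import chain

def scan_and_rebuild(fh):
    """Read fh once, doing per-line work, and return an iterator
    that replays the same lines afterwards (no seek needed)."""
    buffer = iter([])  # start from an empty iterator
    for line in fh:
        # ... inspect the line here (e.g. collect residue numbers) ...
        buffer = chain(buffer, [line])  # append the line to the replay iterator
    return buffer

stream = io.StringIO("ATOM 1\nATOM 2\n")  # stands in for the non-seekable sys.stdin
replay = scan_and_rebuild(stream)
lines = list(replay)
print(lines)  # the lines are available again, even though the stream is exhausted
```

This avoids ever calling .seek() on the input, which is exactly what fails on a piped stdin.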

JoaoRodrigues commented 3 years ago

Forgot to run tests... a lot of failures :)

EDIT: Now they all pass.

codecov[bot] commented 3 years ago

Codecov Report

Merging #86 into master will decrease coverage by 0.02%. The diff coverage is 75.00%.


@@            Coverage Diff             @@
##           master      #86      +/-   ##
==========================================
- Coverage   81.95%   81.92%   -0.03%     
==========================================
  Files          46       46              
  Lines        3657     3663       +6     
  Branches      763      763              
==========================================
+ Hits         2997     3001       +4     
- Misses        468      470       +2     
  Partials      192      192              
Impacted Files Coverage Δ
pdbtools/pdb_selres.py 75.36% <75.00%> (-0.40%) :arrow_down:



JoaoRodrigues commented 3 years ago

On another note: I don't know if concatenating single-string lists with itertools.chain has any advantage over building one list with all the lines and passing that around instead of the iterator. Why do you need an iterator that simulates a list? fh makes sense initially because you avoid .readlines(), but since you are creating a list anyway when passing to iter_chain in buffer, right now I can't see the benefit over a plain list. Again, you are creating one list per line instead of one single list with all the lines.

The idea is that you never build a list with all the lines, keeping memory use constant and quite low (you only keep one line in memory at any time). Otherwise, why would we bother making an iterator at all?
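To illustrate the general lazy-vs-eager distinction being argued here (a hypothetical example, not the PR code): .readlines() materializes every line up front, while a generator produces lines only as they are requested.

```python
import io

def stream_lines(fh):
    # Lazily yield processed lines; nothing is read until iteration starts.
    for line in fh:
        yield line.upper()

data = "one\ntwo\nthree\n"

# Eager: the whole file becomes a list in memory at once.
eager = io.StringIO(data).readlines()

# Lazy: only one line is consumed per next() call.
lazy = stream_lines(io.StringIO(data))
first = next(lazy)
print(eager, first)
```
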

joaomcteixeira commented 3 years ago

I understand your purpose. But I can't see that the code actually does what you say; I may be wrong.

buffer is returned on line 207 inside a tuple. The for loop that creates buffer is not a generator; it is exhausted normally. I may be really missing something, but I truly believe the current code stores a separate list in memory for each line, and the buffer iterator then points to those locations in memory. This is already tricky Python. If you are sure about how the code performs, go ahead. I can't see it right now; I need to set up some personal homework tasks :smile:

JoaoRodrigues commented 3 years ago

Fixed that text/bytes string issue. Thanks for pointing it out, I had some leftover stuff!

As for the generators: on line 177, buffer is defined as an empty iterator: buffer = iter([]). We then iterate over fh, exhausting it, but at each step we do buffer = iter_chain(buffer, [line]). iter_chain is a shortcut for itertools.chain(), which concatenates an existing iterator (buffer) with any iterable (in our case, [line]). As such, we grow buffer as an iterator. You could do the same thing like this, but it's uglier in my opinion:

def exhaust_and_regenerate(fh, resid_list):
    for line in fh:
        ...  # do stuff
        resid_list.append(...)  # mutated in place; the caller's list sees the updates
        yield line

resid_list = []
fh = exhaust_and_regenerate(fh, resid_list)

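A tiny self-contained run of this generator pattern (io.StringIO and the .strip() call are stand-ins for the real file handle and per-line work) shows the subtlety: the side-effect list only fills up as the returned generator is consumed.

```python
import io

def exhaust_and_regenerate(fh, resid_list):
    # Yield each line while recording something about it as a side effect.
    for line in fh:
        resid_list.append(line.strip())  # stand-in for the real per-line work
        yield line

fh = io.StringIO("ATOM 1\nATOM 2\n")  # stand-in for sys.stdin
seen = []
replay = exhaust_and_regenerate(fh, seen)
print(len(seen))        # 0: nothing has been consumed yet, so nothing recorded
lines = list(replay)    # consuming the generator fills `seen`
print(seen, lines)
```
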
joaomcteixeira commented 3 years ago

I see. Thanks for the example. I will run it separately and investigate; it is a good learning exercise. I've been away from coding for more than a month and need to get back into shape, hehe.

Your last commits change the initial PR quite a lot, which is good. Everything seems to work; I tested the problematic command.

Are you okay with the CI complaints? I don't want to merge without everything green, although the complaints are very minor :stuck_out_tongue: it might even be one of those codecov diff conflicts.

:+1:

joaomcteixeira commented 3 years ago

Sorry for the long wait before merging. As discussed with @JoaoRodrigues by :phone:, merging... :tada: everyone :smile: