teamtomo / starfile

STAR file I/O in Python
https://teamtomo.org/starfile/
BSD 3-Clause "New" or "Revised" License

performance issue when opening model.star file #3

Closed inter1965 closed 4 years ago

inter1965 commented 4 years ago

When reading a STAR file that contains thousands of data blocks (e.g. a 3D classification model file), would it be possible to feed pandas.read_csv in _read_loop_data from an mmap or StringIO buffer object to improve performance? Otherwise it can take hours to read a model STAR file. An option to read only the first several blocks would also be a nice addition to speed up specific cases. Cheers.
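To make the StringIO suggestion concrete, here is a minimal sketch of buffering one loop block in memory and handing it to pandas.read_csv in a single call. The helper name `loop_block_to_dataframe` and the tiny example block are hypothetical and only illustrate the idea; this is not starfile's actual `_read_loop_data`.

```python
import io

import pandas as pd


def loop_block_to_dataframe(lines):
    """Parse the lines of one STAR loop block (hypothetical helper)."""
    stripped = [ln.strip() for ln in lines]
    # Column labels start with "_"; RELION appends "#<index>" after the name.
    names = [ln.split()[0].lstrip("_") for ln in stripped if ln.startswith("_")]
    # Remaining non-empty lines, apart from the "loop_" keyword, are data rows.
    rows = [ln for ln in stripped if ln and not ln.startswith(("_", "loop_"))]
    # Join the rows once and let pandas parse the whole buffer in one call.
    buffer = io.StringIO("\n".join(rows))
    return pd.read_csv(buffer, sep=r"\s+", header=None, names=names)


block = """loop_
_rlnClassNumber #1
_rlnAccuracyRotations #2
1 0.5
2 1.25
""".splitlines()

df = loop_block_to_dataframe(block)
```

Whether this is faster than the existing reader depends on where the time is actually spent, so it would need profiling against a real model file.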

alisterburt commented 4 years ago

This sounds like a great idea. Would you mind sending me a large RELION model STAR file for testing?

I will probably try to use StringIO because I have used it before, but if you have experience with mmap and would like to have a go, it would be good to compare both!

The idea about an option for just reading the first few blocks is a good one and should work with the way things are set up - I’ll look into it over the next week or so.

Cheers,

Alister


inter1965 commented 4 years ago

Wow, the mail system complained about the size of the original file. Hope the zipped version works. Best, X

On Mon, Oct 12, 2020 at 9:40 PM 张皛闶 inter1965@gmail.com wrote:

Hi Alisterburt, please find attached the model STAR file (>22 MB); it takes several hours to parse. I hope there is a good workaround for it. Cheers, X


inter1965 commented 4 years ago

Here's one example; I hope it helps. model.zip https://github.com/alisterburt/starfile/files/5371573/model.zip

alisterburt commented 4 years ago

Thanks!

I have a solution which reads files much faster now, but it's failing a couple of tests, so I need to debug a little more. Hopefully I'll have it up very soon.

On my machine the new parser reads a 1-million-line file, split into 1000 blocks of 1000x10 tables, in ~10 s. I've also added the option to read only the first N blocks in case you want to skip things, as you suggested.
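For readers wondering how a "first N blocks" option can avoid touching the rest of the file, here is a rough standard-library sketch. The names `iter_blocks` and `read_first_blocks` are hypothetical, the input is synthetic, and this is not the code that went into starfile; it only shows the lazy-reading idea.

```python
import io
from itertools import islice


def iter_blocks(handle):
    """Lazily yield (block_name, lines) for each data_ block in a STAR file."""
    name, lines = None, []
    for raw in handle:
        line = raw.strip()
        if line.startswith("data_"):
            # A new block starts, so hand back the one collected so far.
            if name is not None:
                yield name, lines
            name, lines = line[len("data_"):], []
        elif name is not None and line and not line.startswith("#"):
            lines.append(line)
    # Hand back the final block once the file is exhausted.
    if name is not None:
        yield name, lines


def read_first_blocks(handle, n_blocks):
    """Collect only the first n_blocks blocks; later lines are never read."""
    return dict(islice(iter_blocks(handle), n_blocks))


# Synthetic file with 1000 small blocks (illustrative data only).
text = "".join(
    f"data_block_{i}\n\nloop_\n_col #1\n{i}\n\n" for i in range(1000)
)
first = read_first_blocks(io.StringIO(text), 3)
```

Because `islice` stops pulling from the generator after three blocks, reading ends at the header of the fourth block. Blocks that share a name would overwrite each other in the returned dict, so a real implementation would need to handle that case.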


alisterburt commented 4 years ago

I got quite lost in bugs while trying to finish the optimisations, so I reverted to the last working state, which was already faster, and pushed it to PyPI (v0.2.3). I will go over it properly when I have a little more time.

alisterburt commented 4 years ago

Refactored, clean and working nicely now. Hopefully this makes things faster for you. Please update to v0.3.1.