rice-solar-physics / pydrad

Python tools for setting up HYDRAD runs and parsing output
https://pydrad.readthedocs.io
MIT License

Initial parse is extremely slow #132

Closed jwreep closed 3 years ago

jwreep commented 3 years ago

I'm not sure what the cause is here. When initially creating a Strand, the code can take a minute or two on some simulations. Presumably it's reading all of the files at first? Is this necessary?

from pydrad.parse import Strand
s = Strand('path/to/HYDRAD/')

Does the code read every file in the Results directory? This is extremely slow if there are lots of output files.

wtbarnes commented 3 years ago

Every time a strand is initialized, the time array is parsed:

https://github.com/rice-solar-physics/pydrad/blob/73a2d807aa5919e870a0af9d8c7738274821febe/pydrad/parse/parse.py#L49

This goes through every single AMR file and grabs the time:

https://github.com/rice-solar-physics/pydrad/blob/73a2d807aa5919e870a0af9d8c7738274821febe/pydrad/parse/parse.py#L24-L31

As far as I can tell, there is no way around this as the time array is not stored anywhere else. Note that this is only done once though, even after slicing a strand (e.g. when you do s[1:10]). If HYDRAD stored the time array in a single location, then this could be avoided.
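
As a rough illustration of why this step scales with the number of output files, the sketch below reads the first line of every `profile*.amr` file and collects the times. It is not the pydrad implementation: the helper name `read_times_from_amr` is hypothetical, and it assumes the time stamp is the first line of each file in `Results/`, which matches the `profile9.amr` header quoted further down in this thread.

```python
import glob
import os
import re


def read_times_from_amr(hydrad_root):
    """Collect the time stamp (first line) of every Results/profile*.amr file.

    Hypothetical helper for illustration only; not part of pydrad.
    """
    pattern = os.path.join(hydrad_root, 'Results', 'profile*.amr')
    # Sort numerically so that profile10.amr comes after profile9.amr
    amr_files = sorted(
        glob.glob(pattern),
        key=lambda name: int(re.search(r'profile(\d+)\.amr$', name).group(1)),
    )
    times = []
    for filename in amr_files:
        # One file open per output time step: this is the slow part
        with open(filename) as f:
            times.append(float(f.readline()))
    return times
```

Each call opens every AMR file once, so the cost grows linearly with the number of profiles in the Results directory.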

Note that you can get around this by parsing the time once (with the get_master_time function), storing it in a file (or in memory), and then passing it in as a kwarg when you create the strand:

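from pydrad.parse import Strand
# Import path assumed from the parse.py link above
from pydrad.parse.parse import get_master_time

my_master_time = get_master_time('/path/to/my/sim')  # store this in a file or something
Strand('/path/to/my/sim', master_time=my_master_time)

jwreep commented 3 years ago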

This seems somewhat cumbersome. The data is output in a fixed pattern, so the time array shouldn't need to be read in principle. As long as one isn't mixing and matching simulations there are two possible alternatives, I think.

  1. Read the time array from HYDRAD/config/hydrad.cfg -- this file stores both the final time printed and the cadence at which the time is output. This would be quick and simple, but if the simulation hasn't finished, it might not be accurate. The parser could check for this and fall back to reading all of the .amr files.

  2. Read only the first and last profile in the Results directory, and make an assumption that the cadence of output has been constant the whole way through the simulation (I personally have never changed it mid-simulation). Read the config file to get the cadence.
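
A minimal sketch of the constant-cadence idea in option 2, under stated assumptions: the function name `estimate_times` is hypothetical, the cadence is inferred from the first and last time stamps and the number of profiles rather than read from the config file (the layout of hydrad.cfg is not shown in this thread), and no profiles are missing from the sequence.

```python
def estimate_times(t_first, t_last, n_profiles):
    """Evenly spaced times between the first and last profile.

    Assumes the output cadence was constant for the whole simulation.
    Hypothetical helper for illustration only; not part of pydrad.
    """
    if n_profiles < 1:
        raise ValueError('need at least one profile')
    if n_profiles == 1:
        return [t_first]
    cadence = (t_last - t_first) / (n_profiles - 1)
    return [t_first + i * cadence for i in range(n_profiles)]


# Illustrative values: 10 profiles spanning roughly 0 s to 90 s
times = estimate_times(0.0, 90.0005, 10)
```

Only two files need to be opened to obtain `t_first` and `t_last`, so the cost no longer depends on the number of profiles. The estimated times are approximate if the true time stamps drift slightly from the nominal cadence.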

wtbarnes commented 3 years ago

That's fair. One issue, though, is that the resulting times will not exactly match those derived from a constant cadence and a start and end time. For example, I just looked at the first few lines of a results file I had, profile9.amr, where the output cadence was 10 s and the total simulation time was 1 day. Profile 9 should therefore have the time 90 s. However, this is not quite the case:

90.0005
9
6.7353609264214478e+09
411

One easy alternative would be to create the master time array once, save it to a file inside the HYDRAD results directory, and then prioritize reading the time from that file so that parsing the Strand is only slow once rather than being slow every time. If that file is not present, it just falls back to reading it from each AMR file.
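
One way this cache-then-fall-back idea might look is sketched below. Everything named here is an assumption for illustration: the function `load_or_build_master_time`, the cache file name `master_time.json`, and the choice of JSON as the storage format are not defined by HYDRAD or pydrad.

```python
import json
import os


def load_or_build_master_time(hydrad_root, build_time_array,
                              cache_name='master_time.json'):
    """Return the cached time array if present, otherwise build and cache it.

    `build_time_array` is any callable that takes `hydrad_root` and returns a
    sequence of floats (for example, the slow parse of every AMR file).
    `cache_name` is a hypothetical file name chosen for this sketch.
    """
    cache_file = os.path.join(hydrad_root, 'Results', cache_name)
    if os.path.isfile(cache_file):
        # Fast path: read one small file instead of every AMR file
        with open(cache_file) as f:
            return json.load(f)
    # Slow path: parse once, then store the result for next time
    times = [float(t) for t in build_time_array(hydrad_root)]
    with open(cache_file, 'w') as f:
        json.dump(times, f)
    return times
```

This sketch never invalidates the cache, so it would return a stale array for a simulation that is still writing output. Comparing the cached length against the number of AMR files is one possible check.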

sjbradshaw commented 3 years ago

You can safely assume that 90 s = 90.0005 s. HYDRAD doesn’t necessarily land on exactly the value of the time implied by file number * cadence because of the varying size of the time-steps. However, since the time-steps are typically << cadence then this isn’t really a problem.
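
The size of the mismatch can be checked with a few lines of arithmetic, using the numbers from the profile9.amr example above. The tolerance of 1% of the cadence is an arbitrary choice for this sketch, not a value taken from HYDRAD.

```python
import math

cadence = 10.0          # s, output cadence from the example above
file_number = 9
time_in_file = 90.0005  # s, first line of profile9.amr

nominal = file_number * cadence   # time implied by file number * cadence
offset = time_in_file - nominal   # how far HYDRAD overshot the nominal time

print(f'offset = {offset:.4f} s, offset / cadence = {offset / cadence:.1e}')

# Treat the two times as equal if they agree to within 1% of the cadence
# (tolerance chosen for illustration only)
assert math.isclose(time_in_file, nominal, abs_tol=0.01 * cadence)
```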
