Everything is done and working great now.
BUT, RTI plotting doesn't work for some reason. On line 293 of `rti.py`, the `myBeam` object is always `None`, even if one passes in their own `myPtr` for `myFile`... Not sure why this is happening.
Can someone play around with the code?
Seems as though this reintroduces errors fixed in #246 and #247. Do you want to resync from develop and try again?
Ah yes, I just didn't do a pull from develop. I've done that now.
This hasn't fixed the RTI plotting issue. I'm looking into it, but if someone else figures out the problem I won't complain! :)
To help anyone who wants to debug the RTI problem, this is the code I'm running:
```python
from davitpy import pydarn
from datetime import datetime

pydarn.plotting.plot_rti(datetime(2012, 9, 22), 'rkn',
                         eTime=datetime(2012, 9, 22, 3))
```
And I get this error:
```
ERROR:root:no data available for the requested time/radar/filetype combination
```
Ah, it's actually a simple problem. It has to do with specifying a beam number (the `bmnum` argument in `radDataOpen`). I removed a while loop from the `readRec` function in `radDataTypes.py` that I shouldn't have removed.
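For anyone curious, here is a hypothetical sketch of the kind of beam-filtering loop that was missing (`_next_record` and the attribute names are illustrative, not the actual davitpy internals):

```python
def readRec(myPtr):
    """Return the next record matching the beam requested in radDataOpen."""
    beam = _next_record(myPtr)  # hypothetical low-level record read
    # Without a loop like this, the first record read is returned even when
    # it belongs to a different beam, which breaks bmnum filtering.
    while (beam is not None and myPtr.bmnum is not None
           and beam.bmnum != myPtr.bmnum):
        beam = _next_record(myPtr)
    return beam
```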
This pull request is now finished (short of someone finding a bug).
Hey @asreimer,
I've been having trouble getting anything going with this branch. Here's what I can say I've done so far: I've removed the build/, dist/, and davitpy.egg-info/ directories from my code directory, as well as the /usr/local/lib/python2.7/dist-packages/davitpy-0.6* directory, since I run the install script without specifying a user.
However, when calling

```python
pydarn.sdio.radDataOpen(sDate, radar, eDate, fileType='fitacf',
                        local_fnamefmt=remote_fnamefmt,
                        remote_fnamefmt=remote_fnamefmt)
```

with all the variables filled in, I get an error message of:
```
ERROR:root:cannot import name parse_dmap_format_from_file
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/sdio/radDataTypes.py", line 409, in __init__
    self.eTime)
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/sdio/radDataTypes.py", line 774, in __validate_fetched
    from davitpy.pydarn.dmapio import parse_dmap_format_from_file
ImportError: cannot import name parse_dmap_format_from_file
```
It seems as though there is some kind of issue with importing a module?
Mmmm, I had a look into this a little more. @asreimer, it seems as though the dmap `__init__.py` file tries to import everything from pydmap. And then, looking into the pydmap directory, the `__init__.py` file there is one line which tries to import dmap.
So, it seems as though there's code that's not listed here...what gives? Did something break in the git-fu?
Ok, the pydmap directory is empty except for the `__init__.py` file... I guess I never committed the files. Whoops! Sorry about that!
I've committed the necessary files now, so everything should be working on your end.
Wish I had better news... Now getting this error:
```
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/sdio/radDataTypes.py", line 409, in __init__
    self.eTime)
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/sdio/radDataTypes.py", line 786, in __validate_fetched
    records = parse_dmap_format_from_file(f)
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/dmapio/pydmap/dmap.py", line 1081, in parse_dmap_format_from_file
    dm = RawDmapRead(filepath)
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/dmapio/pydmap/dmap.py", line 207, in __init__
    self.test_initial_data_integrity()
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/dmapio/pydmap/dmap.py", line 234, in test_initial_data_integrity
    code = self.read_data('i')
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/dmapio/pydmap/dmap.py", line 461, in read_data
    data = struct.unpack_from(data_type_fmt,self.dmap_bytearr,self.cursor)
TypeError: unpack_from() argument 1 must be string or read-only buffer, not bytearray
```
It seems like there's something odd about how the `data_type_fmt` variable is being set. If the line numbers in the traceback are right, it seems like line 234 should be passing it `'i'`. Any thoughts here?
What did you do to get this error?
Same as I previously posted:
```python
myPtr = pydarn.sdio.radDataOpen(sDate, radar, eDate, fileType='fitacf',
                                local_fnamefmt=remote_fnamefmt,
                                remote_fnamefmt=remote_fnamefmt)
```
Where sDate/eDate are sequential datetime objects, and radar is a three-letter radar code. As well:
```python
remote_fnamefmt = ['{date}.{hour}......{radar}.{ftype}',
                   '{date}.{hour}......{radar}...{ftype}']
```
On a different computer than the one I developed this pull request on, I'm able to plot RTIs and read from any file that I've tested. It seems like the specific file you are testing is causing the error. What are sDate, radar, and eDate? For example, this works:
```python
from davitpy import pydarn
from datetime import datetime

pydarn.plotting.plot_rti(datetime(2012, 9, 22), 'rkn',
                         eTime=datetime(2012, 9, 22, 3))
```
Or, more simply... I get the same thing if I do:
```python
import datetime
from davitpy import pydarn

sDate = datetime.datetime(2014, 7, 8)
eDate = datetime.datetime(2014, 7, 9)
radar = 'bks'
remote_fnamefmt = ['{date}.{hour}......{radar}.{ftype}',
                   '{date}.{hour}......{radar}...{ftype}']
myPtr = pydarn.sdio.radDataOpen(sDate, radar, eDate, fileType='fitacf',
                                local_fnamefmt=remote_fnamefmt,
                                remote_fnamefmt=remote_fnamefmt)
```
Weird. I ran the same code as you did here and I didn't get the same error :(
Maybe try deleting dist, build, and davitpy.egg again and reinstalling?
```bash
sudo rm -r build/
sudo rm -r dist/
sudo rm -r davitpy.egg-info/
sudo rm -r /usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/
sudo python setup.py install
```
I get the same error whether running your `plot_rti` example or my simple `radDataOpen` example. This is running on Ubuntu 12.04 server (so no GUI). I'll try it tomorrow on another Linux computer that has a regular Ubuntu on it.
Hmm... I've run your code snippet on Ubuntu 14.04, Fedora 23, and OpenSUSE 13.4 and haven't run into this problem... Really weird. GUI or not shouldn't matter here.
The only thing I can think of is that you could try setting the noCache=True keyword on radDataOpen. Perhaps a file you downloaded got corrupted for some reason.
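For reference, a minimal usage sketch of that suggestion, reusing the dates and radar from the earlier example:

```python
import datetime
from davitpy import pydarn

sDate = datetime.datetime(2014, 7, 8)
eDate = datetime.datetime(2014, 7, 9)
# noCache=True skips the locally cached copy in case a download was corrupted
myPtr = pydarn.sdio.radDataOpen(sDate, 'bks', eDate, fileType='fitacf',
                                noCache=True)
```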
Must be an issue with an older implementation of struct.unpack; it's not happy with being passed a bytearray as a buffer. I feel like this fix may solve that issue:
http://stackoverflow.com/questions/15467009/struct-unpack-from-doesnt-work-with-bytearray
Where it says that Python 3 lifts that restriction, it's also lifted in later versions of Python 2, as that's what I wrote this code with.
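A minimal sketch of the workaround from that StackOverflow answer, assuming the failure mode in the traceback above (the sample bytes are made up, not real dmap data):

```python
import struct

dmap_bytearr = bytearray(b'\x01\x00\x00\x00')  # placeholder bytes

try:
    code = struct.unpack_from('i', dmap_bytearr, 0)
except TypeError:
    # Older Python 2.7 interpreters (e.g. 2.7.3, as seen above) reject a
    # bytearray buffer; wrapping it in buffer() works around it. Python 3
    # and later Python 2.7 releases accept the bytearray directly.
    code = struct.unpack_from('i', buffer(dmap_bytearr), 0)
```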
-Keith Kotyk
@kkotyk, @asreimer,
It seems as though this is an issue with not having the latest version of Python. I tried this branch on another computer that has a GUI and still got the same thing, but both of them are running Python 2.7.3. My instinct for getting my Python to the latest version is to do:
```bash
sudo apt-get update
sudo apt-get install python2.7
```
However, on both Ubuntu installs, it says that this is the latest version. I believe these are both Ubuntu 12. I have been able to google ways to upgrade Python more manually, as seen here. That seems a little rigorous for the average user. Though, counter to that, I understand Ubuntu 12 is about to be 4 years old. Several of our computers at VT are still using it, though. Is there a way to make this a little more backwards compatible for those of us lagging behind?
I logged onto an Ubuntu 14 computer, and it seems as though it has python 2.7.6. I did not try this branch though as it's not my computer.
Hey @ksterne I'm getting an Ubuntu 12.04 VM going to play around with this. I'm going to try the buffer() fix that @kkotyk suggested. If that works for all python 2.7 versions, then we're good to go. I'll update yall when I get this finished (could take a day or two).
Alright, it took me a bit longer to do this (check #245 for the reasons why). I can reproduce the error you are having, @ksterne. I have a VM running a fresh install of Ubuntu 12.04 with everything (Python packages and Ubuntu packages) updated to the latest. It is running Python 2.7.3, and I get the same error:
```
TypeError: unpack_from() argument 1 must be string or read-only buffer, not bytearray
```
Next step is to try @kkotyk's suggestion for the fix.
Yup, that fixed it.
@ksterne, that problem is now gone. Have a look! :)
@kkotyk, @asreimer, Sorry I must have read through the post a little too quickly. I did not realize the fix was such an easy/simple one to make. My time to work on some things has been shorter recently.
Does this change break things for the newer versions of python? If they're backwards compatible I wouldn't think so.
After a first couple of test runs, this is working like a charm. I'll crank up my big data pull/processing script and see if that breaks things. Thanks again!
Ok, here's another question, which maybe is a dumb one. If I continue my example from before:
```python
import datetime
from davitpy import pydarn

sDate = datetime.datetime(2014, 7, 8)
eDate = datetime.datetime(2014, 7, 9)
radar = 'bks'
remote_fnamefmt = ['{date}.{hour}......{radar}.{ftype}',
                   '{date}.{hour}......{radar}...{ftype}']
myPtr = pydarn.sdio.radDataOpen(sDate, radar, eDate, fileType='fitacf',
                                local_fnamefmt=remote_fnamefmt,
                                remote_fnamefmt=remote_fnamefmt)
```
but then add `myPtr.close()`, I get a complaint that there isn't a `.close` for `radDataPtr`. I'm guessing this functionality has gone away, but do we still need it? Again, this is where this may be a dumb question.
I should have included that in my notes. I removed the `open()` and `close()` methods. They don't do anything since `dmap.py` handles opening and closing the files now.
I have added them back with a deprecation warning.
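A minimal sketch (assumed, not copied from the actual commit) of what a deprecated no-op `close()` can look like:

```python
import warnings

class radDataPtr(object):
    # ... rest of the class unchanged ...

    def close(self):
        """Deprecated no-op: dmap.py now opens and closes files itself."""
        warnings.warn("radDataPtr.close() is deprecated; file handling is "
                      "done by dmap.py now.", DeprecationWarning)
```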
Thanks for adding the deprecation warning. I'm testing out how it works as we speak; it seems to be working OK.
So far, this seems to fix the issue of some kind of memory or data leak while doing statistical or large data pulls. I was able to run through all of August 2014 for all of the radars using fitacf files without things breaking.
However, there is a significant difference in the amount of time it takes to run through the files. Have you been able to quantify this @kkotyk or @asreimer? For the run of August 2014 data, it took 2 days to run through all of the files. There's some debate here on the practicality of this for doing large statistics. Granted, I'm not multi-threading here, so that could help speed things up.
As well, it seems as though this code changes how the concatenated file is handled? It seems as though I'm not getting any concatenated files in my /tmp/sd/ directory. I am seeing the compressed and uncompressed files, though. Is there a reason both of them are being kept around?
Hmmm, it is much slower, isn't it. In fact, it's 10x slower, as revealed by this test code:
```python
def test_read_speed_pydmap():
    from datetime import datetime
    from davitpy.pydarn.dmapio import parse_dmap_format_from_file

    time = datetime.now()
    records = parse_dmap_format_from_file('20121101.0201.00.fhe.fitacf')
    print "number of records " + str(len(records))
    time2 = (datetime.now() - time).total_seconds()
    print "reading once took " + str(time2) + " seconds"

    print "Now read 10 times"
    time = datetime.now()
    for i in range(10):
        records = parse_dmap_format_from_file('20121101.0201.00.fhe.fitacf')
        print "number of records " + str(len(records))
    time2 = (datetime.now() - time).total_seconds() / 10.0
    print "reading 10 times took an average of " + str(time2) + " seconds per read."


def test_read_speed_rstdmap():
    from datetime import datetime
    from davitpy.pydarn.dmapio import readDmapRec
    import os

    time = datetime.now()
    f = os.open('20121101.0201.00.fhe.fitacf', os.O_RDONLY)
    ptr = os.fdopen(f)
    records = list()
    record = readDmapRec(f)
    while record is not None:
        records.append(record)
        record = readDmapRec(f)
    print "number of records " + str(len(records))
    ptr.close()
    time2 = (datetime.now() - time).total_seconds()
    print "reading once took " + str(time2) + " seconds"

    print "Now read 10 times"
    time = datetime.now()
    for i in range(10):
        f = os.open('20121101.0201.00.fhe.fitacf', os.O_RDONLY)
        ptr = os.fdopen(f)
        records = list()
        record = readDmapRec(f)
        while record is not None:
            records.append(record)
            record = readDmapRec(f)
        print "number of records " + str(len(records))
        ptr.close()
    time2 = (datetime.now() - time).total_seconds() / 10.0
    print "reading 10 times took an average of " + str(time2) + " seconds per read."
```
Note that you have to switch between the develop branch and this pull request to run each test function. @kkotyk, do you have anything to suggest here? I tested the pydmap code at an earlier stage and it was just as fast as the RST C code, but now it is 10x slower...
And yes, there are no more concatenated files. The compressed and uncompressed files are both being kept around because the compressed files are locally cached as per fetchUtils.
Hi @ksterne and @kkotyk
I forgot to add that part of the reason @ksterne may be experiencing slow code is that the file fetching code is slow and always checks the integrity of downloaded files. If you have files on a local machine, it doesn't make sense to check whether they downloaded correctly.
@kkotyk and I will talk about speeding pydmap up a bit if possible.
Hey @ksterne, looking at the numbers I'm a bit confused. First, do you have any timing on running your code without using pydmap?
Next, your timing numbers don't make sense to me. You said it took 48 hours to run through a month's worth of data, right? That means it takes 1.5 hours per day of data to do the processing, which makes me wonder what you are doing with the data.
@asreimer,
I think you're not getting the code I've been using here. Let me see if I can simplify it:
```
for each radar in the SD network:
    for each day in month X:
        get the entire day's worth of data for radar 'radar' on day 'day'
        read each record and compare a variable to a known list
    write a file with some stats about radar 'radar' data for month X
```
So here I can tell how long it takes to go through one month's worth of data for one radar by when the output file was last modified. For example, for Sept. 2014, my 201409.sas file was last modified at 14:57 and my 201409.pgr file was last modified at 16:38. So I know that going through the entire month of Sept. 2014 for pgr took about 1.5 hours, which is about 3 minutes per day. Again, I'm just doing a `radDataReadRec()` to get each record and then doing something very simple with that record.
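In code form, the inner read loop looks roughly like this (a sketch; the comparison against the known list is omitted):

```python
import datetime
from davitpy import pydarn

sDate = datetime.datetime(2014, 9, 1)
eDate = datetime.datetime(2014, 9, 2)
myPtr = pydarn.sdio.radDataOpen(sDate, 'pgr', eDate, fileType='fitacf')

myBeam = pydarn.sdio.radDataReadRec(myPtr)
while myBeam is not None:
    # compare a variable from the record to a known list here
    myBeam = pydarn.sdio.radDataReadRec(myPtr)
```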
Does that make more sense?
Ah ok, that makes more sense. 1.5 hours is much better. Currently, in my testing, it takes ~45 seconds to read 24 hours of data (this doesn't include file fetching, validation, etc.), so the roughly 3 minutes per day that you are getting is more reasonable. If you now run your code again using the develop branch (which doesn't use pydmap right now), does your code take ~8x less time to run?
Also, is 8x slower really a big problem?
The file fetching and validation code in davitpy could be optimized a bit. For example, ssh connections are always opened before checking if cached files exist. And there are other silly things done that could be optimized to speed things up.
On the question of whether 8x slower is really a big problem: I would think so. This pull request is supposed to solve the issue of using davitpy for statistical studies, which require large data pulls. Doing large data pulls and slowing things down don't seem to go together. While I am very thankful that my code now runs without crashing (and without my having to restart it where it crashed), it is taking quite some time.
I've found another potentially big problem here. It seems as though some of the data types have changed going from the develop branch to this branch. Open up a fitacf data pointer and compare the data types of the variables between the two branches. Here's some sample code:
```python
import datetime
from davitpy import pydarn

sDate = datetime.datetime(2014, 7, 8, 0, 0)
eDate = datetime.datetime(2014, 7, 8, 4, 0)
radar = 'bks'
#pydarn.plotting.plot_rti(datetime.datetime(2012,9,22),'rkn',datetime.datetime(2012,9,22,3))
remote_fnamefmt = ['{date}.{hour}......{radar}.{ftype}',
                   '{date}.{hour}......{radar}...{ftype}']
myPtr = pydarn.sdio.radDataOpen(sDate, radar, eDate, fileType='fitacf',
                                local_fnamefmt=remote_fnamefmt,
                                remote_fnamefmt=remote_fnamefmt)
myData = pydarn.sdio.radDataReadRec(myPtr)
print type(myData.fit.slist)
print type(myData.fit.phi0)
```
On the develop branch you'll get `list`, whereas on this branch you'll get `numpy.ndarray`. Have the other parts changed in type as well? I'm not sure of a good way to check this easily.
Will this change how things are done? Possibly not, but I'm running some code that was behaving before and is now struggling because of the data type change. Is there a reason these were changed to a numpy array? I get that it may make things easier to manipulate, but changing the data type without documentation or an announcement may not be good.
After some time to think about this, is the speed still an issue? If it really is, is there a point in me spending any more time on this?
I did a bit of speed testing on numpy arrays versus lists, and lists are usually significantly faster (about 2x as fast).
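For what it's worth, a rough sketch of the kind of micro-benchmark I mean (the per-element access pattern is an assumption on my part; that is where numpy arrays tend to lose to lists):

```python
import timeit

# Sum elements one at a time, the way record-by-record code tends to
list_time = timeit.timeit('sum(x for x in data)',
                          setup='data = range(1000)', number=10000)
numpy_time = timeit.timeit('sum(x for x in data)',
                           setup='import numpy; data = numpy.arange(1000)',
                           number=10000)
print "list: %.3f s, numpy array: %.3f s" % (list_time, numpy_time)
```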
The scope of this is getting really large. @ksterne, would you be comfortable doing a speed-test with a local file structure (even setting up a temporary one for testing) so that any speed issues with pydmap can be solved independently of fetchUtils? We can then start tackling the fetchUtil problems separately.
We should be careful about comparing what is faster than what. There are good reasons that people use numpy arrays instead of python lists. Of course, there are also good reasons to use lists instead of numpy arrays too.
The potentially big problem that @ksterne alluded to isn't a problem at all. We currently don't explicitly type the data that is read from a dmap file for use in davitpy (i.e., if we had data read in from an hdf5 file, what would the type of its arrays be? It would need to be explicitly converted for use in davitpy). This could be done in `radDataTypes.py` and `sdDataTypes.py` and wouldn't cost much in terms of speed. If we want everything to be lists, then we can explicitly convert them to lists; a sketch of that conversion is below. EDIT: I should add that the current workflow doesn't make sense in a server environment where all data is locally accessible. Why are we creating so many cached files?
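Coming back to the type question, a minimal sketch of the explicit conversion I mean, assuming it lives somewhere like `radDataTypes.py`:

```python
import numpy as np

def to_list(value):
    """Coerce arrays coming out of the dmap reader into plain python lists."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    return value
```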
There are some issues with the usage of fetchUtils in this pull request, but those can be easily fixed. For example, is there any particular reason that we have to check for files remotely every time we want to open a file? Perhaps we should make the workflow this: the user explicitly fetches files -> the user opens files with davitpy, which creates an object that has all plotting methods attached to it -> the user plots data using the davitpy object. For example: `davitpy.fetch(datetime(), rad, fileType)` -> `obj = davitpy.read_files(datetime(), rad, fileType)` -> `obj.plot_rti()`.
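Spelled out as code, the proposed workflow would look like this (entirely hypothetical; none of these functions exist yet):

```python
from datetime import datetime
import davitpy

# 1. explicitly fetch the files once
davitpy.fetch(datetime(2012, 9, 22), 'rkn', 'fitacf')
# 2. read them into an object that carries the plotting methods
obj = davitpy.read_files(datetime(2012, 9, 22), 'rkn', 'fitacf')
# 3. plot
obj.plot_rti()
```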
Way off topic now... We should probably have a meeting to discuss things. Also, we should have a users meeting at some point to see what the users of davitpy actually want.
Getting back to the issue of speed here, the real problem is that it's python, and not C. There are no speed issues with pydmap other than that it's written in python. Specifically, the slowest part of the pydmap code is `struct.unpack_from`. @kkotyk (a Computer Engineer, i.e., he has formal computer science training that I don't have) has already sped this up as much as he could, so I'm very skeptical we can make the python-based dmap file reading any faster than it already is. That being said, if someone else can figure out a speed improvement, please do (besides the obvious "use pypy instead of python").
So if we are only talking about speed here, the way I see it, we have 2 options going forward: 1) fix the C-based dmap python wrapper (basically refactor and rewrite the `dmap.c` and `dmapio.c` code; any takers?), or 2) take the speed hit and have purely python dmap file reading code.
If we need to discuss this in more detail, we should probably have a meeting. Currently, development on this has stalled because I'm waiting for people to either say, "Yes, the speed hit is fine and I like that our file reading/plotting code would all be in python." or "No, let's just try to fix the C code instead and make sure it can also compile on all OSes."
@aburrell, would you still like me to do a speed test on things? I think @asreimer noted the speed being about 10x slower with python. Maybe it's good to put some real numbers behind things.
@asreimer, I think the discussion on cached files and the like is saved for another place. The short of it is that @ajribeiro added caching at the request of someone in Europe who didn't have such a great internet connection. If you're back at 56k speeds or something like that, a lot of the time can be taken just transferring the files. This can be especially annoying when you're just learning davitpy and need to change one small thing, but would then need to re-download a file. Caching makes it quicker to adjust/tweak.
Anywho, what if there were a third option, @asreimer? Is it possible to keep the C dmap library in the code, with a different call or option to use C versus python? I get that it bloats things, but maybe that makes everyone happy? The default could be to use the python library, and then if there aren't a lot of complaints about the speed, or no one seems to be using the C dmap library, we could deprecate it and eventually remove it.
I've been lurking via email updates, and I just wanted to chime in and say that this is AWESOME! Great work guys!
At the developers' meeting, we decided to add a flag to pick the python vs the C version. So support both.
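A sketch of what such a flag could look like (the function and keyword names here are assumptions, and it presumes both readers stay importable from davitpy.pydarn.dmapio, as in the speed test above):

```python
import os
from davitpy.pydarn.dmapio import parse_dmap_format_from_file, readDmapRec

def read_dmap_file(filepath, use_pydmap=True):
    """Return a list of record dicts using the python or the C reader."""
    if use_pydmap:
        # pure-python pydmap: parses the whole file in one pass
        return parse_dmap_format_from_file(filepath)
    # RST C wrapper: read records one at a time
    f = os.open(filepath, os.O_RDONLY)
    records = []
    record = readDmapRec(f)
    while record is not None:
        records.append(record)
        record = readDmapRec(f)
    os.close(f)
    return records
```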
Seems as though we never got much farther with the development here. I'm closing this for now since this repo is being deprecated. The good news is that I think this code did make it into pydarn... or at least there was some new thinking on file IO.
This pull request finally implements a python based method for reading dmap record files. It utilizes the pydmap library written by @kkotyk.
The biggest change here is that pydmap is written to read entire files at once and return a list of dictionaries containing the data, in contrast to the RST dmap methods, which kept an entire file open and read records one at a time. This means all the data is read into memory at once. As such, I've removed the file concatenating code in `radDataTypes` and `sdDataTypes` and replaced it with code that reads files from a file list. This helps meet the request in https://github.com/vtsuperdarn/davitpy/issues/65. Reading from a user-specified file also still works. I've also removed the unnecessary scanData class as per https://github.com/vtsuperdarn/davitpy/issues/211.
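To illustrate the read model (filename borrowed from the speed test earlier in the thread):

```python
from davitpy.pydarn.dmapio import parse_dmap_format_from_file

# pydmap parses the whole file in one shot: one dict per dmap record
records = parse_dmap_format_from_file('20121101.0201.00.fhe.fitacf')
print "read %d records" % len(records)
```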
To test this code, first you need to delete the build, dist, and davitpy.egg-info directories in the davitpy root directory. Then run `python setup.py install --user` as normal. Test the code by trying out all the things you like to do with davitpy. Also try running `python radDataTypes.py` and `python sdDataTypes.py`.
An important note: I have not modified the `DataTypes.py` file, since it is currently not used anywhere. This file was added in the past to abstract a generalized data type class for handling various data types (for example, dmap, json, hdf5, etc.).