vtsuperdarn / davitpy

DEPRECATED The DaViT Python project
http://vtsuperdarn.github.com/davitpy/
GNU General Public License v3.0

Replace RST dmapio with Python based dmapio #249

Closed (asreimer closed this 4 years ago)

asreimer commented 8 years ago

This pull request finally implements a python based method for reading dmap record files. It utilizes the pydmap library written by @kkotyk.

The biggest change here is that pydmap reads an entire file at once and returns a list of dictionaries containing the data, in contrast to the RST dmap methods, which kept a file open and read records one at a time. This means all the data is read into memory at once. As such, I've removed the file-concatenating code in radDataTypes and sdDataTypes and replaced it with code that reads files from a file list, which helps meet the request in https://github.com/vtsuperdarn/davitpy/issues/65. Reading a user-specified file also still works.
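The read-all-at-once pattern could be sketched roughly as follows (a minimal illustration, not the actual pydmap API; read_all_records and parse_file are hypothetical names):

```python
def read_all_records(file_list, parse_file):
    """Read every file in file_list up front and return one flat list of
    record dictionaries, instead of holding a file open and pulling
    records one at a time as the RST dmap methods did."""
    records = []
    for path in file_list:
        # parse_file is assumed to return a list of dicts for one file
        records.extend(parse_file(path))
    return records
```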

I've also removed the unnecessary scanData class as per https://github.com/vtsuperdarn/davitpy/issues/211.

To test this code, first delete the build, dist, and davitpy.egg-info directories in the davitpy root directory. Then run python setup.py install --user as usual. Test the code by trying out everything you normally do with davitpy, and also try running python radDataTypes.py and python sdDataTypes.py.

An important note: I have not modified the DataTypes.py file since it is currently not used anywhere. This file was added in the past to abstract a generalized data type class for handling various data types (for example, dmap, json, hdf5, etc).

asreimer commented 8 years ago

Everything is done and working great now.

BUT, RTI plotting doesn't work for some reason. On line 293 of rti.py, the myBeam object is always None, even if one passes their own myPtr in as myFile... I'm not sure why this is happening.

Can someone play around with the code?

ksterne commented 8 years ago

Seems as though this reintroduces errors fixed in #246 and #247. Do you want to resync from develop and try again?

asreimer commented 8 years ago

Ah yes, I just didn't do a pull from develop. I've done that now.

This hasn't fixed the RTI plotting issue. I'm looking into it, but if someone else figures out the problem I won't complain! :)

asreimer commented 8 years ago

To help anyone who wants to debug the RTI problem, this is the code I'm running:

from davitpy import pydarn
from datetime import datetime
pydarn.plotting.plot_rti(datetime(2012,9,22),'rkn',eTime=datetime(2012,9,22,3))

And I get this error:

ERROR:root:no data available for the requested time/radar/filetype combination

asreimer commented 8 years ago

Ah, it's actually a simple problem: it has to do with specifying a beam number (the bmnum argument in radDataOpen). I removed a while loop from the readRec function in radDataTypes.py that I shouldn't have.
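The restored logic amounts to a loop like the following (a simplified sketch of the idea, not the actual readRec code; the record list and cursor handling are illustrative):

```python
def read_matching_rec(records, cursor, bmnum=None):
    """Advance through an in-memory record list until a record matching
    the requested beam number is found. Without this loop, the first
    non-matching record makes the caller see None immediately."""
    while cursor < len(records):
        rec = records[cursor]
        cursor += 1
        if bmnum is None or rec.get('bmnum') == bmnum:
            return rec, cursor
    return None, cursor  # no more matching records
```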

asreimer commented 8 years ago

This pull request is now finished (short of someone finding a bug).

ksterne commented 8 years ago

Hey @asreimer,

I've been having trouble getting anything going with this branch. Here's what I can say I've done so far. I've removed the build/, dist/, and davitpy.egg-info/ directories where my code is, as well as the /usr/local/lib/python2.7/dist-packages/davitpy-0.6* directory, since I run the install script without specifying a user.

However, when calling a pydarn.sdio.radDataOpen(sDate,radar,eDate,fileType='fitacf',local_fnamefmt=remote_fnamefmt,remote_fnamefmt=remote_fnamefmt) with all the variables filled in I get an error message of:

ERROR:root:cannot import name parse_dmap_format_from_file
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/sdio/radDataTypes.py", line 409, in __init__
    self.eTime)
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/sdio/radDataTypes.py", line 774, in __validate_fetched
    from davitpy.pydarn.dmapio import parse_dmap_format_from_file
ImportError: cannot import name parse_dmap_format_from_file

It seems as though there is some kind of issue with importing a module?

ksterne commented 8 years ago

Mmmm, I had a look into this a little more. @asreimer, it seems as though the dmap __init__.py file tries to import everything from pydmap. And then looking into the pydmap directory, the __init__.py file there is one line which tries to import dmap.

So, it seems as though there's code that's not listed here...what gives? Did something break in the git-fu?

asreimer commented 8 years ago

Ok, the pydmap directory is empty except for the __init__.py file... I guess I never committed the files. Whoops! Sorry about that!

I've committed the necessary file now, so everything should be working on your end.

ksterne commented 8 years ago

Wish I had better news.... Now getting this error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/sdio/radDataTypes.py", line 409, in __init__
    self.eTime)
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/sdio/radDataTypes.py", line 786, in __validate_fetched
    records = parse_dmap_format_from_file(f)
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/dmapio/pydmap/dmap.py", line 1081, in parse_dmap_format_from_file
    dm = RawDmapRead(filepath)
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/dmapio/pydmap/dmap.py", line 207, in __init__
    self.test_initial_data_integrity()
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/dmapio/pydmap/dmap.py", line 234, in test_initial_data_integrity
    code = self.read_data('i')
  File "/usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/davitpy/pydarn/dmapio/pydmap/dmap.py", line 461, in read_data
    data = struct.unpack_from(data_type_fmt,self.dmap_bytearr,self.cursor)
TypeError: unpack_from() argument 1 must be string or read-only buffer, not bytearray

It seems like there's something odd about how the data_type_fmt variable is being set. If the line numbers in the traceback are right, it seems like line 234 should be passing it an 'i'. Any thoughts here?

asreimer commented 8 years ago

What did you do to get this error?

ksterne commented 8 years ago

Same as I previously posted:


    try: myPtr = pydarn.sdio.radDataOpen(sDate,radar,eDate,fileType='fitacf',local_fnamefmt=remote_fnamefmt,remote_fnamefmt=remote_fnamefmt)

Where sDate/eDate are sequential datetime objects, and radar is a three letter radar code. As well:

remote_fnamefmt = ['{date}.{hour}......{radar}.{ftype}','{date}.{hour}......{radar}...{ftype}']

asreimer commented 8 years ago

On a different computer than the one I developed this pull request on I'm able to plot RTIs and read from any file that I've tested. Seems like the specific file you are testing is causing the error. What are sDate, radar, eDate? For example, this works:

from davitpy import pydarn
from datetime import datetime
pydarn.plotting.plot_rti(datetime(2012,9,22),'rkn',eTime=datetime(2012,9,22,3))

ksterne commented 8 years ago

Or, more simply....I get the same thing if I do:

import datetime
from davitpy import pydarn

sDate=datetime.datetime(2014,7,8)
eDate=datetime.datetime(2014,7,9)
radar='bks'

remote_fnamefmt = ['{date}.{hour}......{radar}.{ftype}','{date}.{hour}......{radar}...{ftype}']
myPtr = pydarn.sdio.radDataOpen(sDate,radar,eDate,fileType='fitacf',local_fnamefmt=remote_fnamefmt,remote_fnamefmt=remote_fnamefmt)

asreimer commented 8 years ago

Weird. I ran the same code as you did here and I didn't get the same error :(

Maybe try deleting dist, build, and davitpy.egg again and reinstalling?

ksterne commented 8 years ago

sudo rm -r build/
sudo rm -r dist/
sudo rm -r davitpy.egg-info/
sudo rm -r /usr/local/lib/python2.7/dist-packages/davitpy-0.6-py2.7-linux-x86_64.egg/
sudo python setup.py install

I get the same error whether running your plot_rti example or my simple radDataOpen example. This is running on Ubuntu 12.04 server (so no GUI). I'll try it tomorrow on another Linux computer that has a regular Ubuntu on it.

asreimer commented 8 years ago

Hmm... I've run your code snippet on Ubuntu 14.04, Fedora 23, and OpenSUSE 13.4 and haven't run into this problem... Really weird. GUI or not shouldn't matter here.

The only thing I can think of is that you could try setting noCache=True keyword on radDataOpen. Perhaps a file you downloaded got corrupted for some reason.

kkotyk commented 8 years ago

Must be an issue with an older implementation of struct.unpack: it's not happy being passed a bytearray as a buffer. I feel like this fix may solve the issue:

http://stackoverflow.com/questions/15467009/struct-unpack-from-doesnt-work-with-bytearray

Where it says that Python 3 lifts that restriction, note it's also lifted in later versions of Python 2, as that's what I wrote this code against.
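The workaround from that thread amounts to handing struct.unpack_from an immutable buffer instead of the bytearray itself (a sketch; on Python 2 you could wrap with buffer() instead of bytes() to avoid the copy):

```python
import struct

raw = bytearray(b'\x01\x00\x00\x00')  # little-endian int 1

# Older Python 2.7 releases (e.g. 2.7.3) reject a bytearray as the buffer
# argument; converting to an immutable bytes object first works on every
# version of Python 2 and 3.
value, = struct.unpack_from('<i', bytes(raw), 0)
# value == 1
```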


ksterne commented 8 years ago

@kkotyk, @asreimer,

It seems as though this is an issue with older versions of Python. I tried this branch on another computer that has a GUI and still got the same thing, but both machines are running Python 2.7.3. My instinct for getting Python up to the latest version is to do:

sudo apt-get update
sudo apt-get install python2.7

However, on both Ubuntu installs, it says this is the latest version. I believe these are both Ubuntu 12. I have been able to find ways to upgrade Python more manually, as seen here. That seems a little rigorous for the average user. Though, counter to that, I understand Ubuntu 12 is about to be 4 years old. Several of our computers at VT are still using it, though. Is there a way to make this a little more backwards compatible for those of us lagging behind?

I logged onto an Ubuntu 14 computer, and it seems as though it has python 2.7.6. I did not try this branch though as it's not my computer.

asreimer commented 8 years ago

Hey @ksterne, I'm getting an Ubuntu 12.04 VM going to play around with this. I'm going to try the buffer() fix that @kkotyk suggested. If that works for all Python 2.7 versions, then we're good to go. I'll update y'all when I get this finished (could take a day or two).

asreimer commented 8 years ago

Alright it took me a bit longer to do this (check #245 for the reasons why). I can reproduce the error you are having @ksterne. I have a VM running a fresh install of Ubuntu 12.04 with everything (python packages and Ubuntu packages) updated to the latest. It is running python 2.7.3 and I get the same TypeError: unpack_from() argument 1 must be string or read-only buffer, not bytearray

Next step is to try @kkotyk's suggestion for the fix.

asreimer commented 8 years ago

Yup, that fixed it.

@ksterne, that problem is now gone. Have a look! :)

ksterne commented 8 years ago

@kkotyk, @asreimer, sorry, I must have read through the post a little too quickly. I didn't realize the fix was such an easy one to make. My time to work on things has been shorter recently.

Does this change break things for the newer versions of python? If they're backwards compatible I wouldn't think so.

A first couple of test runs, this is working like a charm. I'll crank up my big data pull/processing script and see if that breaks things. Thanks again!

ksterne commented 8 years ago

Ok, here's another question, which maybe is a dumb one. If I continue my example from a little earlier:

import datetime
from davitpy import pydarn

sDate=datetime.datetime(2014,7,8)
eDate=datetime.datetime(2014,7,9)
radar='bks'

remote_fnamefmt = ['{date}.{hour}......{radar}.{ftype}','{date}.{hour}......{radar}...{ftype}']
myPtr = pydarn.sdio.radDataOpen(sDate,radar,eDate,fileType='fitacf',local_fnamefmt=remote_fnamefmt,remote_fnamefmt=remote_fnamefmt)

but then add myPtr.close(), I get a complaint that there isn't a .close() for radDataPtr. I'm guessing this functionality has gone away, but do we still need it? Again, this may be a dumb question.

asreimer commented 8 years ago

I should have included that in my notes. I removed the open() and close() methods. They don't do anything since dmap.py handles opening and closing the files now.

I have added them again with a deprecation warning.
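The pattern for that kind of shim looks roughly like this (an illustrative sketch, not the exact davitpy code; the class and message text are placeholders):

```python
import warnings

class RadDataPtrShim(object):
    """Hypothetical stand-in showing the pattern: open()/close() are kept
    as no-ops that emit a DeprecationWarning, since dmap.py now handles
    opening and closing the files itself."""

    def open(self):
        warnings.warn("radDataPtr.open() is deprecated; files are now "
                      "opened automatically", DeprecationWarning)

    def close(self):
        warnings.warn("radDataPtr.close() is deprecated; files are now "
                      "closed automatically", DeprecationWarning)
```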

ksterne commented 8 years ago

Thanks for adding the deprecation warning in. I'm testing out how that is working as we speak, it seems like it's working OK.

So far, this seems to fix the issue of some kind of memory or data leak while doing statistical or large data pulls. I was able to run through all of August 2014 for all of the radars using fitacf files without things breaking.

However, there is a significant difference in the amount of time it takes to run through the files. Have you been able to quantify this @kkotyk or @asreimer? For the run of August 2014 data, it took 2 days to run through all of the files. There's some debate here on the practicality of this for doing large statistics. Granted, I'm not multi-threading here, so that could help speed things up.

As well, it seems as though this code changes how the concatenated file is addressed? It seems as though I'm not getting any concatenated files in my /tmp/sd/ directory. I am seeing the compressed and uncompressed files though. Is there a reason both of them are being kept around?

asreimer commented 8 years ago

Hmmm, it is much slower, isn't it. In fact it's 10x slower, as revealed by this test code:

def test_read_speed_pydmap():

    from datetime import datetime
    from davitpy.pydarn.dmapio import parse_dmap_format_from_file

    time = datetime.now()
    records = parse_dmap_format_from_file('20121101.0201.00.fhe.fitacf')
    print "number of records " + str(len(records))
    time2 = (datetime.now()-time).total_seconds()
    print "reading once took " +str(time2)+" seconds"

    print "Now read 10 times"

    time = datetime.now()
    for i in range(10):
        records = parse_dmap_format_from_file('20121101.0201.00.fhe.fitacf')
        print "number of records " + str(len(records))
    time2 = (datetime.now()-time).total_seconds()/10.0
    print "reading 10 times took an average of " +str(time2)+" seconds per read."

def test_read_speed_rstdmap():

    from datetime import datetime
    from davitpy.pydarn.dmapio import readDmapRec
    import os

    time = datetime.now()
    f = os.open('20121101.0201.00.fhe.fitacf',os.O_RDONLY)
    ptr = os.fdopen(f)
    records = list()
    record = readDmapRec(f)
    while record is not None:
        records.append(record)
        record = readDmapRec(f)

    print "number of records " + str(len(records))
    ptr.close()
    time2 = (datetime.now()-time).total_seconds()
    print "reading once took " +str(time2)+" seconds"

    print "Now read 10 times"

    time = datetime.now()
    for i in range(10):
        f = os.open('20121101.0201.00.fhe.fitacf',os.O_RDONLY)
        ptr = os.fdopen(f)
        records = list()
        record = readDmapRec(f)
        while record is not None:
            records.append(record)
            record = readDmapRec(f)
        print "number of records " + str(len(records))
        ptr.close()
    time2 = (datetime.now()-time).total_seconds()/10.0
    print "reading 10 times took an average of " +str(time2)+" seconds per read."

Note that you have to switch between the develop branch and this pull request to test run each function. @kkotyk, do you have anything to suggest here? I tested the pydmap code at an earlier stage and it was just as fast as the RST C code, but now it is 10x slower...

And yes, there are no more concatenated files. The compressed and uncompressed files are both being kept around because the compressed files are locally cached as per fetchUtils.

asreimer commented 8 years ago

Hi @ksterne and @kkotyk

I forgot to add that part of the reason @ksterne may be experiencing slow code is that the file fetching code is slow and always checks the integrity of downloaded files. If you have files on a local machine, it doesn't make sense to check whether they downloaded correctly.

@kkotyk and I will talk about speeding pydmap up a bit if possible.

asreimer commented 8 years ago

Hey @ksterne, looking at the numbers I'm a bit confused. First, do you have any timing on running your code without using pydmap?

Next, your timing numbers don't make sense to me. You said it took 48 hours to run through a month worth of data right? That means it takes 1.5 hours per day to do the processing, which makes me wonder what you are doing with the data.

ksterne commented 8 years ago

@asreimer,

I think you're not getting the code I've been using here. Let me see if I can simplify it:

for each radar in the SD network
       for each day in the month of X
              Get the entire day's worth of data for radar 'radar' on day 'day'
              Read each record and compare a variable to a known list
       write a file with some stats about radar 'radar' data for X month

So here I can tell how long it takes to go through one month's worth of data for one radar by when its output file was last modified. For example, for Sept. 2014, my 201409.sas file was last modified at 14:57 and my 201409.pgr file at 16:38, so I know that going through the entire month of Sept. 2014 for pgr took about 1.5 hours, i.e. about 3 minutes per day. Again, I'm doing a radDataReadRec() to get each record and then doing something very simple with it.
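In Python, the loop above looks roughly like this (month_stats, read_records, and known_modes are hypothetical stand-ins for the davitpy calls and the comparison list, not real API names):

```python
def month_stats(radars, days, read_records, known_modes):
    """Count, per radar, how many records over a month match a known
    list -- a minimal sketch of the statistics loop described above."""
    stats = {}
    for radar in radars:
        matched = total = 0
        for day in days:
            for rec in read_records(radar, day):  # one day's records
                total += 1
                if rec.get('mode') in known_modes:
                    matched += 1
        stats[radar] = (matched, total)  # written out per radar/month
    return stats
```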

Does that make more sense?

asreimer commented 8 years ago

Ah ok, that makes more sense. 1.5 hours is much better. Currently in my testing it takes ~45 seconds to read 24 hours of data (this doesn't include file fetching, validation, etc.) So roughly 3 minutes per day that you are getting is more reasonable. If you now run your code again using the develop branch (which doesn't use pydmap right now) does your code take ~8x less time to run?

Also, is 8x slower really a big problem?

The file fetching and validation code in davitpy could be optimized a bit. For example, ssh connections are always opened before checking whether cached files exist, and there are other silly things that could be optimized to speed things up.

ksterne commented 8 years ago

As for whether 8x slower is really a big problem, I would think so. This pull request is supposed to solve the issue of using davitpy for statistical studies that require large data pulls, and doing large data pulls while slowing things down don't seem to go together. While I am very thankful that my code now runs without crashing (and without my having to restart it where it crashed), it is taking quite some time.

I've found another potentially big problem here. It seems as though some of the data types have changed going from the develop branch to this branch. Open up a fitacf data pointer and look at the data type of the variables between the two. Here's some sample code:

import datetime
from davitpy import pydarn

sDate=datetime.datetime(2014,7,8,0,0)
eDate=datetime.datetime(2014,7,8,4,0)
radar='bks'

#pydarn.plotting.plot_rti(datetime.datetime(2012,9,22),'rkn',datetime.datetime(2012,9,22,3))
remote_fnamefmt = ['{date}.{hour}......{radar}.{ftype}','{date}.{hour}......{radar}...{ftype}']
myPtr = pydarn.sdio.radDataOpen(sDate,radar,eDate,fileType='fitacf',local_fnamefmt=remote_fnamefmt,remote_fnamefmt=remote_fnamefmt)
myData = pydarn.sdio.radDataReadRec(myPtr)

print type(myData.fit.slist)
print type(myData.fit.phi0)

On the develop branch you'll get list, whereas on this branch you'll get numpy.ndarray. Have the other variables changed in type as well? I'm not sure of a good way to check this easily.

Will this change how things are done? Possibly not, but I'm running some code that was behaving before and is now struggling because of the data type change. Is there a reason these were changed to numpy arrays? I get that it may make things easier to manipulate, but changing the data type without documentation or an announcement may not be good.

asreimer commented 8 years ago

After some time to think about this, is the speed still an issue? If it really is, is there a point in me spending any more time on this?

aburrell commented 8 years ago

I did a bit of speed testing on numpy arrays versus lists, and lists are usually significantly faster (about 2x as fast).

The scope of this is getting really large. @ksterne, would you be comfortable doing a speed test with a local file structure (even setting up a temporary one for testing) so that any speed issues with pydmap can be solved independently of fetchUtils? We can then start tackling the fetchUtils problems separately.

asreimer commented 8 years ago

We should be careful about comparing what is faster than what. There are good reasons that people use numpy arrays instead of python lists. Of course, there are also good reasons to use lists instead of numpy arrays too.

The potentially big problem that @ksterne alluded to isn't a problem at all. We currently don't explicitly type the data that is read from a dmap file for use in davitpy (i.e., if we read data in from an hdf5 file, what type would its arrays have? They would need to be explicitly converted for use in davitpy). This could be done in radDataTypes.py and sdDataTypes.py and wouldn't cost much in terms of speed. If we want everything to be lists, then we can explicitly convert to lists. EDIT: I should add that the current workflow doesn't make sense in a server environment where all data is locally accessible. Why are we creating so many cached files?
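Explicit conversion could be done centrally when records are loaded, something like the following (a sketch; normalize_record is a hypothetical helper, not existing davitpy code):

```python
def normalize_record(record):
    """Coerce array-like values in a parsed dmap record to plain Python
    lists, so downstream code sees one consistent type regardless of
    which reader (pydmap, RST wrapper, hdf5, ...) produced the data."""
    out = {}
    for key, value in record.items():
        if hasattr(value, 'tolist'):    # numpy arrays expose tolist()
            out[key] = value.tolist()
        elif isinstance(value, tuple):  # other sequence types
            out[key] = list(value)
        else:                           # scalars pass through unchanged
            out[key] = value
    return out
```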

There are some issues with the usage of fetchUtils in this pull request, but those can be easily fixed. For example, is there any particular reason that we have to check for files remotely every time we want to open a file? Perhaps we should make the workflow into this: user explicitly gets file -> user opens file with davitpy, which creates an object that has all plotting methods attached to it -> user plots data using davitpy object. For example: davitpy.fetch(datetime(), rad, fileType) -> obj = davitpy.read_files(datetime(),rad,fileType) -> obj.plot_rti()

Way off topic now... We should probably have a meeting to discuss things. Also, we should have a users meeting at some point to see what the users of davitpy actually want.

Getting back to the issue of speed, the real problem is that it's Python, not C. There are no speed issues with pydmap other than that it's written in Python; specifically, the slowest part of the pydmap code is struct.unpack_from. @kkotyk (a computer engineer, i.e. he has formal computer science training that I don't) has already sped this up as much as he could, so I'm very skeptical we can make the Python-based dmap file reading any faster than it already is. That said, if someone else can figure out a speed improvement, please do (besides the obvious "use pypy instead of python").

So if we are only talking about speed here, the way I see it we have two options going forward: 1) fix the C-based dmap Python wrapper (basically refactor and rewrite the dmap.c and dmapio.c code; any takers?), or 2) take the speed hit and have purely Python dmap file-reading code.

If we need to discuss this in more detail, we should probably have a meeting. Currently, development on this has stalled because I'm waiting for people to either say, "Yes, the speed hit is fine and I like that our file reading/plotting code would all be in python." or "No, let's just try to fix the C code instead and make sure it can also compile on all OSes."

ksterne commented 8 years ago

@aburrell, would you still like for me to do a speed test on things? I think @asreimer noted about the speed being about 10x slower with python. Maybe it's good to put some real numbers behind things.

@asreimer, I think the discussion on cached files and the like is best saved for another place. The short of it is that @ajribeiro added caching at the request of someone in Europe who didn't have such a great internet connection. If you're back at 56k speeds or something like that, a lot of the time can be taken up just transferring files. This is especially annoying when you're just learning davitpy and need to change one small thing, but would then need to re-download a file. Caching makes it quicker to adjust and tweak.

Anywho, what if there was a third option, @asreimer? Is it possible to keep the C dmap library in the code, with a different call or option to use C versus Python? I get that it bloats things, but maybe that makes everyone happy? The default could be the Python library, and then if there aren't a lot of complaints about speed, or no one seems to be using the C dmap library, we could deprecate it and eventually remove it.

ajribeiro commented 8 years ago

I've been lurking via email updates, and I just wanted to chime in and say that this is AWESOME! Great work guys!

asreimer commented 6 years ago

At the developers' meeting, we decided to add a flag to pick the Python vs. C version, so both will be supported.

ksterne commented 4 years ago

Seems as though we never got much farther with the development here. I'm closing this for now since this repo is being deprecated. The good news is I think this code did make it into pydarn... or at least it prompted some new thinking on file IO.