That is slow. I will investigate this. I should be able to simulate this on my end with an equivalently large table.
Can you send me a contents() of that table? I'm playing with a table that has fewer rows but more columns than yours and is 66M on disk. Pulling that over into a dataframe, I'm seeing it take about 76 seconds, but only 10 of that was the data transfer: 1 second for the various calls to SAS before the transfer, another second to split it all up based on separators, 2 seconds to create the dataframe, and then over a minute doing the conversion to either numeric or datetime for the non-character columns. That's the slow part. I just loop through the columns of the dataframe and do the following.
For any numeric columns I call: pd.to_numeric(df[varlist[i]], errors='coerce') and for date/time/datetime I call: pd.to_datetime(df[varlist[i]], errors='ignore')
These calls are where all the time is going. Next is to see if there's a way to do this faster.
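For reference, the loop is basically this shape (a self-contained sketch; the column metadata and values here are made up for illustration, in the real routine they come back from SAS):

```python
import pandas as pd

# Illustrative stand-ins for what comes back from SAS: the column names,
# which columns are numeric, and which of those carry date/time formats.
df = pd.DataFrame({'id': ['1', '2'],
                   'value': ['3.14', '2.72'],
                   'pointdate': ['2017-05-01 00:00:00', '2017-05-02 00:00:00']})
varlist = ['id', 'value', 'pointdate']
numeric_cols = {'id', 'value', 'pointdate'}
datetime_cols = {'pointdate'}

# The conversion pass described above: every column arrives as strings, and
# each non-character column is converted after the dataframe is built.
for i in range(len(varlist)):
    if varlist[i] not in numeric_cols:
        continue                              # character column, leave as-is
    if varlist[i] in datetime_cols:
        df[varlist[i]] = pd.to_datetime(df[varlist[i]], errors='ignore')
    else:
        df[varlist[i]] = pd.to_numeric(df[varlist[i]], errors='coerce')

print(df.dtypes)
```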
Hopefully this is okay for you. If not, let me know. spc_results_contents.txt
Cool, that worked. I see you have about 10 numeric columns, and although the data is 6G, you are only pulling over about 50M from your query (you said around 144,000 rows). That's 2 times the size of what I was working with, but my table had many more columns and fewer rows. I can build out data that's equivalent to yours and see how long this takes for me. I'll also be able to see if the dimensions (rows vs. columns) make a difference in these routines where 80% of the time is spent.
I'll still need to figure out if there's something I can do that works faster than those routines.
Ok, I've created a table that is equivalent in dimensions to your result set (144,000 x 32 w/ 10 numeric) and it takes only 5 seconds to import as a dataframe on my system. Also, I was mistaken before that your data was 2x mine; actually the sizes were close but the dimensions were very different (mine was 30,468 x 2240 w/ 2212 numeric). In my case, the extra 2200 columns, even though a fifth the rows, took over a minute in the pandas.to_numeric routine, while the 10 columns of your data, even with 5 times the rows, took only 1-2 seconds.
I'm attaching 2 files. One is the simple test case I used to simulate your data. You can run this directly and see if it takes the same time as your real case (to check whether I'm comparing apples to oranges), and the other has the sasdata2dataframe code with timings that I'm using to see where the time is spent (simplistic but effective).
Are you able to swap out a copy of sasioiom.py in your install? If you can make a copy of that file, replace that routine with this copy of the routine, and run both your real case and my test case, we can see where the time is going on your system. Just post back the results from those runs with the timing output (you can post it here or email it to me if you'd rather; whatever you want).
Thanks, Tom testcase.txt sasdata2dataframe.txt
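The timing additions are nothing fancy; something along these lines (a simplified sketch, not the actual diff), just bracketing each phase and printing the elapsed time:

```python
import time

def timed(label, func, *args, **kwargs):
    """Run func, print how long it took, and hand back its result."""
    start = time.time()
    result = func(*args, **kwargs)
    print('{:<28s}{:8.2f} seconds'.format(label, time.time() - start))
    return result

# In the instrumented sasdata2dataframe each phase gets wrapped the same way
# (the data transfer, the split on separators, the DataFrame build, and the
# to_numeric/to_datetime pass), so the per-phase numbers fall out of one run.
print(timed('example: sum of a range', sum, range(10**7)))
```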
Sorry, the source table has around 19 million rows and when I run the query on it, I get back around 144,000 rows. I sent you the contents() for the source table.
Yes, I got that, which is how I created the test case. I doubt the data step that subsets the data is taking any significant time; you can see how long it takes in the saslog. If you can run the tests with the diagnostic routine and send me the results, along with the log from the data step, we can see where all this time is going for you.
Hi Tom, for some weird reason I only see your most recent messages after I have posted a message, so my replies may be out of sync with your most recent message!
Anyway, I have added the new sasdata2dataframe() code into sasioiom.py and then I ran it against the real data and against the test case and they behave similarly. They both spend all their time in the transfer phase.
I have attached the results from both runs in the text file below.
Hey, no problem. It seemed we were out of sync a little. Yeah, that's completely different from what I'm seeing. I've attached my trace output for my test job; it transfers in 3 seconds and does all 10 to_numerics in 2 seconds (about 5 per second), whereas you have a 2.5 minute transfer and only get 2 or 3 to_numerics per second. Did you see the SAS log for how long the data step took? I only see 7 seconds after the data shows up to create the DF and convert the data, which leaves a few more minutes unaccounted for to get to 6 total. I really can't imagine the data step took up that time, but we'll see. I believe we're running the same code (saspy); you just pulled to get the fix to even get this to work with your data, right? I hate going here, but it seems like infrastructure/resources on your system? Are you able to run any diagnostics on your system (where python/saspy are running) to see if you may be memory, disk, network or cpu bound? Any of those could account for it taking longer than what I'm observing. test1.txt
In fact, in your run of my test case, where there was no data step subsetting (the data step took 0.4 seconds), it took 5 minutes to transfer the data and 10 to finish. That's a long time; my transfer for that was 3 seconds with only 2 more to finish. Slow network???
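If it helps, even a simple sampler like this would show which resource is pegged while the transfer is in flight (psutil here is just one option, and the 30-second loop is arbitrary; run it in a second terminal):

```python
import time
import psutil  # third-party: pip install psutil

# Sample basic host counters once a second; watch which number pegs
# (cpu/memory) or flatlines (disk/network) while the transfer runs.
disk0, net0 = psutil.disk_io_counters(), psutil.net_io_counters()
for _ in range(30):
    time.sleep(1)
    disk, net = psutil.disk_io_counters(), psutil.net_io_counters()
    print('cpu {:5.1f}%  mem {:5.1f}%  disk read {:>10d} B/s  net recv {:>10d} B/s'.format(
        psutil.cpu_percent(),
        psutil.virtual_memory().percent,
        disk.read_bytes - disk0.read_bytes,
        net.bytes_recv - net0.bytes_recv))
    disk0, net0 = disk, net
```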
I'm on a company LAN and am not experiencing any network problems and I can pull large volumes of data from other sources without any issue. My laptop doesn't appear to be memory, disk or cpu bound.
I installed saspy on a Windows server with lots of CPUs and RAM and I am seeing the same issue.
I modified sasioiom.submit() to record when saspy receives each block from the sas server, then grouped the data by second and plotted it. It looks like data comes in quite quickly at the beginning and tails off dramatically after that, but I still cannot tell whether that tail-off is happening on the sas server or on my laptop.
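The instrumentation amounts to something like this (a standalone sketch; blocks_from_server() is a made-up stand-in for the read loop inside submit()):

```python
import time
from collections import Counter

def blocks_from_server():
    """Made-up stand-in for the read loop in submit(): yields
    (seconds since start, bytes received) for each block that arrives."""
    start = time.time()
    for _ in range(5):                     # pretend five blocks arrive
        time.sleep(0.3)                    # simulated network delay
        yield time.time() - start, 4096

# Bucket the received bytes by whole seconds so the arrival rate over time
# can be printed (or plotted) and any tail-off becomes visible.
per_second = Counter()
for elapsed, size in blocks_from_server():
    per_second[int(elapsed)] += size
for second in sorted(per_second):
    print('{:4d}s  {:10d} bytes'.format(second, per_second[second]))
```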
Hey! I know it's been a while, but I've finally had a chance to investigate this. I think I found the main bottleneck. Someone here saw this w/ a 7G sas data set that was taking overnight to load into a dataframe. That really stunk. I reworked something on the python side and dropped this to 2 minutes. I've pushed the fix to master. Do you think you can pull and give it a try with your use case and see if it fixes you too? I expect it will.
Thanks, Tom
Hi Tom, in the last two days I started looking at this again and now you have a fix! Can't wait to try it. I will let you know how it goes. Thanks.
Hey, I found a bug in that fix. I just pushed a fix for that. So, if you already pulled the fix, grab it again to get this latest. BTW, you have to get it from the repo, it's not in the current pypi package (but you know that I'm sure). Thanks! Tom
Hi Tom, that has made a huge improvement to the download rate! Thanks, Adrian.
Great! Sorry that took so long. Too many other things going on around here. Think it's good enough to close this out?
Yes, thanks again Tom.
Hi, this could be related to changes made to sasioiom.py as a result of issue #37, I'm not sure.
When the code is run on the server it returns approximately 144,000 rows (23 columns), but it takes around 6 minutes to get the data back.
When I run almost the same query via Enterprise Guide it takes a few seconds to return the data, but in that case _tomods1 was pointing to a file on disk. Using proc sql on EG also takes a few seconds. If I try this it works just as fast:
c = s.submit("""proc sql;create table work.test as select * from engnr_cl.spc_results where pointdate >= "01may2017 00:00"dt;quit;""")
It doesn't appear to be a problem with the creation of the dataframe; that happens very quickly. It looks like it might be how the data is received in the submit() method, but I'm not sure.
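One way to separate the two costs described above is to time the server-side subset and the transfer into pandas independently, e.g. (a sketch only; s is the existing saspy SASsession, and sd2df is assumed here as the pull mechanism, which may not match the original code exactly):

```python
import time

# Step 1: subset on the server only (the part EG/proc sql does in seconds);
# no data crosses the wire yet.
start = time.time()
s.submit("""proc sql;
              create table work.test as
              select * from engnr_cl.spc_results
              where pointdate >= "01may2017 00:00"dt;
            quit;""")
print('server-side subset: {:6.1f}s'.format(time.time() - start))

# Step 2: pull the already-subset table back into pandas; a big number here
# points at the transfer/parse path rather than the query itself.
start = time.time()
df = s.sd2df(table='test', libref='work')
print('transfer to pandas: {:6.1f}s'.format(time.time() - start))
print(df.shape)
```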
Should it be taking this long to retrieve the data?
Thanks.