uwescience / sqlshare

Documentation and help for the SQLShare project
escience.washington.edu/sqlshare
7 stars 2 forks source link

Query results should be streamed. #11

Closed dhalperi closed 11 years ago

dhalperi commented 11 years ago

See uwescience/sqlshare-pythonclient#5. User couldn't download a big query result that is the result of a union of a raw query. When I ran the same program locally, after 5 minutes there was still no data flowing.

(updated now that I know what's going on:)

The REST API call to /v1/db/file?sql= materializes the results in memory on the REST server before sending them to the client. This API needs to be changed to stream results on the fly instead.

billhowe commented 11 years ago

Sounds like the query itself is taking a very long time?

On Wed, Jul 17, 2013 at 11:55 AM, Daniel Halperin notifications@github.comwrote:

See uwescience/sqlshare#5https://github.com/uwescience/sqlshare/issues/5. User couldn't download a big query result that is the result of a union of a raw query. When I ran the same program locally, after 5 minutes there was no data flowing. Something's going on, but I'm not sure what yet.

— Reply to this email directly or view it on GitHubhttps://github.com/uwescience/sqlshare/issues/11 .

dhalperi commented 11 years ago

I've tracked it down a bit more. The SQLShare UI uses an API that streams tuples from SQLShare to the client, when downloading a file.

The pythonclient API uses "SELECT * FROM table" to download a dataset. However, the API call that returns SQL query results to a client, i.e. not the same API as above, materializes the entire query result in memory, then streams that to the client. When the query is 5GB it probably crashes.

This will be another TODO :)

dhalperi commented 11 years ago

(edited the previous comment to be more clear than it was in your email inbox.)

billhowe commented 11 years ago

On Wed, Jul 17, 2013 at 1:21 PM, Daniel Halperin notifications@github.comwrote:

I've tracked it down a bit more. The website uses an API that _streams_tuples from SQLShare to the client.

The API that executes SQL materializes the entire query result in memory, then streams that to the client. When the query is 5GB it probably crashes.

But I guess the issue is that the streaming API does not support arbitrary SQL?

— Reply to this email directly or view it on GitHubhttps://github.com/uwescience/sqlshare/issues/11#issuecomment-21141756 .

dhalperi commented 11 years ago

Right, the streaming API just takes a dataset name as a parameter. It's just a bug that needs to be fixed.

dhalperi commented 11 years ago

(renamed and updated the issue description now that I know what's going on)

schitnis commented 11 years ago

Thanks Dan! http://msdn.microsoft.com/en-us/library/ms733742(v=vs.100).aspx is the approach suggested my msdn.

My main concern is that there may be several places in the code that assume everything can be in memory at many stages (even before we send it back to the client).

In any case, I will take a look at some APIs.

Thanks, Sagar

dhalperi commented 11 years ago

Hi Sagar,

Looking at the actual SQLShare code,

        // NOTE: Currently format is being ignored
        public Stream SqlToFileV1(string sql, string format, string filename)
        {
            if (String.IsNullOrEmpty(filename))
                filename = "untitled";
            ExecuteSqlManagerV1.QueryResultEx reply = SqlSynchonousExecute(sql);

seems to explicitly materialize the entire query result.

Compare to

       public Stream DatasetToFile(string name, string schema, string format)
        {
            SSAPIFactory fac = CloudUtil.CreateSSAPIFactory();
            try
            {
                UserSession session = InitUserSession(fac);

which does not.

schitnis commented 11 years ago

Hi Daniel,

When they originally designed these APIs, I believe most of the APIs (like the SqlToFileV1) were designed to use these classes that stored the data.

For example: The QueryResult (from which QueryResultEx above is derived) is public class QueryResult { public int rows_total { get; set; } public double exec_time { get; set; } public string[] header { get; set; } public string[] type { get; set; } public object[][] data { get; set; .............. i.e. they assumed that all data can be held here (in memory).

The best way to fix this is to bypass the old code that used these old classes.

I will work on this one SqlToFileV1.

What I meant to say earlier was that although this can be fixed , there may be several places in the code that need fixing.

But I can start with SqlToFileV1.

dhalperi commented 11 years ago

Sagar,

You're right that this problem may exist in lots of places. The bug is to fix this particular API call, right now, because we have users that want it to work.

Here is the URL that the UI uses, and it does the right thing:

/v2/db/datasets/{user}/{dataset}/result

See, for example, https://rest.sqlshare.escience.washington.edu/REST.svc/v2/db/dataset/sqlshare72513@gmail.com/reuters_terms.csv/result

I want the API linked above to also do the right thing.

We can worry about scanning every API call later.

Thanks! Dan

dhalperi commented 11 years ago

Hi Sagar,

Which commit fixed this issue?

The download speed now seems very slow, but that could be an Azure artifact. I'm also surprised that the download takes many seconds to start---the website download starts almost instantaneously. Is this expected?

Thanks! Dan

dhalperi commented 11 years ago

Hi Sagar,

The download still failed with the same error. Here's the command and the message:

% python tools/fetchdata.py -d "[sr320@washington.edu].[BiGill_methratio_v9_A.txt]"  -f tsv -o ~/Desktop/test.txt
Traceback (most recent call last):
  File "tools/fetchdata.py", line 77, in <module>
    main()
  File "tools/fetchdata.py", line 71, in main
    data = fetchdata(args.sql, args.format)
  File "tools/fetchdata.py", line 30, in fetchdata
    return conn.download_sql_result(sql, format)
  File "build/bdist.macosx-10.8-x86_64/egg/sqlshare/__init__.py", line 336, in download_sql_result

  File "build/bdist.macosx-10.8-x86_64/egg/sqlshare/__init__.py", line 249, in poll_selector

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 543, in read
    return self._read_chunked(amt)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 585, in _read_chunked
    line = self.fp.readline(_MAXLINE + 1)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 241, in recv
    return self.read(buflen)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 160, in read
    return self._sslobj.read(len)
socket.error: [Errno 54] Connection reset by peer
python tools/fetchdata.py -d  -f tsv -o ~/Desktop/test.txt  4.82s user 2.63s system 2% cpu 5:57.97 total
schitnis commented 11 years ago

Hi Daniel,

1)The downloads start instantaneously from my machine. Maybe it was some temporary Azure thing when you tried it?

Please can you try again?

https://sqlshare-rest.cloudapp.net/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[che625@washington.edu].[BiGO_Methylation_oysterv9_GFF]

https://sqlshare-rest.cloudapp.net/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[sr320@washington.edu].[BiGo_methratio_GFF_boop]

2)CL#850 assembla - see RestFileDownload.cs

Thanks, Sagar

schitnis commented 11 years ago

Saving all email discussions for reference(with email addresses removed)

On Fri, Aug 2, 2013 at 7:50 PM, bill wrote: True, my fault for not thinking carefully.

Just change the filename to "sqlshare_query.csv" On Fri, Aug 2, 2013 at 4:15 PM, Daniel wrote: No, this call executes arbitrary SQL. Table name is not clear, and may not exist. (SELECT 1)

From: sagar To: bill CC: dan Subject: RE: Test cases Date: Fri, 2 Aug 2013 16:04:21 -0700

Hi Bill, I did not realize this earlier. We had fixed a similar issue for downloading a dataset and we fixed that but this is an sqlquery that could be "select 1" Also the sql query could contain multiple table names. We do have code to parse out the table names. Should we use a combination of all table names for naming the downloaded file? Thanks, Sagar

On Friday, August 2, 2013, Bill wrote: Per our conversation: looks good, but make sure the filename is set to the table name as with the other call.

On Wed, Jul 31, 2013 at 3:36 PM, Sagar wrote: Hi Dan,

I have deployed the fix. (There is a special case that still needs fixing.)

The following two URLs(for the test cases you suggested) would fail earlier. Now they download the data.

https://sqlshare-rest.cloudapp.net/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[che625@washington.edu].[BiGO_Methylation_oysterv9_GFF]

https://sqlshare-rest.cloudapp.net/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[sr320@washington.edu].[BiGo_methratio_GFF_boop]

Thanks, Sagar From: dan Date: Tue, 30 Jul 2013 16:45:23 -0400 Subject: Re: Test cases To: sagar CC: bill

Hi Sagar,

SELECT * FROM [sr320@washington.edu].[BiGo_methratio_GFF_boop]

SELECT * FROM [che625@washington.edu].[BiGO_Methylation_oysterv9_GFF]

Are two big ones.

Thanks, Dan

On Tue, Jul 30, 2013 at 4:21 PM, Sagar wrote: Hi Bill / Daniel,

Please can you suggest 2-3 testcases for sql queries that return extremely large data (that would not be enough for memory)?

Thanks, Sagar

schitnis commented 11 years ago

Hi Daniel,

For the python error: Could it be a bug in the python code above?

It looks like from your logs you are probably trying : https://sqlshare-rest.cloudapp.net/REST.svc/v1/db/file?sql=select%20*%20from%20[sr320@washington.edu].[BiGill_methratio_v9_A.txt] THis seems to download fine with the correct headers(200 OK)

Thanks, Sagar

dhalperi commented 11 years ago

It's also got a &format=tsv parameter. I am trying right now via wget and we'll see if that finishes.

The Python code ran for several minutes and then claimed the server dropped the network connection. I could imagine it being a different problem, however!

Will send an update once the wget job finishes.

 wget 'https://rest.sqlshare.escience.washington.edu/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[sr320@washington.edu].[BiGill_methratio_v9_A.txt]&format=tsv'
--2013-08-05 12:18:34--  https://rest.sqlshare.escience.washington.edu/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[sr320@washington.edu].[BiGill_methratio_v9_A.txt]&format=tsv
Resolving rest.sqlshare.escience.washington.edu (rest.sqlshare.escience.washington.edu)... 65.52.16.106
Connecting to rest.sqlshare.escience.washington.edu (rest.sqlshare.escience.washington.edu)|65.52.16.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘file?sql=SELECT * FROM [sr320@washington.edu].[BiGill_methratio_v9_A.txt]&format=tsv’

    [                                   <=> ] 593,592,441  449KB/s     

Dan

dhalperi commented 11 years ago

Hi Sagar, were you able to download that file to completion? Even with wget, it stalled after about 1.5GB download.

schitnis commented 11 years ago

Hi Daniel,

Yes, I was able to download it it to completion (even with the tsv option).

https://rest.sqlshare.escience.washington.edu/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[sr320@washington.edu].[BiGill_methratio_v9_A.txt]&format=tsv

I used google chrome and it downloaded 5,212,574,528 bytes. (approximately 4.85 GB)

-Sagar

billhowe commented 11 years ago

Tried it from an EC2 instance and it seemed to work.

https://rest.sqlshare.escience.washington.edu/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[sr320@washington.edu].[BiGill_methratio_v9_A.txt]&format=tsvhttps://rest.sqlshare.escience.washington.edu/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20%5Bsr320@washington.edu%5D.%5BBiGill_methratio_v9_A.txt%5D&format=tsv

5,212,574,528 2.58M/s in 33m 58s

On Mon, Aug 5, 2013 at 3:12 PM, schitnis notifications@github.com wrote:

Hi Daniel,

Yes, I was able to download it it to completion (even with the tsv option).

https://rest.sqlshare.escience.washington.edu/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20[sr320@washington.edu].[BiGill_methratio_v9_A.txt]&format=tsvhttps://rest.sqlshare.escience.washington.edu/REST.svc/v1/db/file?sql=SELECT%20*%20FROM%20%5Bsr320@washington.edu%5D.%5BBiGill_methratio_v9_A.txt%5D&format=tsv

I used google chrome and it downloaded 5,212,574,528 bytes. (approximately 4.85 GB)

-Sagar

— Reply to this email directly or view it on GitHubhttps://github.com/uwescience/sqlshare/issues/11#issuecomment-22144940 .

dhalperi commented 11 years ago

Fixed here and in the python utils