
VDJServer downloads #52

Closed schristley closed 1 year ago

schristley commented 3 years ago

VDJServer has limits (data size, timeouts) on the synchronous ADC API which make large downloads cumbersome or infeasible. Will implement an asynchronous query API.

bcorrie commented 3 years ago

@schristley should we close the old issue ireceptor-plus/WP2#8?

schristley commented 3 years ago

We recently got the LRQ (long running query) API from Tapis, but will not have a chance to implement it by M24, so removing from D1.6

schristley commented 3 years ago

Hey @bcorrie @jeromejaglale, I have a prototype working to accept asynchronous queries, which get passed to TACC's LRQ API to perform the query, and then notifies when the query is done. The raw output data gets put into a Tapis file. I discovered 5-6 issues with the LRQ; they are mostly minor and the core functionality works well. I need to do a little more work to format the raw data so it can be downloaded in its final format.

The idea here is to replace the current "hack" in the gateway download code, which iterates queries to get the data, with a single asynchronous query request. The question for you is how (and if) you want to be notified when the data is ready. The basic notification mechanism is to send an http/https request, but this requires that you have a public endpoint to accept the notifications. Of course, you can poll, though it's not as efficient.

The other question is how to retrieve the data. The data is stored behind the Tapis authentication wall, but I can provide a URL for file download. I think that is better than streaming the data back through an endpoint.

I also have these questions in the PR for the async API.

jeromejaglale commented 3 years ago

Polling, then download using a URL, at least to start with?

As a first implementation, we're probably going to keep the gateway download job running, checking regularly until the download is ready, and then do the actual download.

Does that seem reasonable?

Also, how long will the downloads be kept? Any restriction on the number of rearrangements? Or would downloading all rearrangements from VDJServer be potentially possible? :-)

schristley commented 3 years ago

> Polling, then download using a URL, at least to start with?
>
> As a first implementation, we're probably going to keep the gateway download job running, checking regularly until the download is ready, and then do the actual download.
>
> Does that seem reasonable?

Yep. We haven't actually designed the notification protocol, but polling should work from the start.

> Also, how long will the downloads be kept? Any restriction on the number of rearrangements? Or would downloading all rearrangements from VDJServer be potentially possible? :-)

These will be config parameters, so we can adjust them. Feel free to suggest values. There will be an upper limit on the number of rearrangements. I don't think it should be as high as the whole DB (2.5B currently). I think you guys have a limit, right? Maybe 500M? I think that's roughly 1TB uncompressed, which can still take a while to download...

bcorrie commented 3 years ago

> These will be config parameters, so we can adjust them. Feel free to suggest values. There will be an upper limit on the number of rearrangements. I don't think it should be as high as the whole DB (2.5B currently). I think you guys have a limit, right? Maybe 500M? I think that's roughly 1TB uncompressed, which can still take a while to download...

Yes, we are set at 500M at the moment, which is slightly larger than our largest study. So that seemed to make sense.

I should point out that this is a Gateway limit for us, not an API limit. The API will not restrict the download; you can just ask for everything if you want 8-). Of course, our 500M limit on the Gateway is about the same as our largest repository, so it is basically equivalent...

From our experience with large downloads, we don't really see many issues at the API level (network drops, transfer issues), so I don't think going larger would be an issue other than it would take longer.

Our main reason for the limit is we didn't want to encourage people to download the whole ADC (4 billion rearrangements) with the click of a button - which would be way too easy to do on the Gateway otherwise 8-)

schristley commented 3 years ago

Just like with the normal ADC API. We should define these limits and have them exposed in the /info endpoint so that clients know the settings.
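
For example, the async /info response could advertise something like this (the limit fields here are just a suggestion, not part of any spec yet):

{
  "title": "VDJServer ADC ASYNC API",
  "version": "1.0.0",
  "max_rearrangements": 500000000,
  "download_expiration_days": 7
}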

jeromejaglale commented 3 years ago

Sounds good. Ideally, it would be nice if the user (gateway or API) could download any study at once. So, when the limit is reached, the error message can include a workaround: downloading one study at a time?

schristley commented 3 years ago

> Sounds good. Ideally, it would be nice if the user (gateway or API) could download any study at once. So, when the limit is reached, the error message can include a workaround: downloading one study at a time?

Possibly, but right now the largest study in VDJServer has ~1B rearrangements. There is a second study still to be loaded that is just as large. How long does it actually take to dump 500M from a single repository?

schristley commented 3 years ago

Hey @bcorrie @jeromejaglale, I have a staging service running for you to try. Right now the maximum is fairly low at 30M. Under some circumstances, space on the DB gets used to satisfy the query, and we are low on disk space until TACC upgrades us (in 2-3 weeks I hope), so I want to avoid accidentally crashing the DB. There's still a bunch of error checking code and cleanup I need to do before it can go live.

This staging service also has the normal ADC API running. I would suggest trying it from your staging service to test. Note that I'm using OpenAPI V3 now, for both the ADC API and the ASYNC API. One thing I've noticed with the OpenAPI V3 middleware is that it is more strict about content type: you need to specify application/json when sending JSON.

Here's an example of how to use it. The ASYNC API is meant to accept the same input as the ADC API. Here's a simple query file (repertoire_single.json):

{
  "filters": {
    "op": "=",
    "content": {
      "field": "repertoire_id",
      "value": "2564613624180576746-242ac113-0001-012"
    }
  },
  "format":"tsv"
}

Send the POST query request:

curl -k -H 'content-type: application/json' --data @repertoire_single.json https://vdj-staging.tacc.utexas.edu/airr/async/v1/rearrangement

which returns the result:

{
  "message": "rearrangement lrq submitted.",
  "query_id": "7208214340489646571-242ac118-0001-012"
}

You can then request the status:

curl -k https://vdj-staging.tacc.utexas.edu/airr/async/v1/status/7208214340489646571-242ac118-0001-012
{
  "query_id": "7208214340489646571-242ac118-0001-012",
  "endpoint": "rearrangement",
  "status": "SUBMITTED",
  "message": null,
  "created": "2021-03-05T08:59:31.531-06:00",
  "final_file": null,
  "download_url": null
}

A request goes through multiple stages, as reflected by the status. Upon initial request, it is PENDING. The next stage is COUNTING, where the size of the result set is determined; the query goes into ERROR status if the count is greater than the maximum, otherwise it goes to SUBMITTED to extract the data. The raw data from the database isn't in the proper AIRR format, so the next stage is PROCESSING, which formats the data into a final file. Finally, a public download URL is created and the status is FINISHED. ERROR and FINISHED are both final states.

{
  "query_id": "7208214340489646571-242ac118-0001-012",
  "endpoint": "rearrangement",
  "status": "FINISHED",
  "message": null,
  "created": "2021-03-05T08:59:31.531-06:00",
  "final_file": "7208214340489646571-242ac118-0001-012.airr.tsv.gz",
  "download_url": "https://vdj-agave-api.tacc.utexas.edu/postits/v2/964c8e41-8fc9-48da-97a4-d178cb6d864e-010"
}

Now you can download the data with the URL. The data is already gzipped.

curl -o data.airr.tsv.gz https://vdj-agave-api.tacc.utexas.edu/postits/v2/964c8e41-8fc9-48da-97a4-d178cb6d864e-010

For the download URL, I'm using the Tapis postits API with an expiration (currently short at 1hr for testing) and a maximum number of uses (1000, which should be enough for plenty of retries), so expiration is handled by postits; that is, when the postit expires, the download URL returns an error instead of providing the data. What I still need to write is an expiration handler, something that goes through all of the expired queries on a periodic basis and cleans stuff up. I'm thinking that I will change the query status to EXPIRED.
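
Putting it together, a minimal client-side polling loop could look something like this (a sketch only, assuming jq is installed; the 60s interval is arbitrary and EXPIRED is the proposed cleanup status, not implemented yet):

#!/bin/bash
# Submit the asynchronous query and capture the query_id.
QUERY_ID=$(curl -s -k -H 'content-type: application/json' \
  --data @repertoire_single.json \
  https://vdj-staging.tacc.utexas.edu/airr/async/v1/rearrangement | jq -r '.query_id')

# Poll until the query reaches a final state.
while true; do
  STATUS_JSON=$(curl -s -k "https://vdj-staging.tacc.utexas.edu/airr/async/v1/status/${QUERY_ID}")
  STATUS=$(echo "${STATUS_JSON}" | jq -r '.status')
  case ${STATUS} in
    FINISHED) break ;;
    ERROR|EXPIRED) echo "query failed: ${STATUS_JSON}"; exit 1 ;;
    *) sleep 60 ;;  # PENDING, COUNTING, SUBMITTED, PROCESSING
  esac
done

# Download the gzipped result via the postit URL.
curl -o data.airr.tsv.gz "$(echo "${STATUS_JSON}" | jq -r '.download_url')"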

bcorrie commented 3 years ago

> Sounds good. Ideally, it would be nice if the user (gateway or API) could download any study at once. So, when the limit is reached, the error message can include a workaround: downloading one study at a time?

> Possibly, but right now the largest study in VDJServer has ~1B rearrangements. There is a second study still to be loaded that is just as large. How long does it actually take to dump 500M from a single repository?

@schristley downloading everything from IPA5 (a single large study) is 426,431,225 rearrangements, and it took 15 hours. That is to extract the data from Mongo and stream it to the client who is downloading it (the Gateway in this case). This was a user download on Jan 26, which is the most recent download of that size. I think that is probably our largest study...

schristley commented 3 years ago

> downloading everything from IPA5 (a single large study) is 426,431,225 rearrangements, and it took 15 hours. That is to extract the data from Mongo and stream it to the client who is downloading it (the Gateway in this case).

That's pretty good, almost 30M/hour. I'm getting nowhere near that. I'm doing a download for a ~16M repertoire and it's taking roughly 2 hours, plus it might take another hour to format into the final file. I'm not sure if it's because the data is being written to the network disk, because it's being gzipped, or because Tapis discovered the slowest way to extract data. Or some combination of the three...

bcorrie commented 3 years ago

We stream uncompressed; the Gateway combines all of the uncompressed downloads and then ZIPs them into a single archive that provides structure to the overall download (repertoire/rearrangement file pairs from different repositories and a summary info file). The compression on the Gateway takes as long as the download... We chose not to compress on the repository service because the bottleneck is not the network in this case. The slow part is the extraction from Mongo and the compression. We stream from Mongo directly to the network, so the data doesn't touch the disk and doesn't get compressed as it streams out over the network.
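
In shell terms, the flow is roughly this (illustrative host and file names, not the Gateway's actual code):

# Stream uncompressed TSV from each repository's ADC API straight to a file...
curl -s -H 'content-type: application/json' --data @query.json \
  https://repository.example.org/airr/v1/rearrangement > ipa5.airr.tsv
# ...repeat per repository, then compress once on the Gateway into a single structured archive.
zip download.zip *.airr.tsv repertoires.json info.txt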

schristley commented 3 years ago

If you had the choice of getting compressed or uncompressed data from VDJServer, which would you prefer? I suppose we could also think about adding a parameter to the API. While transfer from TACC over to CC should be fast, I was assuming compressed because regular users rarely have such a big network pipe.

It's also a bummer putting a gz into a zip; it's double compression that's not needed.
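
For illustration, a request with such a parameter might look like this (the compression field is hypothetical, not part of the current API):

{
  "filters": {
    "op": "=",
    "content": {
      "field": "repertoire_id",
      "value": "2564613624180576746-242ac113-0001-012"
    }
  },
  "format": "tsv",
  "compression": "gzip"
}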

schristley commented 3 years ago

@bcorrie I'm curious, for your downloads, do you just hit each of your repositories through the ADC API? Or do you optimize a little by going directly to the DB? I guess it's the former, because it ends up being the exact same code anyway...

schristley commented 3 years ago

> ... plus it might take another hour to format into the final file.

I'm wrong; it's taking closer to 3-4 hours to format the file, which involves a gunzip and a gzip. It has to be the compression that's eating up all the time, because the file is only ~5GB.

schristley commented 3 years ago

> downloading everything from IPA5 (a single large study) is 426,431,225 rearrangements, and it took 15 hours. That is to extract the data from Mongo and stream it to the client who is downloading it (the Gateway in this case).

@bcorrie What's the approximate uncompressed data size, close to 1TB?

jeromejaglale commented 3 years ago

@schristley, is /repertoire working on the staging service? I tried both https://vdj-staging.tacc.utexas.edu/airr/v1/repertoire and https://vdj-staging.tacc.utexas.edu/airr/async/v1/repertoire and they return a 500 error:

[{"message":"Unsupported Content-Type application/x-www-form-urlencoded"}]

schristley commented 3 years ago

Hi @jeromejaglale, yes, the service is up and running. I think you are hitting an OpenAPI V3 middleware issue; I've discovered it is much more strict about the content-type, so when you send JSON you need to set the appropriate content type. Can you try setting the header "content-type: application/json" with your request?
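
For example, something like this should get past the content-type check (the empty JSON body just queries for all repertoires):

curl -k -H 'content-type: application/json' --data '{}' https://vdj-staging.tacc.utexas.edu/airr/v1/repertoire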

jeromejaglale commented 3 years ago

@schristley, this request returns {"message":"Unknown error"}:

{
  "filters": {
    "op": "in",
    "content": {
      "field": "repertoire_id",
      "value": [
        "2586732705754976746-242ac113-0001-012"
      ]
    }
  },
  "format": "tsv"
}

Is it because it's an in query and the value array has only one value? That should work, right? It works fine in SQL, for example: select * from some_table where id in (15);

schristley commented 3 years ago

It's a bug in my code. I'm trying to store a Mongo query and it doesn't like that; I need to rework that.

schristley commented 3 years ago

@jeromejaglale OK, I put in a fix; your query looks to be working.

schristley commented 1 year ago

async API is in production