kbase / staging_service

MIT License

Add the ability to parse a DTS transfer manifest into bulk import specifications #200

Open jeff-cohere opened 4 months ago

jeff-cohere commented 4 months ago

Now that the Data Transfer Service is ostensibly working, it would be convenient for the staging service to be able to parse the contents of a manifest.json file (deposited in a transfer-specific directory within a KBase user's staging area) into one or more xSV files that can then be imported using the existing bulk import functionality. I'm using this issue to track my investigation of how we might add such a feature.
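A rough sketch of the core transformation, assuming a hypothetical manifest layout in which each entry carries a file path, a name, and a KBase data type (the actual DTS manifest schema may differ):

```python
import csv
import json
from collections import defaultdict
from pathlib import Path


def manifest_to_xsv(manifest_path: str, out_dir: str) -> list[str]:
    """Split a DTS manifest into one CSV file per data type.

    Assumes each manifest entry looks like
    {"path": "...", "type": "...", "name": "..."} -- the real
    DTS manifest schema may differ.
    """
    entries = json.loads(Path(manifest_path).read_text())

    # Group entries by their declared data type; each group
    # becomes one import specification CSV.
    by_type: dict[str, list[dict]] = defaultdict(list)
    for entry in entries:
        by_type[entry["type"]].append(entry)

    written = []
    for dtype, rows in by_type.items():
        out_file = Path(out_dir) / f"{dtype}.csv"
        with out_file.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=["path", "name"])
            writer.writeheader()
            for row in rows:
                writer.writerow({"path": row["path"], "name": row["name"]})
        written.append(str(out_file))
    return written
```

The per-type grouping reflects the constraint noted below: a manifest mixing data types necessarily yields multiple import specification files.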

Assessment

The KBase Narrative itself defines a SetupImportCells function that interacts with the staging service to parse files for import, including the following:

Downstream, this function calls a bulkSpecification method which is mapped to the staging service's bulk_specification endpoint here.

Upstream, the function is called by the KBase staging area viewer UI code here.

Proposed approach

We could modify the SetupImportCells function to add support for the intermediate parsing of a DTS transfer manifest which produces a set of import specification spreadsheets. Currently, the function produces its "import app cells" in 3 stages:

  1. Filter the file infos into bins based on their given type - single upload, bulk upload, or xsv (that's CSV, TSV, Excel)
  2. Get all the extended file info from the xsv files to push into the bulk uploader section
  3. Make the bulk import and other singleton import cells.

We could modify this procedure to inject logic to parse a DTS transfer manifest if it was found among the selected set of files (passed to the function in fileInfo):

  1. Filter the file infos into bins based on their given type - single upload, bulk upload, DTS manifest, or xsv (that's CSV, TSV, Excel)
  2. Parse any selected DTS manifest file, writing out xsv files and adding them to the list of xsv files to parse for individual files.
  3. Get all the extended file info from the xsv files to push into the bulk uploader section
  4. Make the bulk import and other singleton import cells.
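The new stage 1 amounts to a dispatch on file type. A sketch of that binning logic in Python (the real function lives in the Narrative's JavaScript, and the type tags and the shape of fileInfo here are illustrative stand-ins):

```python
def bin_file_infos(file_infos: list[dict]) -> dict[str, list[dict]]:
    """Stage 1: split the selected files into import bins.

    The bin names and the 'type' field are illustrative; the
    actual Narrative code uses its own fileInfo structure.
    """
    xsv_types = {"csv", "tsv", "excel"}
    bins: dict[str, list[dict]] = {
        "single": [], "bulk": [], "dts_manifest": [], "xsv": [],
    }
    for info in file_infos:
        ftype = info["type"]
        if ftype == "dts_manifest":
            # Stage 2 would parse these into xsv files and feed
            # the results back into the "xsv" bin.
            bins["dts_manifest"].append(info)
        elif ftype in xsv_types:
            bins["xsv"].append(info)
        elif info.get("bulk"):
            bins["bulk"].append(info)
        else:
            bins["single"].append(info)
    return bins
```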

We would have to add another endpoint to the staging service to support stage 2, but this seems easier than modifying existing endpoints and changing their assumptions. I'll continue to update this issue as I find out more.

Risks / Constraints

Because this approach is incremental and depends on the existing import specification machinery in the staging service, it carries with it all of the related limitations and drawbacks:

  1. Speed: the bulk import spec can be very slow for >500 files, making the Narrative unresponsive in the process.
  2. UI clunkiness: any manifest containing files of differing "data types" necessarily generates multiple import specification spreadsheets. This might be confusing for users if many such spreadsheets are generated from a single manifest.

I'm more concerned about item 1 (can we improve the performance of the staging service without a ton of work?). But I assume that we'll eventually want something a bit more automated or less finicky for users to work with.

References

briehl commented 4 months ago

In general, I think this is fine. I have some thoughts in no particular order:

Just so I understand the whole workflow / user story here:

  1. User uses the DTS to import files from outside of KBase, and not on their own computer (e.g. IMG).
  2. These data files get imported to their KBase staging area along with a manifest from DTS.
  3. Something TBD (probably either the DTS or staging service) parses a manifest file and produces one or more xsv files to the staging area.
  4. User uses those XSV files to import the data as in the usual narrative data import workflow.
  5. There's now a bunch of data objects from an external data source ready to operate on in the user's narrative.

I think other folks have gone over the main constraint enough - the Narrative UI is basically inoperable for >500 data objects at a time. This is mostly because our data import MVP never really got past MVP. The next steps would be to handle that with some UI work. That should be part of this project if going through the Narrative UI is the best way.

Although, I'd have to say, if you're trying to manually validate >500 lines in anything - an XSV file or a narrative import cell - that sounds like a really tedious job, and probably prone to some user error anyway. Still, there's enough automated error detection in there for missing or malformed strings and the like that it might be useful.

Anyway, I see a few options here, with some pros and cons to each.

1. Go as outlined and use the bulk import cell.

This will entail the following:

a. Address the UI issues around having lots of files input at a time. I can't really make a good estimate of how much work this will be, but I don't think it's entirely trivial. Having the UI paginate the inputs down to, say, 20-100 at a time won't be terrible, but might have hidden UX caveats.

b. Combining all manifest files into a single Excel file with multiple sheets would probably be the simplest for users, and is already supported. Otherwise, it's best to dump them all (all of the translated manifest files) into the same directory at the top level of your DTS import directory.

c. Another manifest file option might be to add that file as an option for the Import Specification dropdown. When a user selects "Import Specification" as a file type, it sends that file off to the staging service to get parsed and returned. So we'd just have to teach the staging service how to interpret that JSON file and return its contents in the format the front end expects. This would skip having to make an XSV file out of it.
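Option (c) above could reuse whatever response shape the front end already consumes from bulk_specification. A rough sketch of the translation, assuming a hypothetical manifest layout and a hypothetical response format (both are guesses, not the actual schemas):

```python
import json
from collections import defaultdict
from pathlib import Path


def manifest_to_import_spec(manifest_path: str) -> dict:
    """Translate a DTS manifest directly into an import-spec
    response, skipping the intermediate xSV file entirely.

    Both the manifest entry fields and the response layout are
    assumptions for illustration; the real staging service
    contract would need to be matched exactly.
    """
    entries = json.loads(Path(manifest_path).read_text())
    by_type = defaultdict(list)
    for e in entries:
        by_type[e["type"]].append({"file_path": e["path"], "name": e["name"]})
    return {
        "types": {
            dtype: {
                # Column ordering/labels the UI would render.
                "order_and_display": [["file_path", "File path"],
                                      ["name", "Name"]],
                "data": rows,
            }
            for dtype, rows in by_type.items()
        }
    }
```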

2. Work outside the Narrative interface

Another option is to skip the narrative all together, or come up with some other interface. The bulk import cell is a really fancy code generator - it generates a set of inputs that start a bulk app job runner which is an API call to the execution engine. The bulk import cell has a bunch of other helpful features, like input validation and job monitoring and restarting. But in the end, it just sends off a ton of import jobs to the execution engine and watches them run, then we wind up with data in the narrative. Short-cutting that would also be clunky as hell and hard to monitor, but I'd be remiss if I didn't put it as an option. :)

briehl commented 4 months ago

Really I think a few steps here would handle most of this task:

  1. Get the staging service to parse manifest.json files as bulk import specifications.
  2. As I recall, the main issue in the UI is navigating hundreds of import file lines at once. It's worth spending some time looking at how to deal with that in a way that doesn't crash the browser.

There's also the issue of adding GFF + FASTA genome imports to the staging service, and what ripple effects that'll have on the interface, but if that's part of the MVP we have to handle that anyway.

jeff-cohere commented 4 months ago

Thanks for the thoughtful analysis, Bill! I think I can tackle item 1 fairly easily as long as we decide specifically on the input (/<user>/<transfer-uuid>/manifest.json) and output (/<user>/<transfer-uuid>/manifest.xlsx?).

MrCreosote commented 4 months ago

I'm more concerned about item 1 (can we improve the performance of the staging service without a ton of work?)

Just to be clear, it's not the staging service that's the issue, it's the narrative import UI. I would guess the staging service could handle 100k item manifests with no issue

MrCreosote commented 4 months ago

My take on this is that it's putting way too much impetus on the user and adding lots of complication when it's not necessary. If there's enough information in the DTS manifest to parse it into a bulk import manifest, there's enough info to just kick off jobs directly. What I would do is:

jeff-cohere commented 4 months ago

I agree completely that my proposal isn't great from a user's perspective. I was assuming that no one had a lot of effort to throw at this task besides myself, so I tried to figure out a way to elbow the DTS manifest into the existing machinery without adding another service. But if the lightweight service you mention could be added without disrupting anything (by me or someone else), and if there's no perceived need for a user to be able to "pick through things" before doing bulk imports, I'd love to go in the direction that you suggest.

jeff-cohere commented 4 months ago

By the way, we're focused on manifests as import instructions to avoid organization-specific machinery like KBase API calls. We want to see how far we can get in the direction of an "organization-agnostic" service.

MrCreosote commented 4 months ago

My guess is that updating the staging service to parse the manifest + fixing the narrative UI would be comparable (maybe more) amounts of work to putting up a simple fastapi service to parse the manifest and kick off a job, but of course I'm looking at this from a very high-level view.

Although... it might not need to be a service per se; it could just be a daemon. But making it a service makes it easier to monitor.
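The core of such a service would be a small translation from manifest entries to execution-engine job requests. A sketch of that translation only, with the fastapi wiring and the actual job-submission call omitted; the app IDs and field names here are made up for illustration:

```python
import json
from pathlib import Path

# Hypothetical mapping from manifest data types to importer app
# IDs; the real IDs would come from KBase's importer catalog.
IMPORTER_APPS = {
    "assembly": "kb_uploadmethods/import_fasta_as_assembly",
    "reads": "kb_uploadmethods/import_fastq_as_reads",
}


def manifest_to_job_requests(manifest_path: str,
                             narrative_id: int) -> list[dict]:
    """Build one job request per manifest entry.

    Entries whose type has no known importer are skipped here; a
    real service should collect and report them instead.
    """
    entries = json.loads(Path(manifest_path).read_text())
    jobs = []
    for e in entries:
        app_id = IMPORTER_APPS.get(e["type"])
        if app_id is None:
            continue
        jobs.append({
            "app_id": app_id,
            "narrative_id": narrative_id,
            "params": {
                "staging_file_path": e["path"],
                "object_name": e["name"],
            },
        })
    return jobs
```

A fastapi endpoint (or a daemon's poll handler) would then hand each request off to the execution engine and return job IDs for monitoring.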

jeff-cohere commented 4 months ago

If creating a new process (daemon or service) is practical, let's do it. Hacking on existing infrastructure and changing the way things work therein is much riskier than building something for a specific purpose in my experience.

briehl commented 3 months ago

I'm just catching up here.

It seems like it would be ideal to reduce or remove the user interaction here, especially if they already know what external data they're transferring, AND there's enough information about each data object to run each importer. I guess I'm saying that if it's possible to minimize the user having to sift through hundreds of lines of imports, we should.

I guess that could either be automatic or not. Automatic would be the service that @MrCreosote describes - it watches the DTN node for some directory that hosts the manifest files and launches the bulk import job based on those. A non-automatic way would be for a user to click a button to go through it, somewhere. Probably in the Narrative import panel. It could give a quick digest when it starts - maybe just a count of how many of each object type - and tell you how to get job status before starting up.

That might be a slightly unnecessary step, and could be decided upstream - the user selects what narrative to import data into in the DTS service, and it could do the copying to staging and start the jobs. Job monitoring, error handling, and restarts might need to get handled elsewhere - possibly in some modifications to the job browser or something in the narrative.

ialarmedalien commented 3 months ago

A few things we need to consider with whatever solutions are proposed:

jeff-cohere commented 3 months ago

That might be a slightly unnecessary step, and could be decided upstream - the user selects what narrative to import data into in the DTS service, and it could do the copying to staging and start the jobs.

It would be ideal if we could avoid sticking KBase-specific things (like narratives) into the DTS, since the DTS isn't a KBase-specific thing, but we could tuck this information into the manifest if we needed to.

jeff-cohere commented 3 months ago

We discussed this at our weekly DTS "heartbeat" meeting, and here's what we'd like to do:

So far, we've been dancing around the search/request process, avoiding discussing it in detail. I think now it's time to start nailing things down. I'd like to put together a Python DTS client that exposes the functionality we need to search for files and request specific IDs for transfer. I think this would help us understand the space of search and request parameters/filters and would allow us to start stress testing the DTS without a lot of UI-specific time and effort.

Does this seem like a reasonable course to you, @MrCreosote and @briehl?

MrCreosote commented 3 months ago

The import could be initiated by either a UI event or a polling process.

There are libs that allow you to react to events in a filesystem, like a file getting added to a directory
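watchdog is one such library; a dependency-free alternative is a plain polling loop. A minimal polling sketch that looks for new manifest.json files under a staging root, assuming the /<user>/<transfer-uuid>/manifest.json layout discussed earlier:

```python
import time
from pathlib import Path


def find_new_manifests(staging_root: str, seen: set[str]) -> list[Path]:
    """Return manifest.json files not handled in earlier polls.

    Assumes manifests land at <staging_root>/<user>/<transfer-id>/
    manifest.json; `seen` persists across polls to deduplicate.
    """
    found = []
    for manifest in Path(staging_root).glob("*/*/manifest.json"):
        key = str(manifest)
        if key not in seen:
            seen.add(key)
            found.append(manifest)
    return found


def poll_forever(staging_root: str, interval_s: float = 30.0) -> None:
    """Poll loop: a real daemon would kick off import jobs for each
    new manifest; here we just print the path."""
    seen: set[str] = set()
    while True:
        for manifest in find_new_manifests(staging_root, seen):
            print(f"new manifest: {manifest}")
        time.sleep(interval_s)
```

An event-driven version (e.g. watchdog's observer watching for file-creation events) avoids the polling latency, at the cost of handling partially written files carefully.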