jeff-cohere commented 3 months ago

This is a proposal to modify the KBase file staging service and its UI to recognize and parse DTS manifests and import their contents to KBase Narratives. The CE/DTS team believes this is the simplest approach to getting data into KBase from other organizations using the DTS.

Proposed Workflow

A KBase user transfers data to KBase using the DTS. The DTS deposits a payload into the user's KBase staging directory, which is visible to the file staging service and the bulk import UI. The payload consists of a folder (named after the UUID of the DTS transfer) containing the files transferred with all relative paths preserved from their source folders, plus a machine-readable manifest.json file with file metadata and an instructions field conveying any information needed by the file staging service to import the files into a KBase Narrative.
The KBase user navigates to their staging directory within the bulk import UI, and selects the folder representing the payload.
The KBase user selects the manifest.json file within the payload folder. This triggers a process within the file staging service to generate a bulk import spreadsheet appropriate for the payload. This process uses the instructions field within manifest.json to determine the appropriate file type for the spreadsheet, etc, and extracts the names of the files, using them to populate the generated spreadsheet, which appears in the user's staging directory, either within the DTS payload folder or at the top level.
The KBase user then imports the files into a KBase Narrative in the usual manner, using the generated spreadsheet.

The instructions field in the DTS manifest is itself a JSON object that can store any required information, so its content can be tailored to any of the supported import spreadsheet templates. Down the line, we may develop something more structured if it makes sense, but we can assume for the moment that a KBase-Narrative-specific set of instructions is acceptable.

Notes From Previous Discussions

Some recent changes introduce an instructions field in the DTS manifest that contains a JSON object that we can use to convey information about how the payload should be treated once it reaches its destination. This issue is an exploration of a "protocol" we might use to import the contents of a payload to a KBase narrative. This import process needs information from the requesting user because the metadata associated with the files in a payload is clearly insufficient for figuring out how the content should be translated into a narrative.

To be clear: the specification of any such protocol is not part of the DTS itself--this instructions field is simply a feature offered by the DTS to support the creation of these protocols. But this seems like a reasonable place to record our thoughts.

Provocative Questions

How much can we expect a UI to help a user "fill out" the import instructions for KBase? Can we do anything more than signaling to KBase that we know what kind of import template to generate? It turns out most of the import templates have parameter defaults. As long as we include files of only a single type in a payload and encode the file type in the payload's manifest (via the instructions field, say), we can write a program that monitors the KBase file staging area for payloads, parses manifests, and generates import template spreadsheets with these defaults inserted. This might be all we need to get off the ground.
How should the file type be encoded in the manifest? The optional instructions field in the manifest can refer to an aribtrary JSON object. In the case of a KBase payload to be imported to a narrative, we can have a JSON object with a single type field, whose value can be one of (say) assembly, genbank, gff-fasta, fastq-interleaved, fastq-noninterleaved, sra.

Types of KBase payloads

Assembly
Genbank genome
GFF+FASTA genome / metagenome
Interleaved FASTQ reads
Noninterleaved FASTQ reads
SRA reads

Assembly import

Required fields for each file:

staging_file_subdir_path: FASTA file path
assembly_name
type
min_contig_length

TSV import template

Genbank genome import

Required fields for each file:

staging_file_subdir_path
genome_name
genome_type
source
scientific_name
release
genetic_code
generate_ids_if_needed
generate_missing_genes

TSV import template

GFF+FASTA genome and metagenome import

Required fields for each file:

fasta_file
gff_file
genome_name
source
release
genetic_code
generate_missing_genes

TSV import template

Interleaved FASTQ reads import

Required fields for each file:

fastq_fwd_staging_file_name
name
sequencing_tech
single_genome
read_orientation_outward
insert_size_std_dev
insert_size_mean

TSV import template

Noninterleaved FASTQ reads import

Required fields for each file pair:

fastq_fwd_staging_file_name
fastq_rev_staging_file_name
name
sequencing_tech
single_genome
read_orientation_outward
insert_size_std_dev
insert_size_mean

TSV import template

SRA reads import

Required fields for each file:

sra_staging_file_name
name
sequencing_tech
single_genome
read_orientation_outward
insert_size_std_dev
insert_size_mean

TSV import template

References

briehl commented 2 months ago

I think this proposal looks reasonable, based on previous discussions, but I do have a couple comments and questions.

IMO step 3 (user selects a manifest file and that makes a spreadsheet) doesn't really need user interaction. If all the information is already present to make the import spec spreadsheet (and thus the bulk import cell), there are a couple options for automation already. One thought is that the staging service can detect that a manifest.json file was uploaded, and can parse it to an import spec automatically without user intervention. Alternately, we could just include the manifest files as allowed Import Specification files (along with csv, tsv, and excel). So the user would just select the manifest.json file, select "Import Specification" as the file type, and have it create the import cell from there. That would take a little modification to the bulk_specification endpoint in the staging service, but would save the user a step.

I guess that all depends on what's in the instructions field, though, or what's allowed there. Did you have thoughts on what to put in? Will users have to add things to fill out the required fields for different data types?

jeff-cohere commented 2 months ago

Thanks, Bill!

The instructions field can have whatever we need in it. Because the DTS is not a KBase-specific thing, this field is just a JSON object with whatever content a given organization needs to process the data upon arrival. I'm resisting the temptation to make a Universal Omni-Schema for this purpose, because from where I sit I see the wreckage of many failed attempts at such things for organizing metadata in the sort of files we're moving around. :-)

In the short term, let's just populate the instructions field with whatever makes it easiest for the staging service/bulk import process. This field should probably be populated by a UI and not a user in the greatness of time, but for the moment we have the luxury of doing what we need to and learning whatever lessons come with our decisions. Does that make sense to you?

jeff-cohere commented 2 months ago

Did you have thoughts on what to put in? Will users have to add things to fill out the required fields for different data types?

I think we should definitely include file types and whatever other fields are required as part of the corresponding bulk import spec. I put some user interactivity into Step 3 above to allow a user to override the defaults for fields based on the knowledge in their heads that doesn't reside anywhere else. Much of the information in the non-required fields in the import spec spreadsheets isn't available in the file metadata (at least not the stuff we get from the JGI Data Portal), so I don't think we have the means to populate the other fields in instructions. Unless we make that part of the UI by which a user requests a transfer, of course. :-) Curious to hear your thoughts on this.

EDIT: I just noticed that I put all the fields in the import template spreadsheets under "required fields" in the descripton of this issue. Zach Crockett and I think that only the file path is required for each of these templates, and that there are reasonable defaults in place for the other fields.

elishawc commented 2 months ago

Thanks for the review @briehl. Thoughts on who on the KBase side would be best able to accomplish an MVP on this and how long? 1 sprint?

briehl commented 2 months ago

Much of the information in the non-required fields in the import spec spreadsheets isn't available in the file metadata (at least not the stuff we get from the JGI Data Portal), so I don't think we have the means to populate the other fields in instructions.

I think making a UI for that would be non-trivial. IMO we can leave that blank in the generated spreadsheet, and either alert the user about it on creation, or just let them find out when they try to upload. They can always download that spreadsheet, alter it as they see fit, and re-upload.

briehl commented 2 weeks ago

Here's a schema-lite of the inputs for the various functions that are relevant to IMG data, like we discussed yesterday, and are the 6 data types listed in your original post above.

In general, I think an instructions block that looks like this would work best:

"instructions": {
    "data_type": "genbank_genome",
    "parameters": {
        "param1": "value1",
        "param2": "value2"
        ...etc
    }
}

This will be a mix of tricky / redundant for some of the multi-file types (gff+fasta genome, gff+fasta metagenome, non-interleaved reads), and might contain some redundant information, but that might be ok to start with. I've put a * as the first character of the required names, and put those on top of the set

Here's the list, and I'll follow up with a more realistic example:

Assembly data_type: assembly

{
    "*staging_file_subdir_path": "str, file path",
    "*assembly_name": "str, object_name",
    "type": "str, ['draft isolate', 'finished isolate', 'mag', 'sag', 'virus', 'plasmid', 'construct', 'metagenome']",
    "min_contig_length": "int"
}

Genbank genome data_type: genbank_genome

{
    "*staging_file_subdir_path": "str, file path",
    "*genome_name": "str, object_name",
    "genome_type": "str, ['draft isolate', 'finished isolate', 'mag', 'sag', 'virus', 'plasmid', 'construct']",
    "source": "str, ['RefSeq user', 'Ensembl user', 'Other']",
    "release": "str",
    "genetic_code": "int",
    "scientific_name": "str",
    "generate_ids_if_needed": "str",
    "generate_missing_genes": "str"
}

GFF+FASTA genome data_type: gff_genome

{
    "*fasta_file": "str, file path",
    "*gff_file": "str, file path",
    "*genome_name": "str, object_name",
    "genome_type": "str, ['draft isolate', 'finished isolate', 'fungi', 'mag', 'other Eukaryote', 'plant', 'sag', 'virus', 'plasmid', 'construct']",
    "scientific_name": "str",
    "source": "str, ['RefSeq user', 'Ensembl user', 'JGI', 'Other']",
    "taxon_wsname": "str",
    "release": "str",
    "genetic_code": "int",
    "generate_missing_genes": "str"
}

GFF+FASTA metagenome data_type: gff_metagenome

{
    "*fasta_file": "str, file path",
    "*gff_file": "str, file path",
    "*genome_name": "str, object_name",
    "source": "str, ['EBI user', 'IMG user', 'JGI user', 'BGI user', 'Other']",
    "release": "str",
    "genetic_code": "int",
    "generate_missing_genes": "str"
}

Interleaved FASTQ reads data_type: fastq_reads_interleaved

{
    "*fastq_fwd_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"

}

Noninterleaved FASTQ reads data_type: fastq_reads_noninterleaved

{
    "*fastq_fwd_staging_file_name": "str, file path",
    "*fastq_rev_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"
}

SRA reads data_type: sra_reads

{
    "*sra_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"
}

A simple example of just the required values for, say, a genbank genome would be:

"instructions": {
    "data_type": "genbank_genome",
    "parameters": {
        "staging_file_subdir_path": "path/to/some_genome.gbk",
        "genome_name": "some_genome"
    }
}

briehl commented 2 weeks ago

There's a couple additional things here.

The non-file parameters are all optional, and have reasonable defaults. Also, very easy for users to set during the import process. So I don't think it's worth worrying about those.
The object names currently don't get auto populated if they're missing. I think they could, generally, just be the id of the file source and users can change them to whatever they want.
The staging service, to keep as agnostic as it can, has no idea what these parameters are and how they're supposed to be used, and I'd rather not add information that if we can help it.

jeff-cohere commented 2 weeks ago

Thanks for the sketch, @briehl ! I think the redundancy is just fine, and I like that you've pushed the unstructured part of the schema into a parameters field.

What do you think about payloads that include several objects? A while back Zach did some digging and found some use cases that had 1-2 objects and some that had significantly more. How do you feel about something like this?

"instructions": {
    "protocol": "KBase narrative import",
    "objects": [
        {
            "data_type": "genbank_genome",
            "parameters": {
                "staging_file_subdir_path": "path/to/some_genome.gbk",
                "genome_name": "some_genome"
            }
        },
        ...
        {
            "data_type": "genbank_genome",
            "parameters": {
                "staging_file_subdir_path": "path/to/some_other_genome.gbk",
                "genome_name": "some_other_genome"
            }
        }
    ]   
}

(I added a protocol field that can be used to indicate that the creator has promised to create a compliant set of instructions.)

This structure implies, of course, that we allow more than one data type per payload. In principle this is nice and makes all kinds of things easy, but I'm not sure how it looks in practice.

briehl commented 2 weeks ago

I think that might work? I kinda figured that each file in the resources list would have its own set of instructions, which admittedly might get weird if, for example, a GFF file referred to both the GFF and FASTA files in instructions, and so does the matching FASTA file.

Does this mean putting the instructions field in its own separate field next to resources?

jeff-cohere commented 2 weeks ago

Does this mean putting the instructions field in its own separate field next to resources?

Yes, this was my original intent. Sorry if I wasn't clear about that! Early on, we were thinking that we would restrict file transfers to single file types for simplicity. But if it's practical to Import All The Things in one manifest, the instructions field is a single place that annotates all of the resources.

We're trying to make the DTS independent of any organization's internal workings, and separating instructions into its own field provides a simple structure that can be swapped out based on the destination of a file transfer. Also, it avoids the kind of weirdness you referred to when multiple files in the payload are associated via import instructions.

Let me know if you want to chat more about this, and do feel free to change what I've proposed in the instructions object.

jeff-cohere commented 2 weeks ago

Also, my example above doesn't take advantage of the fact that the manifest includes the resources array with all the relevant file paths. We should feel free to refer to the data resources in that array by name (or ID or whatever) in the instructions field if it helps.

briehl commented 2 weeks ago

I see! Yeah, that makes sense. So, just to summarize:

the manifest resources list stays as it is, a list of information about each file
the instructions field is unique for each target organization (I like the "protocol" field there), so for processing purposes that should be the only place that an org really needs to look.

So the structure is (roughly):

{
    "name": "manifest",
    "resources": [{
      "id": "foo",
      "name": "bar",
      ...etc
    }, ...],
    "instructions": {
        "protocol": "KBase narrative import",
        "objects": [...what you put above]
    }
}

Yeah, I agree that we could do some kinds of referencing to data sources from within the instructions field, but I'm not sure how best to do that that would be easy to interpret.

Still, this makes sense, and should in fact be easier to process. I misinterpreted having the instructions be attached to each file resource.

One thing that could be done, then, is to just put all the parameters of each data type as a list, so something like:

"instructions": {
    "protocol": "KBase narrative import",
    "objects": {
        "genbank_genome": [
            {
                "staging_file_subdir_path": "path/to/some_genome.gbk",
                "genome_name": "some_genome"
            }, 
            {
                "staging_file_subdir_path": "path/to/some_other_genome.gbk",
                "genome_name": "some_other_genome"
            }
        ]
    }
}

But I see the value in having a list of discrete objects, too, to keep them independent and to have an explicit data_type field. Still, I think it's practical to import All The Things at one time, as long as the data types exist for them. In fact, the parser is set up to do just that. I'll just tweak it to pull from a separate single field, which will in fact make it simpler!

kbase / dts

Proposal: Importing a DTS Payload to a KBase Narrative #79

Proposed Workflow

Notes From Previous Discussions

Provocative Questions

Types of KBase payloads

Assembly import

Genbank genome import

GFF+FASTA genome and metagenome import

Interleaved FASTQ reads import

Noninterleaved FASTQ reads import

SRA reads import

References