I think this proposal looks reasonable, based on previous discussions, but I do have a couple comments and questions.
IMO step 3 (user selects a manifest file and that makes a spreadsheet) doesn't really need user interaction. If all the information is already present to make the import spec spreadsheet (and thus the bulk import cell), there are a couple of options for automation already. One thought is that the staging service can detect that a `manifest.json` file was uploaded, and can parse it into an import spec automatically without user intervention. Alternately, we could just include manifest files among the allowed Import Specification files (along with csv, tsv, and excel). The user would then just select the `manifest.json` file, select "Import Specification" as the file type, and have it create the import cell from there. That would take a little modification to the `bulk_specification` endpoint in the staging service, but would save the user a step.
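A minimal sketch of that second option, with illustrative names only (the real `bulk_specification` endpoint's internals will differ):

```python
import json
from pathlib import Path

# Sketch: treat manifest.json as another allowed Import Specification file
# type. Names here are illustrative, not the staging service's actual
# internals; "excel" is assumed to mean .xlsx.
ALLOWED_SPEC_SUFFIXES = {".csv", ".tsv", ".xlsx"}

def load_import_spec_source(path: str) -> dict:
    """Return the raw material for building an import spec, whether the
    user picked a spreadsheet or a DTS manifest."""
    p = Path(path)
    if p.name == "manifest.json":
        with p.open() as f:
            return {"kind": "dts_manifest", "data": json.load(f)}
    if p.suffix in ALLOWED_SPEC_SUFFIXES:
        return {"kind": "spreadsheet", "data": str(p)}
    raise ValueError(f"{p.name} is not a recognized import specification file")
```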
I guess that all depends on what's in the `instructions` field, though, or what's allowed there. Did you have thoughts on what to put in? Will users have to add things to fill out the required fields for different data types?
Thanks, Bill!
The `instructions` field can have whatever we need in it. Because the DTS is not a KBase-specific thing, this field is just a JSON object with whatever content a given organization needs to process the data upon arrival. I'm resisting the temptation to make a Universal Omni-Schema for this purpose, because from where I sit I see the wreckage of many failed attempts at such things for organizing metadata in the sort of files we're moving around. :-)
In the short term, let's just populate the `instructions` field with whatever makes it easiest for the staging service/bulk import process. In the fullness of time this field should probably be populated by a UI and not a user, but for the moment we have the luxury of doing what we need to and learning whatever lessons come with our decisions. Does that make sense to you?
> Did you have thoughts on what to put in? Will users have to add things to fill out the required fields for different data types?
I think we should definitely include file types and whatever other fields are required as part of the corresponding bulk import spec. I put some user interactivity into Step 3 above to allow a user to override the defaults for fields based on the knowledge in their heads that doesn't reside anywhere else. Much of the information in the non-required fields in the import spec spreadsheets isn't available in the file metadata (at least not the stuff we get from the JGI Data Portal), so I don't think we have the means to populate the other fields in `instructions`. Unless we make that part of the UI by which a user requests a transfer, of course. :-) Curious to hear your thoughts on this.

EDIT: I just noticed that I put all the fields in the import template spreadsheets under "required fields" in the description of this issue. Zach Crockett and I think that only the file path is required for each of these templates, and that there are reasonable defaults in place for the other fields.
Thanks for the review @briehl. Thoughts on who on the KBase side would be best able to accomplish an MVP on this and how long? 1 sprint?
> Much of the information in the non-required fields in the import spec spreadsheets isn't available in the file metadata (at least not the stuff we get from the JGI Data Portal), so I don't think we have the means to populate the other fields in `instructions`.
I think making a UI for that would be non-trivial. IMO we can leave that blank in the generated spreadsheet, and either alert the user about it on creation, or just let them find out when they try to upload. They can always download that spreadsheet, alter it as they see fit, and re-upload.
Here's a schema-lite of the inputs for the various functions that are relevant to IMG data, as we discussed yesterday; these are the 6 data types listed in your original post above.
In general, I think an instructions block that looks like this would work best:
"instructions": {
"data_type": "genbank_genome",
"parameters": {
"param1": "value1",
"param2": "value2"
...etc
}
}
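As a typed sketch of that block (the class name is mine, not anything that exists in the staging service today):

```python
from typing import Any, TypedDict

# Illustrative model of the proposed instructions block, not an existing type.
class Instructions(TypedDict):
    data_type: str              # e.g. "genbank_genome"
    parameters: dict[str, Any]  # importer parameters, keyed by field name
```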
This will be a bit tricky for some of the multi-file types (gff+fasta genome, gff+fasta metagenome, non-interleaved reads) and might contain some redundant information, but that might be ok to start with. I've put a `*` as the first character of the required names and put those on top of the set. Here's the list, and I'll follow up with a more realistic example:
Assembly

data_type: `assembly`

```json
{
    "*staging_file_subdir_path": "str, file path",
    "*assembly_name": "str, object_name",
    "type": "str, ['draft isolate', 'finished isolate', 'mag', 'sag', 'virus', 'plasmid', 'construct', 'metagenome']",
    "min_contig_length": "int"
}
```

Genbank genome

data_type: `genbank_genome`

```json
{
    "*staging_file_subdir_path": "str, file path",
    "*genome_name": "str, object_name",
    "genome_type": "str, ['draft isolate', 'finished isolate', 'mag', 'sag', 'virus', 'plasmid', 'construct']",
    "source": "str, ['RefSeq user', 'Ensembl user', 'Other']",
    "release": "str",
    "genetic_code": "int",
    "scientific_name": "str",
    "generate_ids_if_needed": "str",
    "generate_missing_genes": "str"
}
```

GFF+FASTA genome

data_type: `gff_genome`

```json
{
    "*fasta_file": "str, file path",
    "*gff_file": "str, file path",
    "*genome_name": "str, object_name",
    "genome_type": "str, ['draft isolate', 'finished isolate', 'fungi', 'mag', 'other Eukaryote', 'plant', 'sag', 'virus', 'plasmid', 'construct']",
    "scientific_name": "str",
    "source": "str, ['RefSeq user', 'Ensembl user', 'JGI', 'Other']",
    "taxon_wsname": "str",
    "release": "str",
    "genetic_code": "int",
    "generate_missing_genes": "str"
}
```

GFF+FASTA metagenome

data_type: `gff_metagenome`

```json
{
    "*fasta_file": "str, file path",
    "*gff_file": "str, file path",
    "*genome_name": "str, object_name",
    "source": "str, ['EBI user', 'IMG user', 'JGI user', 'BGI user', 'Other']",
    "release": "str",
    "genetic_code": "int",
    "generate_missing_genes": "str"
}
```

Interleaved FASTQ reads

data_type: `fastq_reads_interleaved`

```json
{
    "*fastq_fwd_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"
}
```

Noninterleaved FASTQ reads

data_type: `fastq_reads_noninterleaved`

```json
{
    "*fastq_fwd_staging_file_name": "str, file path",
    "*fastq_rev_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"
}
```

SRA reads

data_type: `sra_reads`

```json
{
    "*sra_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"
}
```
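For concreteness, a parser could restate the starred names above in a table and validate each parameters block against it; a sketch under that assumption (the function is hypothetical):

```python
# Required parameter names per data type, restating the "*" fields above.
REQUIRED_FIELDS = {
    "assembly": {"staging_file_subdir_path", "assembly_name"},
    "genbank_genome": {"staging_file_subdir_path", "genome_name"},
    "gff_genome": {"fasta_file", "gff_file", "genome_name"},
    "gff_metagenome": {"fasta_file", "gff_file", "genome_name"},
    "fastq_reads_interleaved": {"fastq_fwd_staging_file_name", "name"},
    "fastq_reads_noninterleaved": {
        "fastq_fwd_staging_file_name",
        "fastq_rev_staging_file_name",
        "name",
    },
    "sra_reads": {"sra_staging_file_name", "name"},
}

def missing_required(data_type: str, parameters: dict) -> set[str]:
    """Return the required parameter names absent from a parameters block."""
    return REQUIRED_FIELDS.get(data_type, set()) - parameters.keys()
```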
A simple example of just the required values for, say, a genbank genome would be:

```json
"instructions": {
    "data_type": "genbank_genome",
    "parameters": {
        "staging_file_subdir_path": "path/to/some_genome.gbk",
        "genome_name": "some_genome"
    }
}
```
There are a couple of additional things here.
Thanks for the sketch, @briehl! I think the redundancy is just fine, and I like that you've pushed the unstructured part of the schema into a `parameters` field.
What do you think about payloads that include several objects? A while back Zach did some digging and found some use cases that had 1-2 objects and some that had significantly more. How do you feel about something like this?
"instructions": {
"protocol": "KBase narrative import",
"objects": [
{
"data_type": "genbank_genome",
"parameters": {
"staging_file_subdir_path": "path/to/some_genome.gbk",
"genome_name": "some_genome"
}
},
...
{
"data_type": "genbank_genome",
"parameters": {
"staging_file_subdir_path": "path/to/some_other_genome.gbk",
"genome_name": "some_other_genome"
}
}
]
}
(I added a `protocol` field that can be used to indicate that the creator has promised to create a compliant set of `instructions`.)
This structure implies, of course, that we allow more than one data type per payload. In principle this is nice and makes all kinds of things easy, but I'm not sure how it looks in practice.
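One way a consumer might cope with mixed payloads is to group the objects by type before building per-type import specs; a hypothetical sketch, assuming the `objects` shape above:

```python
from collections import defaultdict

# Sketch: group a mixed-type "objects" list by data_type so each group can
# feed one import spec (one spreadsheet or tab per type). Names are
# illustrative, not the staging service's actual internals.
def group_objects_by_type(instructions: dict) -> dict[str, list[dict]]:
    grouped: dict[str, list[dict]] = defaultdict(list)
    for obj in instructions.get("objects", []):
        grouped[obj["data_type"]].append(obj["parameters"])
    return dict(grouped)
```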
I think that might work? I kinda figured that each file in the resources list would have its own set of instructions, which admittedly might get weird if, for example, a GFF file referred to both the GFF and FASTA files in its instructions, and the matching FASTA file did the same.
Does this mean putting the `instructions` field in its own separate field next to `resources`?
> Does this mean putting the `instructions` field in its own separate field next to `resources`?
Yes, this was my original intent. Sorry if I wasn't clear about that! Early on, we were thinking that we would restrict file transfers to single file types for simplicity. But if it's practical to Import All The Things in one manifest, the `instructions` field is a single place that annotates all of the resources.
We're trying to make the DTS independent of any organization's internal workings, and separating `instructions` into its own field provides a simple structure that can be swapped out based on the destination of a file transfer. Also, it avoids the kind of weirdness you referred to when multiple files in the payload are associated via import instructions.
Let me know if you want to chat more about this, and do feel free to change what I've proposed in the `instructions` object.
Also, my example above doesn't take advantage of the fact that the manifest includes the `resources` array with all the relevant file paths. We should feel free to refer to the data resources in that array by name (or ID or whatever) in the `instructions` field if it helps.
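For instance (a purely hypothetical convention, assuming each resource record carries its file path in a `path` field), a parameter value could name a resource by its `id` and be resolved like this:

```python
# Hypothetical convention: a parameter value like {"resource": "<id>"} refers
# to an entry in the manifest's resources array; resolve it to a file path.
def resolve_resource_refs(manifest: dict, parameters: dict) -> dict:
    by_id = {r["id"]: r for r in manifest.get("resources", [])}
    resolved = {}
    for key, value in parameters.items():
        if isinstance(value, dict) and "resource" in value:
            resolved[key] = by_id[value["resource"]]["path"]
        else:
            resolved[key] = value
    return resolved
```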
I see! Yeah, that makes sense. So, just to summarize, the structure is (roughly):
```json
{
    "name": "manifest",
    "resources": [{
        "id": "foo",
        "name": "bar",
        ...etc
    }, ...],
    "instructions": {
        "protocol": "KBase narrative import",
        "objects": [...what you put above]
    }
}
```
Yeah, I agree that we could do some kind of referencing of data sources from within the `instructions` field, but I'm not sure of the best way to do that so that it stays easy to interpret.
Still, this makes sense, and should in fact be easier to process. I had misinterpreted the proposal as attaching the instructions to each file resource.
One thing that could be done, then, is to just put all the parameters of each data type as a list, so something like:
"instructions": {
"protocol": "KBase narrative import",
"objects": {
"genbank_genome": [
{
"staging_file_subdir_path": "path/to/some_genome.gbk",
"genome_name": "some_genome"
},
{
"staging_file_subdir_path": "path/to/some_other_genome.gbk",
"genome_name": "some_other_genome"
}
]
}
}
But I see the value in having a list of discrete objects, too, to keep them independent and to have an explicit `data_type` field.
Still, I think it's practical to import All The Things at one time, as long as the data types exist for them. In fact, the parser is set up to do just that. I'll just tweak it to pull from a single separate field, which will actually make it simpler!
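For what it's worth, the two shapes are easy to interconvert; a sketch (function name is illustrative) that normalizes the grouped form into the discrete-object list:

```python
# Normalize the grouped form ({data_type: [params, ...]}) into the
# discrete-object list form; both shapes are the ones sketched above.
def normalize_objects(objects) -> list[dict]:
    if isinstance(objects, list):
        return objects  # already discrete objects with explicit data_type
    return [
        {"data_type": data_type, "parameters": params}
        for data_type, param_list in objects.items()
        for params in param_list
    ]
```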
This is a proposal to modify the KBase file staging service and its UI to recognize and parse DTS manifests and import their contents to KBase Narratives. The CE/DTS team believes this is the simplest approach to getting data into KBase from other organizations using the DTS.
Proposed Workflow
1. A user requests a transfer via the DTS, which delivers the payload of files to the user's KBase staging area.
2. The payload includes a `manifest.json` file with file metadata and an `instructions` field conveying any information needed by the file staging service to import the files into a KBase Narrative.
3. The user selects the `manifest.json` file within the payload folder. This triggers a process within the file staging service to generate a bulk import spreadsheet appropriate for the payload, as sketched below. This process uses the `instructions` field within `manifest.json` to determine the appropriate file type for the spreadsheet, etc., and extracts the names of the files, using them to populate the generated spreadsheet, which appears in the user's staging directory, either within the DTS payload folder or at the top level.
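A sketch of the spreadsheet-generation step in step 3 (column layout and names are placeholders, not the real template format):

```python
import csv

# Sketch of step 3: turn one data type's parameter blocks into a TSV import
# spec in the user's staging directory. Column order and header layout are
# placeholders, not the actual import template format.
def write_import_spec_tsv(out_path: str, columns: list[str], rows: list[dict]):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(columns)
        for row in rows:
            writer.writerow([row.get(col, "") for col in columns])
```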
The `instructions` field in the DTS manifest is itself a JSON object that can store any required information, so its content can be tailored to any of the supported import spreadsheet templates. Down the line, we may develop something more structured if it makes sense, but we can assume for the moment that a KBase-Narrative-specific set of instructions is acceptable.

Notes From Previous Discussions
Some recent changes introduce an `instructions` field in the DTS manifest that contains a JSON object that we can use to convey information about how the payload should be treated once it reaches its destination. This issue is an exploration of a "protocol" we might use to import the contents of a payload to a KBase narrative. This import process needs information from the requesting user because the metadata associated with the files in a payload is clearly insufficient for figuring out how the content should be translated into a narrative.

To be clear: the specification of any such protocol is not part of the DTS itself; this `instructions` field is simply a feature offered by the DTS to support the creation of these protocols. But this seems like a reasonable place to record our thoughts.

Provocative Questions
- If we can establish defaults for the required fields (recorded in the `instructions` field, say), we can write a program that monitors the KBase file staging area for payloads, parses manifests, and generates import template spreadsheets with these defaults inserted (see the sketch after this list). This might be all we need to get off the ground.
- The `instructions` field in the manifest can refer to an arbitrary JSON object. In the case of a KBase payload to be imported to a narrative, we can have a JSON object with a single `type` field, whose value can be one of (say) `assembly`, `genbank`, `gff-fasta`, `fastq-interleaved`, `fastq-noninterleaved`, `sra`.
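That monitoring program could start out very small; a sketch, with the staging-area layout, polling interval, and generator function all assumed for illustration:

```python
import json
import time
from pathlib import Path

# Sketch of the monitor: poll the staging area for new DTS manifests and
# hand each one to a spreadsheet generator. Paths, the one-folder-per-payload
# layout, and the generator callable are assumptions for illustration.
def watch_staging_area(root: str, generate_spreadsheet, interval_s: int = 60):
    seen: set[Path] = set()
    while True:
        for manifest in Path(root).glob("*/manifest.json"):
            if manifest not in seen:
                seen.add(manifest)
                with manifest.open() as f:
                    generate_spreadsheet(json.load(f), manifest.parent)
        time.sleep(interval_s)
```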
Types of KBase payloads
Assembly import

Required fields for each file:
- `staging_file_subdir_path`: FASTA file path
- `assembly_name`
- `type`
- `min_contig_length`

TSV import template

Genbank genome import

Required fields for each file:
- `staging_file_subdir_path`
- `genome_name`
- `genome_type`
- `source`
- `scientific_name`
- `release`
- `genetic_code`
- `generate_ids_if_needed`
- `generate_missing_genes`

TSV import template

GFF+FASTA genome and metagenome import

Required fields for each file:
- `fasta_file`
- `gff_file`
- `genome_name`
- `source`
- `release`
- `genetic_code`
- `generate_missing_genes`

TSV import template

Interleaved FASTQ reads import

Required fields for each file:
- `fastq_fwd_staging_file_name`
- `name`
- `sequencing_tech`
- `single_genome`
- `read_orientation_outward`
- `insert_size_std_dev`
- `insert_size_mean`

TSV import template

Noninterleaved FASTQ reads import

Required fields for each file pair:
- `fastq_fwd_staging_file_name`
- `fastq_rev_staging_file_name`
- `name`
- `sequencing_tech`
- `single_genome`
- `read_orientation_outward`
- `insert_size_std_dev`
- `insert_size_mean`

TSV import template

SRA reads import

Required fields for each file:
- `sra_staging_file_name`
- `name`
- `sequencing_tech`
- `single_genome`
- `read_orientation_outward`
- `insert_size_std_dev`
- `insert_size_mean`

TSV import template