Closed adf-ncgr closed 1 year ago
From the Data Store root, here's what I see currently:
ls */*/*/* | grep -i busco | sort | uniq -c
97 busco
1 BUSCO_genome_fabales_odb10
1 BUSCO_proteins_fabales_odb10
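For anyone who wants to repeat this tally without the shell pipeline, here is a rough Python sketch. It assumes the entries of interest sit one level below each depth-4 directory, which is what the bare names in the output above suggest. The function name `count_busco_entries` is made up for illustration. Unlike the shell glob, `pathlib` also matches hidden entries, so the counts could differ slightly.

```python
from collections import Counter
from pathlib import Path


def count_busco_entries(root):
    """Tally entry names containing 'busco' (case-insensitive).

    Assumption: the entries live directly inside depth-4 directories,
    i.e. five path components below `root`.
    """
    counts = Counter()
    for entry in Path(root).glob("*/*/*/*/*"):
        if "busco" in entry.name.lower():
            counts[entry.name] += 1
    return counts


if __name__ == "__main__":
    # Print in a layout similar to `sort | uniq -c`.
    for name, n in sorted(count_busco_entries(".").items()):
        print(f"{n:7d} {name}")
```

Run from the Data Store root, this should give roughly the same per-name counts as the one-liner.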
I think I would prefer not including the database version in the directory name. (When odb11 comes along, would we retain both? Replace odb10?) For that matter, can't "genome" and "proteins" be inferred from the directory context ("genomes" and "annotations", respectively)? And I suspect that we'll probably only use one taxon level (fabales).
So, I think I'd make a modification/counter-proposal:
BUSCO/
    full_table.tsv.gz
    parameters.txt
    short_summary.json
    short_summary.txt
Sorry, apparently some of the details of my proposal were treated as markdown and not displayed; now fixed. I think we're mostly in agreement. The only reason I have to possibly retain "lineage" is if we want to (for example) have different taxonomic levels (e.g. fabales and embryophyta) from the same version of odb. I don't think they have strict superset-subset relationships, so it wouldn't be redundant. There may not be strong reasons for having more than one lineage though; for the purposes of completeness evaluation the fabales is best, but for other purposes (e.g. picking subsets of conserved genes that could be used in phylogenomic studies) higher level taxa could be of interest. But, I don't really care that much, so if you strongly prefer to leave it off until such time as it may be warranted, that's fine.
One question: I don't know what the file "parameters.txt" is; I don't see it in my BUSCO output. The json file does have the parameters used, however, so I think it's probably covered unless I'm missing something.
So far, we've avoided requiring context to know what a file is in the DS. The Ground Principle of Yuck (GPY) is that if you see the file by itself on a deserted island, you know what it is by its name. Your counter-proposal @cann0010 violates that. So I'd vote for explicit file naming, for consistency with the rest of the Datastore.
@sammyjava does that mean you think my proposal wasn't yucky enough? e.g. that instead of BUSCO_fabales_odb10/short_summary.txt we should instead have something like: glyma.Wm82.gnm4.busco.fabales_odb10.short_summary.txt (with or without allowing this and similar BUSCO files to reside in a subfolder)
Extrapolating from Sam's comment ... I would be OK with BUSCO/glyma.Wm82.gnm4.busco.fabales_odb10.short_summary.txt
Regarding "parameters.txt": I made that up. It would simply be a small file that records some information about the run. It might, in fact, be virtually identical across all our BUSCO runs.
Well I was reacting to Steven's very unyucky proposal but, yes, even yours wasn't yucky enough. But now we're in the yuck pocket with the last two comments.
OK, so a BUSCO folder to group things and full yuck naming on files within to create full disambiguation (for when we exile Sam to St. Helena). I will add it to relevant specs and close this.
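To make the agreed convention concrete, here is a minimal sketch of how the names could be built and read back. It follows the example name given earlier in the thread (`glyma.Wm82.gnm4.busco.fabales_odb10.short_summary.txt`). The function names, the `BUSCO_OUTPUTS` list, and the assumption that the lineage contains no dots are all illustrative and not part of any spec. `parameters.txt` is left out here since it was only a suggestion.

```python
from pathlib import PurePosixPath

# BUSCO output files named in the proposals above (assumed list).
BUSCO_OUTPUTS = ("full_table.tsv.gz", "short_summary.json", "short_summary.txt")


def busco_file_path(collection_prefix, lineage, output_name, folder="BUSCO"):
    """Build e.g. BUSCO/glyma.Wm82.gnm4.busco.fabales_odb10.short_summary.txt."""
    if output_name not in BUSCO_OUTPUTS:
        raise ValueError(f"unexpected BUSCO output: {output_name}")
    name = f"{collection_prefix}.busco.{lineage}.{output_name}"
    return PurePosixPath(folder) / name


def parse_busco_file_name(name):
    """Recover prefix, lineage and output from the file name alone.

    Assumption: the lineage (e.g. fabales_odb10) contains no dots.
    """
    prefix, sep, rest = name.partition(".busco.")
    if not sep:
        raise ValueError(f"not a BUSCO file name: {name}")
    lineage, _, output_name = rest.partition(".")
    return {"prefix": prefix, "lineage": lineage, "output": output_name}


if __name__ == "__main__":
    path = busco_file_path("glyma.Wm82.gnm4", "fabales_odb10", "short_summary.txt")
    print(path)
    print(parse_busco_file_name(path.name))
```

The parse step is the deserted-island test: everything needed to identify the file comes from its name, not from the folder it sits in.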
couldn't find an old issue or anywhere else this had been discussed, but in a nutshell here is my proposal: folders:
each containing files: full_table.tsv.gz short_summary.json short_summary.txt
where the results under annotations collections are run against primary proteins (and those under genomes are run against genomes, obviously); lineage would indicate which BUSCO set was used (e.g. fabales_odb10).
@cann0010 @ctcncgr