legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

RFO: BUSCO data #37

Closed adf-ncgr closed 1 year ago

adf-ncgr commented 1 year ago

I couldn't find an old issue or anywhere else this had been discussed, but in a nutshell here is my proposal. Folders:

genomes/<collection>/BUSCO_<lineage>/
annotations/<collection>/BUSCO_<lineage>/

each containing the files:

full_table.tsv.gz
short_summary.json
short_summary.txt

where the results under annotations collections are run against primary proteins (and those under genomes are run against genomes, obviously); lineage indicates which BUSCO set was used (e.g. fabales_odb10).
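As a rough illustration of the layout proposed above, the sketch below builds the expected paths for one collection. The function name `proposed_busco_paths` and the collection name `Wm82.gnm4` are hypothetical placeholders used only for this example; they are not part of any existing Data Store tooling.

```python
from pathlib import PurePosixPath

# The three BUSCO output files named in the proposal.
BUSCO_FILES = ("full_table.tsv.gz", "short_summary.json", "short_summary.txt")


def proposed_busco_paths(data_type, collection, lineage):
    """Return the file paths implied by the proposal for one collection.

    data_type  -- "genomes" or "annotations" (the two top-level folders named above)
    collection -- collection directory name (hypothetical in the example below)
    lineage    -- BUSCO lineage set, e.g. "fabales_odb10"
    """
    if data_type not in ("genomes", "annotations"):
        raise ValueError(f"unexpected data type: {data_type!r}")
    folder = PurePosixPath(data_type) / collection / f"BUSCO_{lineage}"
    return [str(folder / name) for name in BUSCO_FILES]


# Hypothetical collection name, for illustration only.
for path in proposed_busco_paths("genomes", "Wm82.gnm4", "fabales_odb10"):
    print(path)
```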

@cann0010 @ctcncgr

StevenCannon-USDA commented 1 year ago

From the Data Store root, here's what I see currently:

ls */*/*/* | grep -i busco | sort | uniq -c
  97 busco
   1 BUSCO_genome_fabales_odb10
   1 BUSCO_proteins_fabales_odb10

I think I would prefer not including the database version in the directory name. (When odb11 comes along, would we retain both? Replace odb10?) For that matter, can't "genome" and "proteins" be inferred from the directory context ("genomes" and "annotations", respectively)? And I suspect that we'll probably only use one taxon level (fabales).

So, I think I'd make a modification/counter-proposal:

  BUSCO/
    full_table.tsv.gz
    parameters.txt
    short_summary.json
    short_summary.txt

adf-ncgr commented 1 year ago

Sorry, apparently some of the details of my proposal were treated as markdown and not displayed; now fixed. I think we're mostly in agreement. The only reason I have to possibly retain "lineage" is if we want to (for example) have different taxonomic levels (e.g. fabales and embryophyta) from the same version of odb. I don't think they have strict superset-subset relationships, so it wouldn't be redundant. There may not be strong reasons for having more than one lineage, though: for the purposes of completeness evaluation, fabales is best, but for other purposes (e.g. picking subsets of conserved genes that could be used in phylogenomic studies) higher-level taxa could be of interest. But I don't really care that much, so if you strongly prefer to leave it off until such time as it may be warranted, that's fine.

One question: I don't know what the file "parameters.txt" is; I don't see it in my BUSCO output. The JSON file does record the parameters used, however, so I think it's probably covered unless I'm missing something.

sammyjava commented 1 year ago

So far, we've avoided requiring context to know what a file is in the DS. The Ground Principle of Yuck (GPY) is that if you see the file by itself on a deserted island, you know what it is by its name. Your counter-proposal @cann0010 violates that. So I'd vote for explicit file naming, for consistency with the rest of the Datastore.

adf-ncgr commented 1 year ago

@sammyjava does that mean you think my proposal wasn't yucky enough? That is, instead of BUSCO_fabales_odb10/short_summary.txt we should have something like glyma.Wm82.gnm4.busco.fabales_odb10.short_summary.txt (with or without allowing this and similar BUSCO files to reside in a subfolder)?
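To make the fully qualified form concrete, here is a minimal sketch that builds such a name and splits it back into its parts. The helper names `busco_filename` and `parse_busco_filename` are hypothetical, and the parser assumes the lineage itself contains no dots (true of fabales_odb10, but an assumption in general).

```python
def busco_filename(prefix, lineage, raw_name):
    """Build a self-describing name: <prefix>.busco.<lineage>.<raw BUSCO file name>."""
    return f"{prefix}.busco.{lineage}.{raw_name}"


def parse_busco_filename(filename):
    """Split a self-describing name back into prefix, lineage, and raw file name.

    Assumes the lineage has no '.' in it, so the first dot after '.busco.'
    ends the lineage field.
    """
    prefix, marker, rest = filename.partition(".busco.")
    if not marker:
        raise ValueError(f"no '.busco.' component in {filename!r}")
    lineage, dot, raw_name = rest.partition(".")
    if not dot or not raw_name:
        raise ValueError(f"missing lineage or file name in {filename!r}")
    return {"prefix": prefix, "lineage": lineage, "raw_name": raw_name}


name = busco_filename("glyma.Wm82.gnm4", "fabales_odb10", "short_summary.txt")
print(name)
print(parse_busco_filename(name))
```

The round trip is the point: everything needed to identify the file (assembly prefix, analysis type, lineage, and output kind) is recoverable from the name alone, with no directory context.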

StevenCannon-USDA commented 1 year ago

Extrapolating from Sam's comment ... I would be OK with BUSCO/glyma.Wm82.gnm4.busco.fabales_odb10.short_summary.txt

Regarding "parameters.txt": I made that up. It would simply be a small file that records some information about the run. It might, in fact, be virtually identical across all our BUSCO runs.

sammyjava commented 1 year ago

Well, I was reacting to Steven's very unyucky proposal, but yes, even yours wasn't yucky enough. But now we're in the yuck pocket with the last two comments.

adf-ncgr commented 1 year ago

OK, so a BUSCO folder to group things and full yuck naming on files within to create full disambiguation (for when we exile Sam to St. Helena). I will add it to relevant specs and close this.
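A small sketch of what that resolution could look like in practice: mapping raw BUSCO output names to fully qualified names inside a BUSCO folder. The function `busco_rename_plan`, the collection path `genomes/Wm82.gnm4`, and the `<prefix>.busco.<lineage>.<file>` pattern (extrapolated from the example name earlier in the thread) are illustrative assumptions, not the wording of the final spec.

```python
from pathlib import PurePosixPath

# Raw output names mentioned in this thread.
RAW_BUSCO_FILES = ("full_table.tsv.gz", "short_summary.json", "short_summary.txt")


def busco_rename_plan(collection_dir, prefix, lineage):
    """Map each raw BUSCO output name to its target path under <collection_dir>/BUSCO/.

    Nothing is touched on disk; the function only returns the planned mapping.
    """
    target_dir = PurePosixPath(collection_dir) / "BUSCO"
    return {
        raw: str(target_dir / f"{prefix}.busco.{lineage}.{raw}")
        for raw in RAW_BUSCO_FILES
    }


# Hypothetical collection directory, for illustration only.
plan = busco_rename_plan("genomes/Wm82.gnm4", "glyma.Wm82.gnm4", "fabales_odb10")
for raw, target in plan.items():
    print(f"{raw} -> {target}")
```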