Closed sammyjava closed 1 year ago
An alternative (which I've also been doing) is to have a list of prefixes. One can get rid of the "supercontig prefix match overrides chromosome prefix match" rule with the following:
chromosome_prefix: Chr1,Chr2,Chr3,Chr4,Chr5,Chr6,Chr7,Chr8
supercontig_prefix: Chr0
This is probably a better way of doing it since no precedence rule is required. There are very few cases where we need to explicitly list the chromosome prefixes because they overlap supercontig prefixes.
Or, maybe we want to list complete prefixes rather than matching substrings:
chromosome_prefix: MtrunA17Chr1,MtrunA17Chr2,MtrunA17Chr3,MtrunA17Chr4,MtrunA17Chr5,MtrunA17Chr6,MtrunA17Chr7,MtrunA17Chr8
supercontig_prefix: MtrunA17Chr0c
Another possible option to consider is adding a column to the seqid_map.tsv file that tells us for each entry listed what we want its SO type to be. This would be similar (in principle) to what NCBI does in their assembly report files, e.g.: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/473/485/GCF_003473485.1_MtrunA17r5.0-ANR/GCF_003473485.1_MtrunA17r5.0-ANR_assembly_report.txt note that in that example, they give the chromosome and mitochondrial sequences their own special types (as is right and proper). I'm not proposing we follow their structure slavishly, but having a metadata lookup table in principle seems like a reasonable way to handle such things .
PS. I feel your pain on sticky wicky scaffold names like Gm06_scaffold_99
Another possible option to consider is adding a column to the seqid_map.tsv file that tells us for each entry listed what we want its SO type to be...
Fine with me, I'd just like to know which sequences are chromosomes and which are supercontigs from the collection. In your case I'd have to make validation require that seqid_map.tsv.gz be present, it's not required at the moment.
I think I favor "adding a column to the seqid_map.tsv file that tells us for each entry listed what we want its SO type to be" We would probably want to support/anticipate several terms: contig, scaffold (==supercontig), chromosome http://www.sequenceontology.org/browser/current_svn/term/SO:0000148 http://www.sequenceontology.org/browser/current_svn/term/SO:0000340 I would include "chromosome_prefix" and "scaffold_prefix" in the ds_souschef yaml; that information would then propagate into the READMEs and the seqid_map.tsv files. The genome seqid_map files would then look like:
Chr19 glyma.JiDouNo_17.gnm1.Chr19 chromosome
Chr20 glyma.JiDouNo_17.gnm1.Chr20 chromosome
HiC_scaffold_1 glyma.JiDouNo_17.gnm1.HiC_scaffold_1 scaffold
HiC_scaffold_10 glyma.JiDouNo_17.gnm1.HiC_scaffold_10 scaffold
Also: I don't think we should be above tweaking some scaffold and chromosome prefixes in our Data Store version. For example, for genomes and annotations that run from, say, chr0..chr12, I have been mapping those to chr1..chr13 (reflected in the seqid_map.tsv).
Well, you two have converted my proposal, which would take me about an hour to implement, to a Much Bigger Job, since we only have 31 genome collections containing seqid_map.tsv files and 136 that don't. In other words "add a new column to files that don't actually exist."
So I'm gonna just keep on doing what I'm doing and let you two take it over if you want. I'm definitely not offering to build 136 seqid_map.tsv files and update the lines in the other 31. But I look forward to when the Datastore DOES contain all of those files so I can switch over to using them. :)
Is it possible to distinguish chromosomes vs. scaffolds for all genome collections by defining an extended regular expression for each in the collection's README?
@cann0010 I am on board with this, but the last example you gave makes me a little nervous, given that "chr0" typically has a "grabbag" semantics that "chr13" would not have. But are you saying that there are cases where people have actually applied chr0 to a bonafide chr that everyone else is calling chr1 (or whatever)? That sounds hilariously probable.
@sammyjava sounds fine, we will take it from here and backfill as needed until we tell you this is a validate-able required file for the collection @nathanweeks I think that's more or less what Sam was originally proposing (and what Steven intends to put into the souschef so that it comes out as gospel truth in a table)
Is it possible to distinguish chromosomes vs. scaffolds for all genome collections by defining an extended regular expression for each in the collection's README?
It doesn't have to be as fancy as a regex, just listing the prefixes as I proposed does the trick. But the more exact listing in seqid_map.tsv is certainly preferable, and simpler, when it exists which @adf-ncgr has offered to make happen. There's no massive urgency on my end, I already have all the prefixes listed for my purposes.
@adf-ncgr - "there are cases where people have actually applied chr0 to a bonafide chr that everyone else is calling chr1" Yep. One place I saw it was in the new (private) Quillaja assembly.
A variant problem: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9052405/ Their "chromosome" assemblies are named "scaf001" through "scaf014". I have prepared this assembly and annotation, though it is still in "private."
@sammyjava - I like your solution (prefixes indicated in the README) -- and agree that it should be sufficient. It's just that I might as well get those two fields into the upstream yaml for preparing the assemblies and annotations; and then might as well get that information into the seqid_map.tsv files, at least prospectively.
I've updated datastore-specifications/Genus/species/genomes to include the third column in seqid_map.tsv (on its own page).
As for the README prefixes, it sounds like no one objects (this being an RFO). If that is truly the case, I'll add them to the spec as genome collection-specific README fields and to all the READMEs. Going once...going twice...
If a simple prefix is enough to distinguish all chromosomes vs. scaffolds (and "sticky-wicky scaffold names like Gm06_scaffold_99" are purely hypothetical) then that's fine. One use case I had in mind for using a regex was to aid in the creation of default JBrowse URLs without having to read an additional data file that doesn't exist in the datastore-metadata repo (e.g., the regex /glyma.Wm82_ISU01.gnm2.Gm[0-2][0-9]/ implies that the first chromosome is "glyma.Wm82_ISU01.gnm2.Gm01" , rather than "glyma.Wm82_ISU01.gnm2.Gm1"). But that's admittedly quite niche...
FWIW, the "Gm06_scaffold_99" was not purely hypothetical; this is how some scaffolds are named in the Phytozome version of what we call glyma.Wm82.gnm4 (well, I made up that particular name but "Gm13_scaffold_21" is a real one). Looks like someone recorded their displeasure with this scheme in the CHANGES file for this dataset: perl -pi -e 's/^>glyma.Wm82.gnm4.Gm\d\dscaffold/>glyma.Wm82.gnm4.scaffold_/' glyma.Wm82.gnm4.4PTR.genome_main.fna
this was in pre-ds_souschef world, although the file glyma.Wm82.gnm4.4PTR.info_name_hash.txt tells us: glyma.Wm82.gnm4.scaff_21 Gm13_scaffold_21 which is itself a little odd as the final given name was actually glyma.Wm82.gnm4.scaffold_21
In any case, I believe we're converging on a solution to "what they called it -> what we call it" that we can all abide by (or at least all agree to abide by). But really I think we eventually should just let ChatGPT rename chromosomes for us (as suggested by your idea to use a regex as a generative grammar)
Additional thoughts, after looking through all 136 genome_main.fna.gz.fai files: (1) We want to distinguish between "chromosomes" and "other stuff." The other stuff consists of various things: supercontigs, contigs, mitochondrial and chloroplast genomes, (inappropriately) merged leftover sequence. So, we may be better-off just recording patterns that correspond with chromosomes - i.e. only adding "chromosome_pattern: " (2) A simple prefix will be sufficient in most cases, but not all. A regex may be required. Here are two pathological cases (where a prefix isn't sufficient):
lupal.Amiga.gnm1.Lalb_Chr24
lupal.Amiga.gnm1.Lalb_Chr25
lupal.Amiga.gnm1.Lalb_Chr00c01
lupal.Amiga.gnm1.Lalb_Chr00c02
vitvi.PN40024.gnm2.chr8
vitvi.PN40024.gnm2.chr9
vitvi.PN40024.gnm2.chr9_random
vitvi.PN40024.gnm2.chrUn
That's why I allow lists of prefixes. In your first case, I list it like this:
chromosome.lupal.Amiga.gnm1.F4NR=Lalb_Chr01,Lalb_Chr02,Lalb_Chr03,Lalb_Chr04,Lalb_Chr05,Lalb_Chr06,Lalb_Chr07,Lalb_Chr08,Lalb_Chr09,Lalb_Chr1,Lalb_Chr2
supercontig.lupal.Amiga.gnm1.F4NR=Lalb_Chr00c
I don't load vitvi so I'm not going to worry about it. Those names are just silly.
I'm going to close this since I'm moving forward with it for mine-loading purposes. We can update the spec down the line when someone has an actual application that needs a different implementation. I'll update the README spec for GENUS/species to list the new properties. @cann0010 : I load chromosomes and supercontigs into the mines. Many assemblies ONLY have supercontigs, so we should identify those. I'm doing the work so I'll stick with both. :)
While I fully endorse closing the issue, I'll note that I think the list-o-prefixes implementation does not address cases like "Gm06_scaffold_99" which is fairly similar to the vitvi silliness; as I observed earlier, this has in some manner been resolved by judicious renaming policies, but if we are giving ourselves the latitude to judiciously rename things we could simply go all the way and just fully standardize chromosome nomenclature (as Ensembl does, I think)? In part, this is what having the full monty seq_id_map files is supposed to let us feel OK about. However, if we think it's not just chromosomes vs not-chromosomes (a not-SO term), then I still think an explicit tabulation will be preferable in the long run. But I should probably just be happy to be off the hook for making said explicit tabulations...
I think we agreed that you two @adf-ncgr and @cann0010 would take charge of implementing seqid_map files which contain the SO classification of the assembly sequences across the board. I already know that the prefix hack works, because I use it to load all of the mines. I'm just putting that data into the READMEs, rather than in my secret file. When the seqid_map files are in every genome collection, and are required, then we can drop the prefixes from the READMEs.
OK, I'll make a different datastore issue for actually getting this done, then.
I think we can agree that most LIS genome assemblies contain large pseudomolecules we call "chromosomes" and many smaller contigs or scaffolds or whatever you like to call them. But there is nothing in the genome collections that indicates which sequences are which. Nevertheless, I load Chromosomes and Supercontig mine objects from those collections. Therefore, over the years, I have maintained a mine-loading helper file which provides prefixes (which I glean by looking at zgrep ">" on the FASTAs) to identify the chromosomes and, in cases where there is overlap, to identify the supercontigs (see below for a case where this is critical).
This task catches me constantly - in fact, I just blew away a GlycineMine load that had been running for three days because it wound up loading glyma.Wm82.gnm4 scaffolds as chromosomes.
Here's a simple, typical example:
I suggest adding to that collection's README the following:
(An added functionality from this is that if a sequence does NOT match a chromosome or supercontig prefix, it doesn't get loaded into the mine. Maybe we don't want to load supercontigs for some genomes.)
A stickier wicket comes up when you have overlap between chromosome and supercontig names, such as this case:
In this case, I suggest the README contain the following:
Then I will have the supercontig match have precedence over the chromosome match during loading (which is what I currently do). (Note in this case the mitochondrial and chloroplast sequences match neither and would not be loaded. I suppose we could add directives for those as well if we want to load them.)
I'll take care of updating all the READMEs if we go with this. It makes more sense to me to have them in the genome collections rather than a local mine-loading file that I have to maintain separately. I also think it adds information to the genome collections.