legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore
2 stars 0 forks source link

RFO: add chromosome_prefix / supercontig_prefix to genome READMEs #38

Closed sammyjava closed 1 year ago

sammyjava commented 1 year ago

I think we can agree that most LIS genome assemblies contain large pseudomolecules we call "chromosomes" and many smaller contigs or scaffolds or whatever you like to call them. But there is nothing in the genome collections that indicates which sequences are which. Nevertheless, I load Chromosomes and Supercontig mine objects from those collections. Therefore, over the years, I have maintained a mine-loading helper file which provides prefixes (which I glean by looking at zgrep ">" on the FASTAs) to identify the chromosomes and, in cases where there is overlap, to identify the supercontigs (see below for a case where this is critical).

This task catches me constantly - in fact, I just blew away a GlycineMine load that had been running for three days because it wound up loading glyma.Wm82.gnm4 scaffolds as chromosomes.

Here's a simple, typical example:

>cicre.Besev079.gnm1.chr1
...
>cicre.Besev079.gnm1.chr8
>cicre.Besev079.gnm1.tig00002053
>cicre.Besev079.gnm1.tig00002148
...

I suggest adding to that collection's README the following:

chromosome_prefix: chr
supercontig_prefix: tig

(An added functionality from this is that if a sequence does NOT match a chromosome or supercontig prefix, it doesn't get loaded into the mine. Maybe we don't want to load supercontigs for some genomes.)

A stickier wicket comes up when you have overlap between chromosome and supercontig names, such as this case:

>medtr.A17.gnm5.MtrunA17Chr0c01 len=1205179 acc=MtrunA17Chr0c01
>medtr.A17.gnm5.MtrunA17Chr0c02 len=597968 acc=MtrunA17Chr0c02
...
>medtr.A17.gnm5.MtrunA17Chr0c31 len=23291 acc=MtrunA17Chr0c31
>medtr.A17.gnm5.MtrunA17Chr0c32 len=22751 acc=MtrunA17Chr0c32
>medtr.A17.gnm5.MtrunA17Chr1 len=56706830 acc=MtrunA17Chr1
>medtr.A17.gnm5.MtrunA17Chr2 len=51972579 acc=MtrunA17Chr2
...
>medtr.A17.gnm5.MtrunA17Chr7 len=56236587 acc=MtrunA17Chr7
>medtr.A17.gnm5.MtrunA17Chr8 len=49719271 acc=MtrunA17Chr8
>medtr.A17.gnm5.MtrunA17CP len=124033 acc=NC_003119.6
>medtr.A17.gnm5.MtrunA17MT len=271618 acc=KT971339.1

In this case, I suggest the README contain the following:

chromosome_prefix: Chr
supercontig_prefix: Chr0

Then I will have the supercontig match have precedence over the chromosome match during loading (which is what I currently do). (Note in this case the mitochondrial and chloroplast sequences match neither and would not be loaded. I suppose we could add directives for those as well if we want to load them.)

I'll take care of updating all the READMEs if we go with this. It makes more sense to me to have them in the genome collections rather than a local mine-loading file that I have to maintain separately. I also think it adds information to the genome collections.

sammyjava commented 1 year ago

An alternative (which I've also been doing) is to have a list of prefixes. One can get rid of the "supercontig prefix match overrides chromosome prefix match" rule with the following:

chromosome_prefix: Chr1,Chr2,Chr3,Chr4,Chr5,Chr6,Chr7,Chr8
supercontig_prefix: Chr0

This is probably a better way of doing it since no precedence rule is required. There are very few cases where we need to explicitly list the chromosome prefixes because they overlap supercontig prefixes.

sammyjava commented 1 year ago

Or, maybe we want to list complete prefixes rather than matching substrings:

chromosome_prefix: MtrunA17Chr1,MtrunA17Chr2,MtrunA17Chr3,MtrunA17Chr4,MtrunA17Chr5,MtrunA17Chr6,MtrunA17Chr7,MtrunA17Chr8
supercontig_prefix: MtrunA17Chr0c
adf-ncgr commented 1 year ago

Another possible option to consider is adding a column to the seqid_map.tsv file that tells us for each entry listed what we want its SO type to be. This would be similar (in principle) to what NCBI does in their assembly report files, e.g.: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/473/485/GCF_003473485.1_MtrunA17r5.0-ANR/GCF_003473485.1_MtrunA17r5.0-ANR_assembly_report.txt note that in that example, they give the chromosome and mitochondrial sequences their own special types (as is right and proper). I'm not proposing we follow their structure slavishly, but having a metadata lookup table in principle seems like a reasonable way to handle such things .

adf-ncgr commented 1 year ago

PS. I feel your pain on sticky wicky scaffold names like Gm06_scaffold_99

sammyjava commented 1 year ago

Another possible option to consider is adding a column to the seqid_map.tsv file that tells us for each entry listed what we want its SO type to be...

Fine with me, I'd just like to know which sequences are chromosomes and which are supercontigs from the collection. In your case I'd have to make validation require that seqid_map.tsv.gz be present, it's not required at the moment.

StevenCannon-USDA commented 1 year ago

I think I favor "adding a column to the seqid_map.tsv file that tells us for each entry listed what we want its SO type to be" We would probably want to support/anticipate several terms: contig, scaffold (==supercontig), chromosome http://www.sequenceontology.org/browser/current_svn/term/SO:0000148 http://www.sequenceontology.org/browser/current_svn/term/SO:0000340 I would include "chromosome_prefix" and "scaffold_prefix" in the ds_souschef yaml; that information would then propagate into the READMEs and the seqid_map.tsv files. The genome seqid_map files would then look like:

Chr19   glyma.JiDouNo_17.gnm1.Chr19 chromosome
Chr20   glyma.JiDouNo_17.gnm1.Chr20 chromosome
HiC_scaffold_1  glyma.JiDouNo_17.gnm1.HiC_scaffold_1    scaffold
HiC_scaffold_10 glyma.JiDouNo_17.gnm1.HiC_scaffold_10   scaffold

Also: I don't think we should be above tweaking some scaffold and chromosome prefixes in our Data Store version. For example, for genomes and annotations that run from, say, chr0..chr12, I have been mapping those to chr1..chr13 (reflected in the seqid_map.tsv).

sammyjava commented 1 year ago

Well, you two have converted my proposal, which would take me about an hour to implement, to a Much Bigger Job, since we only have 31 genome collections containing seqid_map.tsv files and 136 that don't. In other words "add a new column to files that don't actually exist."

So I'm gonna just keep on doing what I'm doing and let you two take it over if you want. I'm definitely not offering to build 136 seqid_map.tsv files and update the lines in the other 31. But I look forward to when the Datastore DOES contain all of those files so I can switch over to using them. :)

nathanweeks commented 1 year ago

Is it possible to distinguish chromosomes vs. scaffolds for all genome collections by defining an extended regular expression for each in the collection's README?

adf-ncgr commented 1 year ago

@cann0010 I am on board with this, but the last example you gave makes me a little nervous, given that "chr0" typically has a "grabbag" semantics that "chr13" would not have. But are you saying that there are cases where people have actually applied chr0 to a bonafide chr that everyone else is calling chr1 (or whatever)? That sounds hilariously probable.

@sammyjava sounds fine, we will take it from here and backfill as needed until we tell you this is a validate-able required file for the collection @nathanweeks I think that's more or less what Sam was originally proposing (and what Steven intends to put into the souschef so that it comes out as gospel truth in a table)

sammyjava commented 1 year ago

Is it possible to distinguish chromosomes vs. scaffolds for all genome collections by defining an extended regular expression for each in the collection's README?

It doesn't have to be as fancy as a regex, just listing the prefixes as I proposed does the trick. But the more exact listing in seqid_map.tsv is certainly preferable, and simpler, when it exists which @adf-ncgr has offered to make happen. There's no massive urgency on my end, I already have all the prefixes listed for my purposes.

StevenCannon-USDA commented 1 year ago

@adf-ncgr - "there are cases where people have actually applied chr0 to a bonafide chr that everyone else is calling chr1" Yep. One place I saw it was in the new (private) Quillaja assembly.

A variant problem: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9052405/ Their "chromosome" assemblies are named "scaf001" through "scaf014". I have prepared this assembly and annotation, though it is still in "private."

@sammyjava - I like your solution (prefixes indicated in the README) -- and agree that it should be sufficient. It's just that I might as well get those two fields into the upstream yaml for preparing the assemblies and annotations; and then might as well get that information into the seqid_map.tsv files, at least prospectively.

sammyjava commented 1 year ago

I've updated datastore-specifications/Genus/species/genomes to include the third column in seqid_map.tsv (on its own page).

As for the README prefixes, it sounds like no one objects (this being an RFO). If that is truly the case, I'll add them to the spec as genome collection-specific README fields and to all the READMEs. Going once...going twice...

nathanweeks commented 1 year ago

If a simple prefix is enough to distinguish all chromosomes vs. scaffolds (and "sticky-wicky scaffold names like Gm06_scaffold_99" are purely hypothetical) then that's fine. One use case I had in mind for using a regex was to aid in the creation of default JBrowse URLs without having to read an additional data file that doesn't exist in the datastore-metadata repo (e.g., the regex /glyma.Wm82_ISU01.gnm2.Gm[0-2][0-9]/ implies that the first chromosome is "glyma.Wm82_ISU01.gnm2.Gm01" , rather than "glyma.Wm82_ISU01.gnm2.Gm1"). But that's admittedly quite niche...

adf-ncgr commented 1 year ago

FWIW, the "Gm06_scaffold_99" was not purely hypothetical; this is how some scaffolds are named in the Phytozome version of what we call glyma.Wm82.gnm4 (well, I made up that particular name but "Gm13_scaffold_21" is a real one). Looks like someone recorded their displeasure with this scheme in the CHANGES file for this dataset: perl -pi -e 's/^>glyma.Wm82.gnm4.Gm\d\dscaffold/>glyma.Wm82.gnm4.scaffold_/' glyma.Wm82.gnm4.4PTR.genome_main.fna

this was in pre-ds_souschef world, although the file glyma.Wm82.gnm4.4PTR.info_name_hash.txt tells us: glyma.Wm82.gnm4.scaff_21 Gm13_scaffold_21 which is itself a little odd as the final given name was actually glyma.Wm82.gnm4.scaffold_21

In any case, I believe we're converging on a solution to "what they called it -> what we call it" that we can all abide by (or at least all agree to abide by). But really I think we eventually should just let ChatGPT rename chromosomes for us (as suggested by your idea to use a regex as a generative grammar)

StevenCannon-USDA commented 1 year ago

Additional thoughts, after looking through all 136 genome_main.fna.gz.fai files: (1) We want to distinguish between "chromosomes" and "other stuff." The other stuff consists of various things: supercontigs, contigs, mitochondrial and chloroplast genomes, (inappropriately) merged leftover sequence. So, we may be better-off just recording patterns that correspond with chromosomes - i.e. only adding "chromosome_pattern: " (2) A simple prefix will be sufficient in most cases, but not all. A regex may be required. Here are two pathological cases (where a prefix isn't sufficient):

lupal.Amiga.gnm1.Lalb_Chr24
lupal.Amiga.gnm1.Lalb_Chr25
lupal.Amiga.gnm1.Lalb_Chr00c01
lupal.Amiga.gnm1.Lalb_Chr00c02

vitvi.PN40024.gnm2.chr8
vitvi.PN40024.gnm2.chr9
vitvi.PN40024.gnm2.chr9_random
vitvi.PN40024.gnm2.chrUn
sammyjava commented 1 year ago

That's why I allow lists of prefixes. In your first case, I list it like this:

chromosome.lupal.Amiga.gnm1.F4NR=Lalb_Chr01,Lalb_Chr02,Lalb_Chr03,Lalb_Chr04,Lalb_Chr05,Lalb_Chr06,Lalb_Chr07,Lalb_Chr08,Lalb_Chr09,Lalb_Chr1,Lalb_Chr2                                                            
supercontig.lupal.Amiga.gnm1.F4NR=Lalb_Chr00c

I don't load vitvi so I'm not going to worry about it. Those names are just silly.

sammyjava commented 1 year ago

I'm going to close this since I'm moving forward with it for mine-loading purposes. We can update the spec down the line when someone has an actual application that needs a different implementation. I'll update the README spec for GENUS/species to list the new properties. @cann0010 : I load chromosomes and supercontigs into the mines. Many assemblies ONLY have supercontigs, so we should identify those. I'm doing the work so I'll stick with both. :)

adf-ncgr commented 1 year ago

While I fully endorse closing the issue, I'll note that I think the list-o-prefixes implementation does not address cases like "Gm06_scaffold_99" which is fairly similar to the vitvi silliness; as I observed earlier, this has in some manner been resolved by judicious renaming policies, but if we are giving ourselves the latitude to judiciously rename things we could simply go all the way and just fully standardize chromosome nomenclature (as Ensembl does, I think)? In part, this is what having the full monty seq_id_map files is supposed to let us feel OK about. However, if we think it's not just chromosomes vs not-chromosomes (a not-SO term), then I still think an explicit tabulation will be preferable in the long run. But I should probably just be happy to be off the hook for making said explicit tabulations...

sammyjava commented 1 year ago

I think we agreed that you two @adf-ncgr and @cann0010 would take charge of implementing seqid_map files which contain the SO classification of the assembly sequences across the board. I already know that the prefix hack works, because I use it to load all of the mines. I'm just putting that data into the READMEs, rather than in my secret file. When the seqid_map files are in every genome collection, and are required, then we can drop the prefixes from the READMEs.

adf-ncgr commented 1 year ago

OK, I'll make a different datastore issue for actually getting this done, then.