RFO: naming and processing haplotype-resolved genomes and annotation sets

legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

2 stars 0 forks source link

RFO: naming and processing haplotype-resolved genomes and annotation sets #51

Closed StevenCannon-USDA closed 4 months ago

StevenCannon-USDA commented 5 months ago

Background

Until recently, a plant genome sequence could be assumed to consist of sequences corresponding to 1n chromosomes, plus some remaining unplaced scaffolds, and maybe also plastid genomes.

Sequencing technology has matured to the point that some assemblies are being generated that include two haplotypes for a nuclear genome -- that is, for a genome with 1n chromosomes, both chromatids per chromosome are sequenced, and 2n chromosomes are reported. How should these be represented in the Data Store?

There are two general patterns that seem reasonable:

Combine all 2n sequenced chromosomes into a single assembly file, and provide standard annotation files that correspond. Thus, a soybean assembly would have 40 sequences in genome_main.fna, and the gene_models_main.gff3 file would contain gene models for all chromosomes. Within the genome_main.fna file, distinguish the chromosomes by haplotype, e.g. Chr01_hap1 and Chr01_hap2.
Separate the assembly and annotations by haplotype. Thus, a soybean assembly would have 20 sequences in the hap1 genome_main.fna file and 20 sequences in the hap2 genome_main.fna file; and likewise, there would be a hap1 gene_models_main.gff3 and a hap2 gene_models_main.gff3 file.

Proposed:

Use pattern 2 (Separate the assembly and annotations by haplotype) for diploid or diploidized allotetraploid genomes (cases which until recently have been represented with 1n chromosome sequences). An exception is for autopolyploid genomes, where all haplotypes will generally be analyzed together, e.g. alfalfa; in that case use pattern 1

Rationale for using option 2 for most cases:

Many applications and formats presume 1n representation -- for example, VCF, with reference and alternate alleles per SNP location. Also, at least Phytozome is representing haploid-resolved genomes in two sets (similar to option 2).

Case 1:

Autopolyploid genomes in which all chromosome copies are sequenced and presumed to be analyzed together: Represent such assemblies with a single collection per genome and annotation set, and with a single genome_main file that contains all chromosomes. In the case of alfalfa (2n = 4x = 32), this could be handled as e.g. Chr1_hap1, Chr1_hap2, Chr1_hap3, Chr1_hap4 or some similar pattern; for example, in the case of XinJiangDaYe, the chromosomes are named e.g. Chr1.1, Chr1.2, Chr1.3, Chr1.4 Directory and file naming in such cases will follow the typical pattern for genomes and assemblies, e.g.

  Medicago/sativa/genomes/XinJiangDaYe.gnm1.12MR/
  Medicago/sativa/annotations/XinJiangDaYe.gnm1.ann1.RKB9/

Case 2:

Diploid or diploidized allotetraploid genomes, e.g. Phaseolus, Glycine, or Arachis: Represent such assemblies with two sets each of genome and annotation collections -- one for each haplotype. In the case of Cercis canadensis, for example (2n = 2x = 18), this could be handled as e.g.

  annotations/ISC494698.gnm1hap1.ann1.G7XW/
  genomes/ISC494698.gnm1hap1.8Q19/

  annotations/ISC494698.gnm1hap2.ann1.WXZF/
  genomes/ISC494698.gnm1hap2.G6BY/

Some questions / points of debate:

Adding a string to distinguish the chromatid lengthens the directory/file-name string, e.g. gnm1 -> gnm1hap1. Parsers may need to be adjusted. Phytozome has been inconsistent here (or their practice is evolving). For Cercis, they left hap1 without designation, and labeled only hap2 as such (Ccanadensis and CcanadensisHap2); whereas for Chamaecrista, Cfasciculatavar_ISC494698HAP1 and Cfasciculatavar_ISC494698HAP2.
Where to add the haplotype designator? Phytozome is adding alternately (without consistency yet) it to the cultivar or species name, e.g. Cfasciculatavar_ISC494698HAP2 and CcanadensisHAP2. Arguably, it should instead be a modifier on the genome version -- since the species and cultivar are the same.

StevenCannon-USDA commented 5 months ago

Flagging a few people (but anyone please feel free to join in if you have an opinion) - @adf-ncgr @ctcncgr @sdash-github @alancleary @maxglycine @jd-campbell

maxglycine commented 5 months ago

Would the Mines data model have to be changed in any way? Would there need to be changes to the loader scripts. Would this name lengthening impact the other tools (GCV, CoNeKT, etc.) What would happen to the general search functions? Would we have to include "Haplotype" when performing a search?

StevenCannon-USDA commented 5 months ago

@maxglycine Good questions. My inclination is to generally just load hap1 and not touch hap2 ... which might be an argument for leaving the first haplotype unlabeled -- in which case nothing at all would change for the labeling, searching, or display. We would just have some extra hap2 collections in the Data Store. But if we do apply "hap" labels to both sets, then there may be impacts to the things that you mention.

adf-ncgr commented 5 months ago

@StevenCannon-USDA I don't think either option actually necessitates changes to the mine data model or loaders. This contention is based on my understanding of option 2, that you are simply extending the genome version string to include an indicator of haplotype, not creating a separate "field" in the full yuck. Option 1 just puts the haplotype info outside of the full yuck portion of the identifiers, just as we do for isoform distinctions among transcripts.

The question about the gene search is kind of interesting- if we make two separate collections with genome versions reflecting which "hap" is being represented in the genome version string (option 2) then the user could choose one or the other in the dropdown but not both (though I guess we could implement multi-selection for the dropdown). I don't think this is really a limitation, but led me to a thought experiment that in the pangene translation service, modeling the haplotypes as separate data collections (option 2) would support mapping hap1 onto hap2 without making any other changes to the way we've designed that interface, whereas option 1 would probably not give a simple way of effecting the same thing.

I will say that I'm somewhat opposed to having option 1 applied in some cases and option 2 in others; at least, if we are going to use option 2 I'd say we ought to use it for all cases where haplotypic counterparts have actually been labeled; ie we would retrofit Medicago/sativa/genomes/XinJiangDaYe.gnm1.12MR/ to use option 2 but not Medicago/sativa/genomes/RegenSY27x.gnm0_9.2VSM since in the former we have Chr1.1 Chr1.2 etc but in the latter we just have ptg000001 ptg0000002 etc.

Another option that might be considered a hybrid of the two would be to have a single collection but use genome_main and gene_model_main to represent a "primary haplotype" just as we have primary transcript files, etc. and another file that contains all haplotypes together. Just as with annotations that don't provide multiple isoforms, we would have cases where the parallel "all together" file does not exist. Not sure the analogy to multiple isoforms is a great one, but there are some parallels.

StevenCannon-USDA commented 5 months ago

@adf-ncgr - here is a rif on your hybrid model. This seeks to make both haploid assemblies available, but unobtrusively -- leaving the "primary" haplotype generally without special labeling. A diploid/2n version can be provided when needed, e.g. for alfalfa. (Note: CADL_HM342 is a hypothetical new assembly, gnm2, to illustrate a diploid alfalfa.)

For assemblies in which both haploid chromosome sets have been sequenced, provide the primary (hap1) set under "gnmN" and the secondary (hap2) under "gnmN_hap2".

An exception to the practice of labeling only haplotype 2 with e.g. gnm1_h2 and leaving haplotype 1 unlabeled (e.g. gnm1) is in unusual cases in which the haploid sets will be treated as co-equal and analyzed together - as in the case of tetraploid alfalfa.

Optionally, also provide genome and annotation collections in which all chromosomes are merged. Label these as "gnmN_2n" to indicate the 2n-ploidy state. Again, alflafa is an example.

Both haploid assemblies available:

Cercis/canadensis/genomes/cerca.ISC453364.gnm3.KEY4               hap1 implicitly
Cercis/canadensis/genomes/cerca.ISC453364.gnm3_h2.KEY4            hap2 
Cercis/canadensis/annotations/cerca.ISC453364.gnm3.ann1.KEY4      hap1 implicitly
Cercis/canadensis/annotations/cerca.ISC453364.gnm3_h2.ann1.KEY4   hap2 

Chamaecrista/fasciculata/genomes/ISC494698.gnm1.KEY4              hap1 implicitly
Chamaecrista/fasciculata/genomes/ISC494698.gnm1_h2.KEY4           hap2 
Chamaecrista/fasciculata/annotations/ISC494698.gnm1.ann1.KEY4     hap1 implicitly
Chamaecrista/fasciculata/annotations/ISC494698.gnm1_h2.ann1.KEY4  hap2

Both haploid assemblies available, and a merged (2n) assembly is provided

Medicago/sativa/genomes/XinJiangDaYe.gnm1_2n.KEY4                 2n ploidy; 32 chromosomes
Medicago/sativa/annotations/XinJiangDaYe.gnm1_2n.ann1.KEY4        2n ploidy; 32 chromosomes

Medicago/sativa/genomes/XinJiangDaYe.gnm1_h1.KEY4                 hap1 
Medicago/sativa/genomes/XinJiangDaYe.gnm1_h2.KEY4                 hap2 
Medicago/sativa/genomes/XinJiangDaYe.gnm1_h3.KEY4                 hap3 
Medicago/sativa/genomes/XinJiangDaYe.gnm1_h4.KEY4                 hap4 

Medicago/sativa/annotations/XinJiangDaYe.gnm1_h1.ann1.KEY4        hap1 
Medicago/sativa/annotations/XinJiangDaYe.gnm1_h2.ann1.KEY4        hap2 
Medicago/sativa/annotations/XinJiangDaYe.gnm1_h3.ann1.KEY4        hap3 
Medicago/sativa/annotations/XinJiangDaYe.gnm1_h4.ann1.KEY4        hap4

Both haploid assemblies available, and a merged (2n) assembly is provided

Medicago/sativa/genomes/CADL_HM342.gnm2_2n.KEY4                   2n ploidy; 16 chromosomes
Medicago/sativa/annotations/CADL_HM342.gnm2_2n.ann1.KEY4          2n ploidy; 16 chromosomes

Medicago/sativa/genomes/CADL_HM342.gnm2.KEY4                      hap1 implicitly
Medicago/sativa/genomes/CADL_HM342.gnm2_h2.KEY4                   hap2 

Medicago/sativa/annotations/CADL_HM342.gnm2.ann1.KEY4             hap1 implicitly
Medicago/sativa/annotations/CADL_HM342.gnm2_h2.ann1.KEY4          hap2

adf-ncgr commented 5 months ago

Thanks @StevenCannon-USDA for the extra option to consider (who knew there would be so many ways to recombine genetic information for non-inbred genomes?!). I have to say that I'm not all that wild about this last one, though. Part of the idea behind the last one I'd suggested was to avoid a proliferation of collections and the new proposal increases the count even above your original option 2. I also think it could be confusing to our software/data management to have some collections that are combinations of other collections, though we could probably find ways to deal with it. What's the advantage you see of doing things this way over what I'd proposed (one collection with files representing the primary/all options)? Looking forward to hearing some others weigh in too!

StevenCannon-USDA commented 5 months ago

@adf-ncgr The hybrid model that I sketched was trying to accomplish your suggestion of "another file that contains all haplotypes together." I was just putting that into a separate collection -- since I don't think we should put different assembly variants into a single collection, as that would complicate the preparation, the metadata, and the genome--annotation relationship.

But you make a good point about the "proliferation of collections." So, I'll now vote against my proposal for a merged collection type (gnm1_2n).

So, my proposal reduces back to the original Option 2: "Separate the assembly and annotations by haplotype", with possible minor tweaks to the directory and file naming conventions. For that, I am reasonably happy with this:

Cercis/canadensis/genomes/cerca.ISC453364.gnm3.KEY4               hap1 implicitly
Cercis/canadensis/genomes/cerca.ISC453364.gnm3_h2.KEY4            hap2 
Cercis/canadensis/annotations/cerca.ISC453364.gnm3.ann1.KEY4      hap1 implicitly
Cercis/canadensis/annotations/cerca.ISC453364.gnm3_h2.ann1.KEY4   hap2

adf-ncgr commented 5 months ago

@StevenCannon-USDA I don't see how having an extra file containing all of the haplotypes together complicates things any more than having all transcript/protein models together in an extra file does. Can you elaborate? A compromise might be to have two collections, one for the "primary" haplotype and one for "all together". I'm not crazy about having them as separate collections, since they share some data (and presumably all of the provenance metadata would be identical). But could probably live with it. It might seem a little weird in the case of diploids to do it this way, but I think it has advantages for the autotetraploids (2 collections vs 4). One sticking point is that the identifiers for the entities that are shared between the "primary" and "all together" collections would not in fact be identical due to the full yuckification being different in the two contexts. That's why I'd prefer my one collection solution. But your all collections as separate would accomplish that too since there would be no repeated info other than the metadata in that scenario.

ctcncgr commented 4 months ago

@StevenCannon-USDA @adf-ncgr, Just to put my two cents in...

We have seen in our recent Medsa assemblies that take advantage of hifi with hic scaffolding within hifiasm that we can create decently apportioned haplopiles that are phased, binned, and pseudo-chromosomed to the best of their ability based on the data. One of the consequences of this is that the assembly as a whole is ~99.8% complete as estimated by BUSCO against fabalesodb10, but the completeness of a given haplotype only reaches ~96%. My initial thought was, "this seems strange...", however, after thinking about it more, there are going to be a number of genes that have drifted after the whole genome duplication due to weak selection and genetic pressure for certain classes of genes within the haplotypes. Therefore it isn't that unlikely that a given haplotype will NOT represent the complete genetic space of the organism making it seem like genes are missing from the assembly if we consider only one haplotype. In fact this is exactly what happens if you use only one of the halotypes from the other tetraploid alfalfa that have already been published. I was also able to show this using genes from the diploid and I am quite confident in the assembly copy numbers based on the kmer inclusion spectra (let me know if you want to know more).

The other issue I see with this is we are creating unnessesary additional description files and data in a way that seems like it could live within a single metadata file (we are using structured metadata).

I'm not really that concerned with what we do as long as it's programmatic, but I would like to see us thinking more about representing haploid resolved assemblies in a combined way, such as some kind of sequence graph instead of as individual files. Someday the pan space will probably just let us do this anyway and maybe we should spend some more time thinking about this. I'm conceptually not a big fan of splitting up haploid resolved assemblies in general because I feel its a lot of work and provides minimal value beyond solving classical issues with short read alignment ambiguity. In my mind there are better ways to solve this. It also discounts what I think was the point of doing the assembly in a combined way in the first place.

StevenCannon-USDA commented 4 months ago

Thanks, @adf-ncgr and @ctcncgr .

Where I am at now about this issue, fwiw:

It comes down to a human convention, whether to represent assemblies as 2n, with all chromosomes in a single file; or as pairs of 1n files, with two sets of haplotypes. So we can do whatever we want; it's just up to us to provide a justification and to ensure sufficient consistency for our own sanity.

For almost all of the ~100 genomes that we have handled to date, the genomes are in 1n file collections. As we start getting haploid-resolved genomes, I am inclined to keep representing them in 1n collections -- which means two collections per genome (hap1 and a hap2, by whatever naming convention we choose). The reason: most analyses to-date have presumed 1n state. The canonical example is VCF files, which implicitly make this presumption (as evidenced by the ref and alt allele fields).

HOWEVER, tetraploid alfalfa really seems different, because I think that most analyses will presume 32 chromosomes. Therefore, in this case, I think it makes sense to house the assembly in a single, 32-chromosome collection -- as though it were a 1n genome with 32 chromosomes. This would require that the chromosomes be suitably named to permit them to be distinguished -- for example, Chr1.1 Chr1.2 Chr1.3 Chr1.4.

ctcncgr commented 4 months ago

Hey @StevenCannon-USDA I can agree to the idea that when it seems more appropriate to think about it biologically as a collection we do so. Which collections this applies to we can discuss later ;). Alfalfa is further confounded in this due to large numbers of rearrangements in offspring wrt parents, but having well phased parents will be useful in the future for pan strategies.

I think as we move towards haploid phased assemblies becoming the norm we can talk about grouping more, but the data is still as you are saying, largely 1h and produced using flags like, "--primary" in hifiasm.

Also as a quick addendum, even the "phased" assemblies we currently have are only maximally phased at like ~10Mb (I think) if we are very lucky. Its not that great of an art yet and I'm confident in the copy number binning but not the actual phasing of all alleles that are placed in each pile.

adf-ncgr commented 4 months ago

OK, so it sounds like we're just going to have two approaches (lumper and splitter) to be applied at the curator's discretion. I guess that's fine, we can always revisit later if we don't like the consequences of different genomes behaving differently in other parts of the various systems. So I'll just retract my earlier objection to having both options in play in the datastore. I still think that the single collection model with option to designate a primary assembly representing a single haplotype subset (analogous to all isoform/primary isoform for annotations) seems to provide the best of both worlds, but moving ahead with something we can all live with seems best.

maxglycine commented 4 months ago

@adf-ncgr @StevenCannon-USDA @ctcncgr I am not familiar with tetraploid alfalfa. IF it is possible to have different 4N alfalfa (different strains of 4N alfalfa) then the chromosome naming convention as stated above may not be adequate. You would have to have something like strain1.Chr1.1, strain1.Chr1.2, strain1.Chr1.3 and strain1.Chr1.4 wouldn't you?

StevenCannon-USDA commented 4 months ago

@maxglycine Good point. The full-yucked chromosome IDs would be like this: medsa.RegenSY27x.gnm1.Chr1.1 ... or however else the sequencing/annotation group decides to name the homoeologous chromosomes.

adf-ncgr commented 4 months ago

To make slightly more concrete, here are some existing alfalfa genomes; you'll see that medsa.XinJiangDaYe.gnm1 has 4 versions of chr2 present, whereas medsa.Zhongmu_No_1.gnm1 has only one. This doesn't reflect a biological difference but a choice made by the authors of each assembly. medsa.PI464715.gnm1 also only has one version of chr2, and it is from a diploid alfalfa (which would be more obvious if it were haplotype resolved).

StevenCannon-USDA commented 4 months ago

I added what I believe to the the consensus to the genomes and annotations specifications READMES. See here for genomes and here for annotations; and detailed data-preparation notes in PROTOCOLS/ds_souschef_prep_examples.

Will close this issue now.