High heterogeneity in sequences identity

lsoldini commented 2 years ago

Hello,

I am planning to do some transcriptomic analysis using vg mpmap and rpvg. To get there, I want to first build a .gfa graph with pggb.

My data are a bit unusual: I have one chromosome-level, haplotype-resolved assembly, plus a second assembly of the same quality that covers only one chromosome, i.e., one assembly of X chromosomes and another of a single chromosome. That chromosome carries large inversions spanning several Mb, whereas all other chromosomes are very similar or identical.

How would you build a graph with such data?

I was thinking I could complete the partial assembly with the chromosomes from the other one and then run pggb on the two 'full' assemblies, on the assumption that all identical regions would be recognized and collapsed. Alternatively, should I do separate pggb runs (one with everything except the divergent chromosome, the other with only that chromosome) and merge the .gfa files afterwards?

Edit: the title reflects my concern that having most regions at 100% identity and one chromosome at a much lower identity might be an issue.

subwaystation commented 1 year ago

So you have the following data sets:

haplotype-resolved assembly of all chromosomes (a)
haplotype-resolved assembly of one chromosome (b)
assembly of one chromosome with large inversions (c)

Are the chromosomes of (b) and (c) the same? Did you take a look at https://pggb.readthedocs.io/en/latest/rst/tutorials/divergence_estimation.html in order to measure the actual sequence divergence?
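For a rough feel of what such a divergence estimate involves, here is a minimal sketch in Python. It is not the procedure from the linked tutorial: it computes an exact k-mer Jaccard index on two in-memory sequences (no MinHash sketching, no canonical k-mers) and converts it with the Mash distance formula D = -ln(2j / (1 + j)) / k. The function names and the default k = 15 are assumptions made for illustration only.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in seq (upper-cased, forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_like_distance(seq_a, seq_b, k=15):
    """Approximate per-base divergence from the exact k-mer Jaccard index.

    Uses the Mash distance formula D = -ln(2j / (1 + j)) / k, clamped to [0, 1].
    Hypothetical helper: real tools sketch the k-mer sets instead of
    enumerating them, so this only suits short test sequences.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    j = len(a & b) / len(union)
    if j == 0:
        return 1.0  # no shared k-mers: treat as maximally divergent
    d = -math.log(2 * j / (1 + j)) / k
    return min(1.0, max(0.0, d))
```

Under these assumptions, `100 * (1 - mash_like_distance(ax, bx))` gives an approximate percent identity for the AX/BX pair, which is the kind of number one would compare against pggb's mapping identity threshold before choosing a value.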

lsoldini commented 1 year ago

Sorry, that was not clear. There are two populations, A and B, with basically no divergence between them except on one chromosome (say population A has version AX and population B has version BX of chromosome X). I have:

One haplotype-resolved assembly of all chromosomes except X (common to both A and B)
One haplotype-resolved assembly of chromosome AX
One haplotype-resolved assembly of chromosome BX

Chromosomes AX and BX have diverged considerably because of loss of recombination and large inversions.

I want to build a graph of the whole genome for further use with the vg toolkit. To do so, I'd like to use two assemblies in which all chromosomes are identical except one. Would this be a problem for pggb, given that the sequence divergence setting would be tuned for the one chromosome that differs?
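To make the proposed input concrete, here is a minimal sketch of how the two 'full' assemblies could be put together in Python. It assumes PanSN-style sequence names (`sample#haplotype#contig`), uses the hypothetical sample labels `A` and `B`, and collapses each assembly to a single haplotype `1` for brevity; none of the function names come from pggb.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name, chunks = line[1:].split()[0], []
        else:
            if name is None:
                raise ValueError("sequence data found before any FASTA header")
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def pansn_records(records, sample, haplotype):
    """Prefix each contig name with 'sample#haplotype#' (PanSN-style naming)."""
    for name, seq in records:
        yield f"{sample}#{haplotype}#{name}", seq


def build_input(shared, chr_ax, chr_bx):
    """Return one FASTA string holding two full assemblies.

    Sample A = shared chromosomes + AX, sample B = shared chromosomes + BX.
    The labels 'A'/'B' and haplotype 1 are illustrative assumptions.
    """
    out = []
    for sample, specific in (("A", chr_ax), ("B", chr_bx)):
        records = list(shared) + list(specific)
        for name, seq in pansn_records(records, sample, 1):
            out.append(f">{name}\n{seq}\n")
    return "".join(out)
```

The resulting FASTA would then be compressed, indexed, and passed to pggb as a single input, with the haplotype count set to 2 in this simplified case. Duplicating the shared chromosomes this way means they enter the graph as two identical copies per chromosome, which is exactly the situation the question is about.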

pangenome / pggb

High heterogeneity in sequences identity #247