marbl / CHM13

The complete sequence of a human genome

Why are PAR regions in X and Y different sizes? #81

Closed duartemolha closed 1 year ago

duartemolha commented 1 year ago

Hi

I was surprised to find that the PARs on X and Y do not match in length.

It was my understanding that the PARs should be exactly the same size on both chromosomes.

arangrhie commented 1 year ago

Hello,

The PARs are not expected to be exactly the same size. That was assumed only because they were copy-pasted that way in the GRCh reference.

Also note that the X comes from CHM13 and the Y comes from HG002. The PARs are not the same size on HG002XY either, because there are a few X/Y-specific variations.

You can see this even from short-read variant calls: after masking the PARs on the Y and mapping and variant calling against the X PARs, a handful of variants are still called as "heterozygous".
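As a rough illustration of that check, here is a minimal sketch that counts heterozygous genotype calls inside an interval of a single-sample VCF. The function name, the toy records, and the interval coordinates are all made up for the example; they are not real PAR boundaries or real calls.

```python
def count_het_calls(vcf_lines, chrom, start, end):
    """Count heterozygous diploid calls for the first sample
    within chrom:start-end (1-based, inclusive)."""
    het = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom:
            continue
        pos = int(fields[1])
        if not (start <= pos <= end):
            continue
        # Locate GT via the FORMAT column, then read it from the first sample.
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        genotype = sample_values[format_keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        if len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]:
            het += 1
    return het


# Toy records (hypothetical), not real variant calls.
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chrX\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
    "chrX\t2000\t.\tC\tT\t50\tPASS\t.\tGT\t1/1",
    "chrX\t3000\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0|1:30",
    "chrX\t9000\t.\tT\tC\t50\tPASS\t.\tGT\t0/1",
]

# Placeholder interval standing in for an X PAR; use the published intervals in practice.
TOY_PAR_START, TOY_PAR_END = 1, 5000
print(count_het_calls(records, "chrX", TOY_PAR_START, TOY_PAR_END))  # 2
```

With the Y PARs masked, reads from both X and Y PARs pile up on the X copy, so any X/Y difference shows up as a heterozygous call in such a count.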

There are more details in the T2T-Y preprint (https://www.biorxiv.org/content/10.1101/2022.12.01.518724v1); see Fig. 3d and Extended Data Fig. 9a.

Best, Arang

duartemolha commented 1 year ago

Saying that the PARs are not the same size because they contain variation does not make sense to me.

That is the same as saying autosomal pairs would have different sequences because one of them has a variant.

They are autosomal, and there is crossover with every meiosis.

Does crossover not occur in the PARs?

skoren commented 1 year ago

Not sure I follow your question; the autosomes are not identical in size in a diploid genome either. In this case, the X and Y are also from different samples/cell lines, so they would never have been present in the same cell.

duartemolha commented 1 year ago

I think we are in disagreement because I am looking at T2T as a reference. Maybe that is incorrect.

GRCh37 and GRCh38 only contain the sequence of one chromosome 1, correct? They assume both chromosomes of the pair have the same sequence.

In humans every chromosome copy has a different sequence because each carries its own mutations. But that is not contained in the reference.

The same should be the case for the PARs in T2T.

ekg commented 1 year ago

As far as I know, the PARs are not guaranteed a crossover per meiosis. I'm not sure what information you're working from.

It's totally natural that they have different lengths, and totally unnatural that the versions in GRC were exactly the same.

Note that T2T-CHM13 contains pseudo-homologous regions (PHRs) on the acrocentric short arms, which support recombination between heterologs: https://doi.org/10.1038/s41586-023-05976-y

In the past we would mask regions like the PARs and PHRs. Now with the advent of diploid complete assemblies I think we will need to keep track of them rather than trying to mask them out.

You can of course mask using the intervals, should your application depend on having only one copy of each homologous region.

But this is a slippery problem. There are big segmental duplications of all types which support recombination. Although in most cases it is driven by non-crossover events or gene conversion, this still leads to homogenization and causes all the same issues for tools that assume there is only one copy of each locus in the reference.
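For the masking option mentioned above, a minimal sketch of hard-masking intervals in a sequence might look like the following. The function name, the toy sequence, and the interval coordinates are hypothetical; in practice you would take the intervals from the published annotation and would likely use an existing tool rather than hand-rolled code.

```python
def mask_intervals(seq, intervals):
    """Return seq with every 0-based, half-open (start, end) interval
    replaced by 'N' (hard masking)."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


# Toy sequence and intervals (hypothetical), standing in for chrY and its two PARs.
toy_chry = "ACGTACGTACGTACGTACGT"
toy_par_intervals = [(0, 4), (16, 20)]

masked = mask_intervals(toy_chry, toy_par_intervals)
print(masked)  # NNNNACGTACGTACGTNNNN
```

Masking one copy this way means reads from both copies map to the remaining one, which is exactly why the remaining copy then shows "heterozygous" calls wherever the two copies differ.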

duartemolha commented 1 year ago

Guaranteed crossover... I do not think so.

But I do believe crossover can occur, the same as in every autosomal region of the autosomes.

Sure, there are biallelic references in the pangenome. But the alignment is not done against each individual sample. It is done against the graph... correct?

If we made a graph for CHM13 there would be only one path through the majority of the PARs.

But OK. I think I understand the source of my confusion.

I was thinking of references as linear... like the ones until now.

ekg commented 1 year ago

In a pangenome graph built for read mapping, these regions will likely be collapsed. That means that there will be one region of the graph with two or more chromosomes (paths) from the same reference genome.

I say this somewhat hypothetically because we don't have many experiments based on graphs that collapse them, but there is evidence that this actually helps downstream applications and makes things easier for read mapping to the pangenome.

If the graph is built without chromosome partitioning, I'd expect the acrocentric PHRs, the PARs and the XTR to come together in this way.
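To make "collapsed" concrete, here is a toy sketch that finds graph nodes visited by two or more chromosomes (paths) of the same genome. The `genome#chrom` path naming, the function name, and the node IDs are assumptions made for the example; they do not describe any particular pangenome build.

```python
from collections import defaultdict


def nodes_shared_within_genome(paths):
    """paths maps 'genome#chrom' -> list of node ids.
    Returns {genome: {node: [chroms]}} for nodes that two or more
    chromosomes of the same genome pass through."""
    visits = defaultdict(lambda: defaultdict(set))
    for name, nodes in paths.items():
        genome, chrom = name.split("#", 1)
        for node in nodes:
            visits[genome][node].add(chrom)

    shared = {}
    for genome, by_node in visits.items():
        hits = {
            node: sorted(chroms)
            for node, chroms in by_node.items()
            if len(chroms) > 1
        }
        if hits:
            shared[genome] = hits
    return shared


# Toy graph (hypothetical): nodes 1 and 2 stand in for a collapsed PAR.
toy_paths = {
    "HG002#chrX": [1, 2, 3, 10, 11],
    "HG002#chrY": [1, 2, 4, 20, 21],
}

print(nodes_shared_within_genome(toy_paths))
# {'HG002': {1: ['chrX', 'chrY'], 2: ['chrX', 'chrY']}}
```

In this toy case the X and Y paths share nodes 1 and 2 and diverge at nodes 3 and 4, which is how an X/Y-specific difference inside an otherwise collapsed region would appear.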

Other very large segmental duplications might behave like PHRs. We need to do the survey. If they aren't sub-telomeric, then crossover will probably be selected out and all we will see is gene conversion. Telomeric PHRs (like the PARs) should support crossover because the disruption to the chromosomes involved is minimal.
