marschall-lab / project-male-assembly

HGSVC SIG: targeted chromosome Y assembly
MIT License

hg38 Y seq annotations: align to all de novo Y assm #2

Open ptrebert opened 2 years ago

ptrebert commented 2 years ago

Pille:

Additional Y sequence annotations from hg38 - please add a step to align these to all chrY assemblies:
- using '--secondary=no'
- using '--secondary=yes', without restricting the number of secondary alignments allowed
I've added both the BED and FASTA files to Globus under /HHU/references/ - GRCh38_chrY-seq-classes_coord_plus_repeats.bed/fasta
These contain most of the Y repeats and SDs, so they are useful to understand the Y structure.
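
The two alignment modes requested above could be set up roughly as follows. This is a minimal sketch that assumes minimap2 is the aligner (the `--secondary` option matches its syntax, but the thread does not name the tool). The function name, the assembly file name, and the large `-N` value used to effectively lift the cap on secondary alignments are illustrative assumptions, not the pipeline's actual settings.

```python
from typing import List

# Assumption: minimap2 is the aligner; the thread only gives the --secondary flag.
# Assumption: a very large -N stands in for "no limit" on secondary alignments
# (minimap2 retains at most 5 by default).
UNRESTRICTED_SECONDARY = 100000


def build_alignment_commands(assembly_fasta: str, seqclass_fasta: str) -> List[List[str]]:
    """Return argv lists for the two requested runs (nothing is executed here)."""
    base = ["minimap2", "-a"]  # -a: write SAM output
    # Run 1: primary alignments only.
    no_secondary = base + ["--secondary=no", assembly_fasta, seqclass_fasta]
    # Run 2: keep secondary alignments, with the count cap raised.
    all_secondary = base + [
        "--secondary=yes",
        "-N", str(UNRESTRICTED_SECONDARY),
        assembly_fasta,
        seqclass_fasta,
    ]
    return [no_secondary, all_secondary]


if __name__ == "__main__":
    # "sample.chrY.fasta" is a hypothetical assembly file name.
    for cmd in build_alignment_commands(
        "sample.chrY.fasta",
        "GRCh38_chrY-seq-classes_coord_plus_repeats.fasta",
    ):
        print(" ".join(cmd))
```

Note that minimap2 also filters secondary alignments by score ratio (`-p`), so raising `-N` alone may not report every repeat copy; whether that threshold should be lowered as well is left open here.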

ptrebert commented 2 years ago

@pilleh Regarding this file GRCh38_chrY-seq-classes_coord_plus_repeats.bed: if you add more coordinate (BED) files in the future, please try to make sure that they are sorted.
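
For reference, sorting a BED file as requested above might look like the following. This is only a sketch: it assumes tab-separated records, sorts chromosome names lexicographically (comparable to `sort -k1,1 -k2,2n`), and the helper names are made up for illustration.

```python
from typing import List, Tuple


def is_header(line: str) -> bool:
    """Comment, track, and browser lines are kept on top and not sorted."""
    return line.startswith(("#", "track", "browser"))


def bed_sort_key(line: str) -> Tuple[str, int, int]:
    """Sort key: chromosome name, then numeric start, then numeric end."""
    chrom, start, end = line.split("\t")[:3]
    return chrom, int(start), int(end)


def sort_bed_lines(lines: List[str]) -> List[str]:
    """Return header lines followed by records in coordinate order."""
    headers = [ln for ln in lines if is_header(ln)]
    records = [ln for ln in lines if ln.strip() and not is_header(ln)]
    return headers + sorted(records, key=bed_sort_key)


def is_bed_sorted(lines: List[str]) -> bool:
    """Check whether the records are already in coordinate order."""
    keys = [bed_sort_key(ln) for ln in lines if ln.strip() and not is_header(ln)]
    return keys == sorted(keys)


if __name__ == "__main__":
    # Made-up records, not taken from the actual annotation file.
    demo = ["chrY\t500\t900\tclassB\n", "chrY\t100\t200\tclassA\n"]
    print(is_bed_sorted(demo))
    print("".join(sort_bed_lines(demo)), end="")
```

A check like `is_bed_sorted` could run before a new coordinate file is uploaded, so that unsorted input is caught early.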

Is it worthwhile to bug the T2T folks about similar annotation (SD and repeat) for the T2T Y?

pilleh commented 2 years ago

@ptrebert Sure thing, I'll sort them in the future, thanks. In principle we need to do this annotation anyway, also for T2T, and I have kind of been doing it until now. In most regions it's quite straightforward, but in some of the ampliconic regions it's a bit of a pain as they are quite rearranged. Might be good to discuss this in one of the coming Y meetings.

ptrebert commented 2 years ago

This has now been implemented for the GRCh38 file.

pilleh commented 2 years ago

@ptrebert I'm sorry, but I noticed that one more of these sequence class end/start points (specifically for XDR2/AMPL2) didn't make sense, so I modified it slightly. This moved the boundary by ~18 kb towards PAR1. This should not affect chrY contig identification, but it would be good to re-run the 'hg38 and T2T Y seq. classes to assembly' steps in case you have those implemented. I've added new versions to the HHU reference folder: T2T.chrY-seq-classes-NEW.bed and GRCh38_chrY-seq-classes_coord_plus_repeats_NEW.bed. I'm sorry about this. These coordinates have been merged from a few previous publications, which does not simplify matters much. I hope I won't have to mess with them again.
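
One way to see exactly which boundaries differ between an old and a new coordinate file is a small comparison like the one below. It is a sketch under assumptions: the fourth BED column is taken to hold a unique sequence class name, and the function names and example coordinates are invented for illustration.

```python
from typing import Dict, Iterable, Tuple


def load_intervals(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each sequence class name (assumed: BED column 4) to (start, end)."""
    intervals = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        _chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        intervals[name] = (int(start), int(end))
    return intervals


def boundary_shifts(
    old_lines: Iterable[str], new_lines: Iterable[str]
) -> Dict[str, Tuple[int, int]]:
    """Return (start shift, end shift) in bp for every class whose coordinates changed."""
    old = load_intervals(old_lines)
    new = load_intervals(new_lines)
    shifts = {}
    for name in sorted(old.keys() & new.keys()):
        delta_start = new[name][0] - old[name][0]
        delta_end = new[name][1] - old[name][1]
        if delta_start or delta_end:
            shifts[name] = (delta_start, delta_end)
    return shifts


if __name__ == "__main__":
    # Invented coordinates and class names; the real files live in the HHU reference folder.
    old_bed = ["chrY\t0\t1000\tclassA\n", "chrY\t1000\t2000\tclassB\n"]
    new_bed = ["chrY\t0\t900\tclassA\n", "chrY\t900\t2000\tclassB\n"]
    for name, (d_start, d_end) in boundary_shifts(old_bed, new_bed).items():
        print(f"{name}\tstart {d_start:+d} bp\tend {d_end:+d} bp")
```

Classes that appear in only one of the two files are ignored in this sketch; reporting them separately would be a natural extension.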

ptrebert commented 2 years ago

> This should not affect chrY contig identification

The problem with these types of assumptions is that they might be wrong for certain samples, and we won't notice until much later :-) Shifting boundaries may affect contig renaming, though, which implies that everything has to be rerun anyway. I will hold off on that until HMMER has been updated (they implemented a fix, but it still needs to be merged into their code). Can you "quantify" this in today's call so that we can get a feeling for how likely it is that we need more of these sequence class updates in the future?

> one more of these sequence class end/start points didn't make sense