marschall-lab / project-male-assembly

HGSVC SIG: targeted chromosome Y assembly
MIT License

identify Y contigs #8

Closed ptrebert closed 2 years ago

ptrebert commented 2 years ago

Implement a simple, rule-based strategy to identify chrY contigs in the de novo assemblies.

Current rule set:

Motif hits above threshold = "high-quality hits", thresholds set by expert curation
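
To illustrate the "high-quality hits" definition above, here is a minimal sketch. It assumes one score threshold per motif and keeps only hits that score strictly above it. The motif names, scores, and threshold values are hypothetical placeholders, not the curated values used in the pipeline.

```python
from typing import Dict, List, NamedTuple


class MotifHit(NamedTuple):
    contig: str   # name of the assembled contig carrying the hit
    motif: str    # name of the sequence motif that matched
    score: float  # match score reported by the motif search


def high_quality_hits(
    hits: List[MotifHit], thresholds: Dict[str, float]
) -> List[MotifHit]:
    """Keep hits scoring strictly above the threshold set for their motif.

    Hits for motifs without a threshold are dropped.
    """
    return [
        hit
        for hit in hits
        if hit.motif in thresholds and hit.score > thresholds[hit.motif]
    ]


# Hypothetical thresholds and hits, for illustration only.
THRESHOLDS = {"motif_A": 1500.0, "motif_B": 300.0}
HITS = [
    MotifHit("contig_1", "motif_A", 1800.0),  # above threshold: kept
    MotifHit("contig_1", "motif_B", 120.0),   # below threshold: dropped
    MotifHit("contig_2", "motif_B", 450.0),   # above threshold: kept
    MotifHit("contig_3", "motif_C", 999.0),   # no threshold known: dropped
]
HQ_HITS = high_quality_hits(HITS, THRESHOLDS)
```

Keeping the thresholds in a plain mapping makes it easy to revise them once the final score distributions are available.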

ptrebert commented 2 years ago

First observations: so far, rule 1 captures roughly 90+ percent of all contigs (counted per contig, i.e., not necessarily 90+ percent of the assembled Y bp), and rule 2 triggers only occasionally.

For 18989, the beginning (~PAR1 up to CEN) is missing from the selected contigs: it looks like there are no motif matches for the contig, and there are some spurious alignments to other chromosomes. Hence, rule 5 should also be implemented with a stringent threshold (say, >90% of primary alignments to chrY?). @pilleh what do you think?

pilleh commented 2 years ago

For the Y short arm, it kind of makes sense that it might be missing (PAR1 and ~3 Mb of XTR can probably map to chrX), but I would expect most of the primary alignments to be to chrY, at least from XTR. Not sure what the best threshold would be; could try >90% and see what happens? I'm kind of expecting most of Yp to be assembled as one large contig, which hopefully simplifies its identification, but PAR1 might be more fragmented.

ptrebert commented 2 years ago

for 18989, there is a contig spanning

AMPL1
AMPL2
HET1_centro
other1
PAR1
XDR1
XDR2
XTR1
XTR2

that is missed with the current selection strategy

"try >90% and see what happens?"

Maybe another "thresholding" problem to revisit when all data are in and we can look at the final distribution; I'll go with >90% for now.
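
For reference, the >90% check discussed above could be sketched roughly as follows. This is not the pipeline's actual implementation: it assumes the fraction is computed over the number of primary alignments per contig (weighting by aligned bp would be an alternative), and the `Alignment` record and `select_by_chry_fraction` function are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Alignment(NamedTuple):
    contig: str       # name of the assembled contig (query)
    target: str       # reference chromosome the alignment maps to
    is_primary: bool  # True for primary alignments


def select_by_chry_fraction(
    alignments: Iterable[Alignment],
    min_fraction: float = 0.9,
    target: str = "chrY",
) -> Set[str]:
    """Return contigs with more than min_fraction of primary alignments on target.

    Assumption: the fraction is count-based, not weighted by aligned length.
    """
    total: Dict[str, int] = defaultdict(int)
    on_target: Dict[str, int] = defaultdict(int)
    for aln in alignments:
        if not aln.is_primary:
            continue  # secondary/supplementary alignments are ignored
        total[aln.contig] += 1
        if aln.target == target:
            on_target[aln.contig] += 1
    return {
        contig
        for contig, count in total.items()
        if on_target[contig] / count > min_fraction
    }


# Hypothetical example data.
ALIGNMENTS = (
    [Alignment("contig_1", "chrY", True)] * 19
    + [Alignment("contig_1", "chrX", True)]        # 19/20 = 0.95: selected
    + [Alignment("contig_2", "chrY", True)] * 9
    + [Alignment("contig_2", "chrX", True)]        # 9/10 = 0.90: not above 0.9
    + [Alignment("contig_3", "chrY", False)] * 5   # no primary alignments
)
SELECTED = select_by_chry_fraction(ALIGNMENTS)
```

Keeping `min_fraction` as a parameter makes it straightforward to revisit the cutoff once the final distribution is known.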

pilleh commented 2 years ago

For 18989 - yep, looks like the whole short arm then.

pilleh commented 2 years ago

@ptrebert In addition to the steps in the pipeline, we can further confirm the origin of the Y contigs using HiC, and probably BioNano. So all in all, our approach to identifying Y contigs should be pretty watertight.

ptrebert commented 2 years ago

rule 5 has been implemented, and for 18989, the missing contig is now selected (just testing locally, this change is not yet propagated throughout all results).

Re HiC/Bionano: yes, let's see how that goes.

done c05672c