Closed by ptrebert 2 years ago
First observations: so far, rule 1 captures something like 90+ percent of all contigs (i.e., not necessarily assembled Y bp), and rule 2 triggers occasionally.
For 18989, the beginning (~PAR1 up to CEN) is missing. It looks like there are no motif matches for the contig, only some spurious alignments to other chromosomes. Hence, rule 5 should also be implemented with a stringent threshold (say, >90% primary alignments to chrY?) @pilleh what do you think?
For the Y short arm, it kind of makes sense that it might be missing (PAR1 and ~3 Mb of XTR can probably map to chrX), but I would expect most of the primary alignments to be to chrY, at least from XTR. Not sure what the best threshold would be; could try >90% and see what happens? I'm kind of expecting most of Yp to be assembled as one large contig, which hopefully simplifies its identification, but PAR1 might be more fragmented.
For 18989, there is a contig spanning AMPL1, AMPL2, HET1_centro, other1, PAR1, XDR1, XDR2, XTR1, and XTR2 that is missed with the current selection strategy.
> try >90% and see what happens?
maybe another "thresholding" problem to revisit when all data are in and we can look at the final distribution; I go with >90% for now
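A minimal sketch of what a rule-5-style check could look like, assuming ">90% primary alignments to chrY" is measured by alignment count per contig (the thread does not say whether it is counted by alignments or by aligned bases). The function name, the tuple-based input format, and the constant are hypothetical and are not taken from the pipeline:

```python
from collections import defaultdict

# ">90%" as discussed in the thread; a provisional value that may be revisited.
CHRY_PRIMARY_FRACTION = 0.9


def select_by_primary_alignments(alignments, target="chrY",
                                 min_fraction=CHRY_PRIMARY_FRACTION):
    """Return contigs whose share of primary alignments to `target`
    is strictly greater than `min_fraction`.

    `alignments` is an iterable of (contig, target_chrom, is_primary)
    tuples; this input format is an assumption for illustration.
    """
    total = defaultdict(int)      # primary alignments per contig
    on_target = defaultdict(int)  # primary alignments to `target` per contig
    for contig, chrom, is_primary in alignments:
        if not is_primary:
            continue  # secondary/supplementary alignments are ignored
        total[contig] += 1
        if chrom == target:
            on_target[contig] += 1
    return sorted(
        contig for contig, n in total.items()
        if on_target[contig] / n > min_fraction
    )
```

Keeping the threshold as a parameter makes it easy to re-run the selection once the final distribution across all samples is available.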
For 18989 - yep, looks like the whole short arm then.
@ptrebert In addition to the steps in the pipeline, we can further confirm the origin of the Y contigs using HiC, and probably Bionano. So all in all, our approach to identify Y contigs should be pretty watertight.
rule 5 has been implemented, and for 18989, the missing contig is now selected (just testing locally, this change is not yet propagated throughout all results).
Re HiC/Bionano: yes, let's see how that goes.
done c05672c
Implement a simple (i.e., rule-based) strategy to identify chrY contigs in the de novo assemblies.
Current rule set:
Motif hits above threshold = "high-quality hits", thresholds set by expert curation
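
As an illustration of the "high-quality hits" definition, the following sketch filters motif hits against per-motif thresholds. The tuple layout, the function name, and the use of a single numeric score per hit are assumptions; the actual thresholds are set by expert curation and are not reproduced here:

```python
def high_quality_hits(hits, thresholds):
    """Keep only motif hits whose score is strictly above the motif's threshold.

    `hits` is an iterable of (contig, motif, score) tuples and `thresholds`
    maps motif name -> minimum score. Both layouts are hypothetical; motifs
    without a curated threshold are dropped.
    """
    return [
        (contig, motif, score)
        for contig, motif, score in hits
        if motif in thresholds and score > thresholds[motif]
    ]
```

With placeholder values, `high_quality_hits([("ctg1", "motifA", 1500)], {"motifA": 1000})` would keep the single hit, while a score of 1000 or lower would be filtered out.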