marschall-lab / project-male-assembly

HGSVC SIG: targeted chromsome Y assembly
MIT License
8 stars 1 forks source link

orient/order chrY contigs using T2T alignments #3

Closed ptrebert closed 2 years ago

ptrebert commented 2 years ago

Order (orient) identified chrY contigs on the basis of the contig-to-reference alignments relative to T2T. Contigs should be renamed; coordinate with Pille about naming convention.

Pille:

Contig names - maybe something as easy as just numbering from PAR1 as 1 to PAR2. E.g. chrY_01 to chrY_N?
Or do you think it would be worth retaining the original contig names as well?
[...] does it make sense to keep those with original contig names, or do we just rename the Y contigs?
ptrebert commented 2 years ago

I have prototyped a very simple approach that determines contig order based on the contig alignment order, and orientation as a simple majority decision (= most of the contig aligns forward or revcomp to T2T). My proposal for the renaming would then be as follows:

Examples:

# if no two tigs span the same region(s), part number is always 1 (padded: 01)
chrY.21-23.01.AMPL7-HET3_Yq.FW.utig4-1293.HC18989

# enumerated part numbers to get a proper sort order
chrY.09-09.01.HET1_centro-HET1_centro.RV.utig4-1302.HC18989
chrY.09-09.02.HET1_centro-HET1_centro.RV.utig4-1408.HC18989

# if any: unaligned contigs get default values
# and thus end up at the end of any sorted list
chrY.99-99.00.UNK-UNK.NA.utigN-XXX.SAMPLE

@pilleh what do you think?

edit: added field 7 (sample name)

pilleh commented 2 years ago

Sounds great, very informative names

ptrebert commented 2 years ago

I think I am going to append also the (simple) sample name at the end. This might be useful for analyses that compare all-vs-all sequences or the like...

ptrebert commented 2 years ago

this seems to work as intended

pilleh commented 2 years ago

@ptrebert Just wanted to double-check - are you reversing (or planning to) Y contigs to the same orientation with the T2T Y? I was about to check the Y sequence classes of HG01890 as it looks pretty complete as well (and dump it to Mark&Miriam after :P), but it's in the reverse orientation. I can rev complement it unless it's easy to do it as part of the pipeline. What do you think?

ptrebert commented 2 years ago

Summary of the discussion about this (2022-03-29):

pilleh commented 2 years ago

@ptrebert I've been thinking about this as well and I kept reaching the same conclusion - that it would just be easier if any chrY contigs were rc'd as early in the pipeline as possible, both in the WG and chrY sets. I don't know how to best set this up, but even re-running hmmer with the chrY motifs on whole genome assembly and RepeatMasker chrY after Y contigs are rc'd would be very useful, so at least all the coordinates would be consistent.

ptrebert commented 2 years ago

"the setup" is trivial, and yes, I also think a little bit of computational overhead to make everything consistent downstream is better. I had to spend a lot of time today debugging another project (to get this off my desk), but I think all of the above will be ready to update the current results starting tomorrow.

pilleh commented 2 years ago

@ptrebert thanks and sounds good :) I hope I'll be able to do something more meaningful from tomorrow onwards... this week so far has almost entirely gone into replacing Qihui...

ptrebert commented 2 years ago

All available assemblies are now updated (wg and Y-only) with renamed and reoriented contigs if applicable. Our cluster is currently at capacity, so this may take a while.

ptrebert commented 2 years ago

Reported by Pille:

I have found a few chrY contigs which are mislabelled and I wonder what to do with these if anything (e.g. chrY.07-21.01.AMPL2-AMPL7.FW.unassigned-0001334.NA18989 - it actually only contains AMPL7). There are 2 others like this where just the label could be edited.

Checking the contig alignments reveals that there are two small alignments to AMPL2 (last two rows):

unassigned-0001334      1100666 0       999588  +       chrY    62460029        23487298        24482110        981085  1013006 60
unassigned-0001334      1100666 1024114 1100666 -       chrY    62460029        26931941        27008494        76532   76557   60
unassigned-0001334      1100666 1007007 1023490 -       chrY    62460029        10311075        10319612        7171    17018   60
unassigned-0001334      1100666 1005701 1006704 -       chrY    62460029        10319608        10320590        883     1018    60

Depending on how the other two cases look like (@pilleh please add them to this issue), the label could be made more appropriate by introducing more rules, e.g.

if less than X% of the contig aligns to a distant (non-neighboring) sequence class, omit this alignment from the new contig label

for the example above, ~97% of the contig aligns to AMPL7, so a possible threshold could be 5%

pilleh commented 2 years ago

Hey, These are all 3 I have found so far (mislabelling): chrY.07-21.01.AMPL2-AMPL7.FW.unassigned-0001334.NA18989 - contains only AMPL7 chrY.06-21.01.XDR2-AMPL7.RV.unassigned-0009268.HG03492 - contains only AMPL7 chrY.07-21.01.AMPL2-AMPL7.RV.unassigned-0002422.HG02572 - contains only AMPL7

The one contig which should be reversed, is this: chrY.03-09.01.XTR1-HET1_centro.FW.unassigned-0009118.HG03492

If it is easy to add a rule like the one you suggest then yes sure. I suppose the risk is that it might cause problems elsewhere, but we won't know unless we try right

pilleh commented 2 years ago

I'm sure overall my checks are limited as well - there is a good chance I'm not picking everything up...

ptrebert commented 2 years ago

For HG032492, we have this short alignment

unassigned-0009268      3464153 0       88652   -       chrY    62460029        7230840 7317118 81193   90941   27

about ~2.5% of the contig length.

For HG02572, we have this:

unassigned-0002422      898225  106459  898225  -       chrY    62460029        25317486        26102100        752899  823139  60
unassigned-0002422      898225  0       81936   +       chrY    62460029        26926557        27008494        81912   81943   60
unassigned-0002422      898225  82560   99040   +       chrY    62460029        10311075        10319612        7171    17018   60
unassigned-0002422      898225  99343   100346  +       chrY    62460029        10319608        10320590        883     1018    60

about ~2% of the contig length.

I will change the renaming as follows then:

  1. compute the pct. alignment for all consecutive hits among the seq. classes
  2. in the cases above, XDR2 or AMPL2 are subsequences of length 1 (no neighboring seq. classes), with low alignment percentage (let's fix 5%)
  3. for renaming, these sequence classes are omitted
  4. a long contig should, however, not be split in two if there are alignment problems in small regions such as "other", i.e. AMPL1 - XTR2 - XDR2 - AMPL2 - <gap in aln / other1 is missed> - HET1_centro - XDR3 - AMPL3 should still be labeled as "AMPL1-AMPL3", and not omit either of the two sides.

Does that make sense?

ptrebert commented 2 years ago

Re extracting chrX: is it sufficient to just get the contigs extracted (that would be trivial to achieve)?

pilleh commented 2 years ago

Re extracting chrX: is it sufficient to just get the contigs extracted (that would be trivial to achieve)?

yes, I think this would be more than enough :)

pilleh commented 2 years ago

For HG032492, we have this short alignment

unassigned-0009268      3464153 0       88652   -       chrY    62460029        7230840 7317118 81193   90941   27

about ~2.5% of the contig length.

For HG02572, we have this:

unassigned-0002422      898225  106459  898225  -       chrY    62460029        25317486        26102100        752899  823139  60
unassigned-0002422      898225  0       81936   +       chrY    62460029        26926557        27008494        81912   81943   60
unassigned-0002422      898225  82560   99040   +       chrY    62460029        10311075        10319612        7171    17018   60
unassigned-0002422      898225  99343   100346  +       chrY    62460029        10319608        10320590        883     1018    60

about ~2% of the contig length.

I will change the renaming as follows then:

  1. compute the pct. alignment for all consecutive hits among the seq. classes
  2. in the cases above, XDR2 or AMPL2 are subsequences of length 1 (no neighboring seq. classes), with low alignment percentage (let's fix 5%)
  3. for renaming, these sequence classes are omitted
  4. a long contig should, however, not be split in two if there are alignment problems in small regions such as "other", i.e. AMPL1 - XTR2 - XDR2 - AMPL2 - <gap in aln / other1 is missed> - HET1_centro - XDR3 - AMPL3 should still be labeled as "AMPL1-AMPL3", and not omit either of the two sides.

Does that make sense?

Yes, I think it makes sense. There are a few samples with large inversions that kind of mess up the 'expected' sequence class order. But in case those get mislabelled, I should be able to pick it up. Will you re-run the contig labelling for all samples?

ptrebert commented 2 years ago

I will rerun that when the current batch is done (hopefully, that would then also include NA19317). Until then, you have time to identify more of these inverted cases, there is no other option but hard-coding those, I'm afraid

Re chrX: ok, will also be added when rerunning the entire thing.

pilleh commented 2 years ago

thanks, sounds great :)

ptrebert commented 2 years ago

I have finished implementing the necessary changes, and the test for HG03492 gives the desired output (old vs new, all other contig names stay as-is):

9d8
< chrY.06-21.01.XDR2-AMPL7.RV.unassigned-0009268.HG03492
16a16
> chrY.21-21.04.AMPL7-AMPL7.RV.unassigned-0009268.HG03492

Of course, no way of knowing how this might affect other contig names, but hopefully the effects are limited.

ptrebert commented 2 years ago

identifying chrX contigs is available as of 195313a

note renamed script: determine_y_contigs.py to identify_contigs.py

pilleh commented 2 years ago

I have finished implementing the necessary changes, and the test for HG03492 gives the desired output (old vs new, all other contig names stay as-is):

9d8
< chrY.06-21.01.XDR2-AMPL7.RV.unassigned-0009268.HG03492
16a16
> chrY.21-21.04.AMPL7-AMPL7.RV.unassigned-0009268.HG03492

Of course, no way of knowing how this might affect other contig names, but hopefully the effects are limited.

Sounds good! How it affects other contig names - this should be pretty straightforward for me to check, so I'll check once you have the new names. You will re-run the whole contig identification part?

ptrebert commented 2 years ago

to be precise: the renaming will be re-run, but not the (sequence) identification. the computational overhead/aftermath is manageable, although not neglectable given the amount of data that we have by now

pilleh commented 2 years ago

Okay, makes sense. I can just check how the naming matches with before and if there are major differences, then check more thoroughly.

ptrebert commented 2 years ago

I just started another test run for HC02666. The renaming (incl. chrX) now takes only a few minutes per assembly, i.e. the renamed assemblies should be available until the end of the week. I am also testing the error detection just for (haploid) chrX and chrY now. That requires a new read alignment because of the potentially changed names, so this will certainly take a few days to finish.

ptrebert commented 2 years ago

just for documentation purposes, and since this is the only contig that you identified as actually inverted, I hard-coded the following exception:

pilleh commented 2 years ago

@ptrebert Another sample with the large inversion in Yp - NA19239, and this contig should be reversed: chrY.04-09.01.AMPL1-HET1_centro.RV.unassigned-0002359.NA19239

ptrebert commented 2 years ago

ok, I have modified the renaming script again for this sample/contig and gonna rerun that.

pilleh commented 2 years ago

@ptrebert thank you!

pilleh commented 2 years ago

@ptrebert I'm so sorry, I was really hoping I would not have to bother you any longer. I've gone through all the other samples where the contig names/order has been changed and they all look fine. But something funny has happened with NA19239. This contig that should have been inverted (chrY.04-09.01.AMPL1-HET1_centro.RV.unassigned-0002359.NA19239), is now substantially larger in size. It was 3645135 bp before, but now in the new version is 9955618 bp. If you have a bit of time, can you check what might have happened?

ptrebert commented 2 years ago

I found the problem; logic error because of the different orientation flip compared to HG03492. I am rerunning things now, sorry for the delay.

pilleh commented 2 years ago

Oh come on, you are on holidays... thanks a lot for looking into it!

ptrebert commented 2 years ago

New FASTA record of the NA19239 chrY on the Globus share:

chrY.04-09.01.AMPL1-HET1_centro.FW.unassigned-0002359.NA19239   3645135 [ ... ]

So that is as it should be now, I presume.

pilleh commented 2 years ago

Hey! I'm so sorry to bother you again, this contig now otherwise looks fine, but it should still be reversed. This one: chrY.04-09.01.AMPL1-HET1_centro.FW.unassigned-0002359.NA19239

ptrebert commented 2 years ago

Pille, I may not understand what you want. The sequence has been revcomp'ed when it was renamed; this is the original sequence in the untouched Verkko assembly (first 80 symbols):

>unassigned-0002359
TAGATTTTCATAAAATTGATAGTCTCCATTTTATATTTTAAGACTGAGGAAGTTCCATATATCATTTGATTGTACTTTCA

and this is the renamed and revcomp'ed version (taken from the Globus share), part of the chrY assembly FASTA:

>chrY.04-09.01.AMPL1-HET1_centro.FW.unassigned-0002359.NA19239
ATCTCACGGTGTTGAAACTTTATTTTTATTGAGCAATTTTGAACATTCCTTTTTATAGAATCTACAAGTGGATATTTGGA

Please try to be more precise when you still think there is an error.

ptrebert commented 2 years ago

Ok, I adapted it s.t. the sequence of the contig is dumped in original orientation; this is taken from the new output file (still needs to be copied to Globus):

>chrY.04-09.01.AMPL1-HET1_centro.FW.unassigned-0002359.NA19239
TAGATTTTCATAAAATTGATAGTCTCCATTTTATATTTTAAGACTGAGGAAGTTCCATATATCATTTGATTGTACTTTCA
pilleh commented 2 years ago

Thank you :) :) :)

ptrebert commented 2 years ago

I guess this one can be closed now