Closed ptrebert closed 2 years ago
I have prototyped a very simple approach that determines contig order based on the contig alignment order, and orientation as a simple majority decision (= most of the contig aligns forward or revcomp to T2T). My proposal for the renaming would then be as follows:
Examples:
# if no two tigs span the same region(s), part number is always 1 (padded: 01)
chrY.21-23.01.AMPL7-HET3_Yq.FW.utig4-1293.HC18989
# enumerated part numbers to get a proper sort order
chrY.09-09.01.HET1_centro-HET1_centro.RV.utig4-1302.HC18989
chrY.09-09.02.HET1_centro-HET1_centro.RV.utig4-1408.HC18989
# if any: unaligned contigs get default values
# and thus end up at the end of any sorted list
chrY.99-99.00.UNK-UNK.NA.utigN-XXX.SAMPLE
@pilleh what do you think?
edit: added field 7 (sample name)
Sounds great, very informative names
I think I am going to append also the (simple) sample name at the end. This might be useful for analyses that compare all-vs-all sequences or the like...
this seems to work as intended
@ptrebert Just wanted to double-check - are you reversing (or planning to) Y contigs to the same orientation with the T2T Y? I was about to check the Y sequence classes of HG01890 as it looks pretty complete as well (and dump it to Mark&Miriam after :P), but it's in the reverse orientation. I can rev complement it unless it's easy to do it as part of the pipeline. What do you think?
Summary of the discussion about this (2022-03-29):
.RV.
as 5th component of their name (see 2nd post above) can be identified as being rc'ed relative to the original Verkko output sequence@ptrebert I've been thinking about this as well and I kept reaching the same conclusion - that it would just be easier if any chrY contigs were rc'd as early in the pipeline as possible, both in the WG and chrY sets. I don't know how to best set this up, but even re-running hmmer with the chrY motifs on whole genome assembly and RepeatMasker chrY after Y contigs are rc'd would be very useful, so at least all the coordinates would be consistent.
"the setup" is trivial, and yes, I also think a little bit of computational overhead to make everything consistent downstream is better. I had to spend a lot of time today debugging another project (to get this off my desk), but I think all of the above will be ready to update the current results starting tomorrow.
@ptrebert thanks and sounds good :) I hope I'll be able to do something more meaningful from tomorrow onwards... this week so far has almost entirely gone into replacing Qihui...
All available assemblies are now updated (wg and Y-only) with renamed and reoriented contigs if applicable. Our cluster is currently at capacity, so this may take a while.
Reported by Pille:
I have found a few chrY contigs which are mislabelled and I wonder what to do with these if anything (e.g. chrY.07-21.01.AMPL2-AMPL7.FW.unassigned-0001334.NA18989 - it actually only contains AMPL7). There are 2 others like this where just the label could be edited.
Checking the contig alignments reveals that there are two small alignments to AMPL2 (last two rows):
unassigned-0001334 1100666 0 999588 + chrY 62460029 23487298 24482110 981085 1013006 60
unassigned-0001334 1100666 1024114 1100666 - chrY 62460029 26931941 27008494 76532 76557 60
unassigned-0001334 1100666 1007007 1023490 - chrY 62460029 10311075 10319612 7171 17018 60
unassigned-0001334 1100666 1005701 1006704 - chrY 62460029 10319608 10320590 883 1018 60
Depending on how the other two cases look like (@pilleh please add them to this issue), the label could be made more appropriate by introducing more rules, e.g.
if less than X% of the contig aligns to a distant (non-neighboring) sequence class, omit this alignment from the new contig label
for the example above, ~97% of the contig aligns to AMPL7, so a possible threshold could be 5%
Hey, These are all 3 I have found so far (mislabelling): chrY.07-21.01.AMPL2-AMPL7.FW.unassigned-0001334.NA18989 - contains only AMPL7 chrY.06-21.01.XDR2-AMPL7.RV.unassigned-0009268.HG03492 - contains only AMPL7 chrY.07-21.01.AMPL2-AMPL7.RV.unassigned-0002422.HG02572 - contains only AMPL7
The one contig which should be reversed, is this: chrY.03-09.01.XTR1-HET1_centro.FW.unassigned-0009118.HG03492
If it is easy to add a rule like the one you suggest then yes sure. I suppose the risk is that it might cause problems elsewhere, but we won't know unless we try right
I'm sure overall my checks are limited as well - there is a good chance I'm not picking everything up...
For HG032492, we have this short alignment
unassigned-0009268 3464153 0 88652 - chrY 62460029 7230840 7317118 81193 90941 27
about ~2.5% of the contig length.
For HG02572, we have this:
unassigned-0002422 898225 106459 898225 - chrY 62460029 25317486 26102100 752899 823139 60
unassigned-0002422 898225 0 81936 + chrY 62460029 26926557 27008494 81912 81943 60
unassigned-0002422 898225 82560 99040 + chrY 62460029 10311075 10319612 7171 17018 60
unassigned-0002422 898225 99343 100346 + chrY 62460029 10319608 10320590 883 1018 60
about ~2% of the contig length.
I will change the renaming as follows then:
Does that make sense?
Re extracting chrX: is it sufficient to just get the contigs extracted (that would be trivial to achieve)?
Re extracting chrX: is it sufficient to just get the contigs extracted (that would be trivial to achieve)?
yes, I think this would be more than enough :)
For HG032492, we have this short alignment
unassigned-0009268 3464153 0 88652 - chrY 62460029 7230840 7317118 81193 90941 27
about ~2.5% of the contig length.
For HG02572, we have this:
unassigned-0002422 898225 106459 898225 - chrY 62460029 25317486 26102100 752899 823139 60 unassigned-0002422 898225 0 81936 + chrY 62460029 26926557 27008494 81912 81943 60 unassigned-0002422 898225 82560 99040 + chrY 62460029 10311075 10319612 7171 17018 60 unassigned-0002422 898225 99343 100346 + chrY 62460029 10319608 10320590 883 1018 60
about ~2% of the contig length.
I will change the renaming as follows then:
- compute the pct. alignment for all consecutive hits among the seq. classes
- in the cases above, XDR2 or AMPL2 are subsequences of length 1 (no neighboring seq. classes), with low alignment percentage (let's fix 5%)
- for renaming, these sequence classes are omitted
- a long contig should, however, not be split in two if there are alignment problems in small regions such as "other", i.e. AMPL1 - XTR2 - XDR2 - AMPL2 - <gap in aln / other1 is missed> - HET1_centro - XDR3 - AMPL3 should still be labeled as "AMPL1-AMPL3", and not omit either of the two sides.
Does that make sense?
Yes, I think it makes sense. There are a few samples with large inversions that kind of mess up the 'expected' sequence class order. But in case those get mislabelled, I should be able to pick it up. Will you re-run the contig labelling for all samples?
I will rerun that when the current batch is done (hopefully, that would then also include NA19317). Until then, you have time to identify more of these inverted cases, there is no other option but hard-coding those, I'm afraid
Re chrX: ok, will also be added when rerunning the entire thing.
thanks, sounds great :)
I have finished implementing the necessary changes, and the test for HG03492 gives the desired output (old vs new, all other contig names stay as-is):
9d8
< chrY.06-21.01.XDR2-AMPL7.RV.unassigned-0009268.HG03492
16a16
> chrY.21-21.04.AMPL7-AMPL7.RV.unassigned-0009268.HG03492
Of course, no way of knowing how this might affect other contig names, but hopefully the effects are limited.
identifying chrX contigs is available as of 195313a
note renamed script: determine_y_contigs.py
to identify_contigs.py
I have finished implementing the necessary changes, and the test for HG03492 gives the desired output (old vs new, all other contig names stay as-is):
9d8 < chrY.06-21.01.XDR2-AMPL7.RV.unassigned-0009268.HG03492 16a16 > chrY.21-21.04.AMPL7-AMPL7.RV.unassigned-0009268.HG03492
Of course, no way of knowing how this might affect other contig names, but hopefully the effects are limited.
Sounds good! How it affects other contig names - this should be pretty straightforward for me to check, so I'll check once you have the new names. You will re-run the whole contig identification part?
to be precise: the renaming will be re-run, but not the (sequence) identification. the computational overhead/aftermath is manageable, although not neglectable given the amount of data that we have by now
Okay, makes sense. I can just check how the naming matches with before and if there are major differences, then check more thoroughly.
I just started another test run for HC02666. The renaming (incl. chrX) now takes only a few minutes per assembly, i.e. the renamed assemblies should be available until the end of the week. I am also testing the error detection just for (haploid) chrX and chrY now. That requires a new read alignment because of the potentially changed names, so this will certainly take a few days to finish.
just for documentation purposes, and since this is the only contig that you identified as actually inverted, I hard-coded the following exception:
unassigned-0009118.HG03492
will be revcomp'ed if aligned in .FW.
orientation (that should be the case); in this case, the name will also be adapted to contain .RV.
indicating a reverse orientation relative to T2T Y@ptrebert Another sample with the large inversion in Yp - NA19239, and this contig should be reversed: chrY.04-09.01.AMPL1-HET1_centro.RV.unassigned-0002359.NA19239
ok, I have modified the renaming script again for this sample/contig and gonna rerun that.
@ptrebert thank you!
@ptrebert I'm so sorry, I was really hoping I would not have to bother you any longer. I've gone through all the other samples where the contig names/order has been changed and they all look fine. But something funny has happened with NA19239. This contig that should have been inverted (chrY.04-09.01.AMPL1-HET1_centro.RV.unassigned-0002359.NA19239), is now substantially larger in size. It was 3645135 bp before, but now in the new version is 9955618 bp. If you have a bit of time, can you check what might have happened?
I found the problem; logic error because of the different orientation flip compared to HG03492. I am rerunning things now, sorry for the delay.
Oh come on, you are on holidays... thanks a lot for looking into it!
New FASTA record of the NA19239 chrY on the Globus share:
chrY.04-09.01.AMPL1-HET1_centro.FW.unassigned-0002359.NA19239 3645135 [ ... ]
So that is as it should be now, I presume.
Hey! I'm so sorry to bother you again, this contig now otherwise looks fine, but it should still be reversed. This one: chrY.04-09.01.AMPL1-HET1_centro.FW.unassigned-0002359.NA19239
Pille, I may not understand what you want. The sequence has been revcomp'ed when it was renamed; this is the original sequence in the untouched Verkko assembly (first 80 symbols):
>unassigned-0002359
TAGATTTTCATAAAATTGATAGTCTCCATTTTATATTTTAAGACTGAGGAAGTTCCATATATCATTTGATTGTACTTTCA
and this is the renamed and revcomp'ed version (taken from the Globus share), part of the chrY assembly FASTA:
>chrY.04-09.01.AMPL1-HET1_centro.FW.unassigned-0002359.NA19239
ATCTCACGGTGTTGAAACTTTATTTTTATTGAGCAATTTTGAACATTCCTTTTTATAGAATCTACAAGTGGATATTTGGA
Please try to be more precise when you still think there is an error.
Ok, I adapted it s.t. the sequence of the contig is dumped in original orientation; this is taken from the new output file (still needs to be copied to Globus):
>chrY.04-09.01.AMPL1-HET1_centro.FW.unassigned-0002359.NA19239
TAGATTTTCATAAAATTGATAGTCTCCATTTTATATTTTAAGACTGAGGAAGTTCCATATATCATTTGATTGTACTTTCA
Thank you :) :) :)
I guess this one can be closed now
Order (orient) identified chrY contigs on the basis of the contig-to-reference alignments relative to T2T. Contigs should be renamed; coordinate with Pille about naming convention.
Pille: