HMMER: additional seq motifs

ptrebert commented 2 years ago

Add the following motifs to HMMER runs (various purposes) FASTA files located under references

[x] DYZ19_Yq: located in Yq euchromatin, not for contig identification, just annotate chrY contigs
[x] DYZ3-prim_Ycentro: 171bp rep unit, not for contig identification, just annotate chrY contigs
[x] DYZ3-sec_Ycentro: 5941bp rep unit, threshold hits >1700 and use for contig identification
[ ] "TSPY": to be defined

ptrebert commented 2 years ago

config adapted; now distinguish between all motifs and those used for chrY contig identification:

- DYZ1_Yq
- DYZ18_Yq
- DYZ3-sec_Ycentro
- DYZ2_Yq

done c05672c

pilleh commented 2 years ago

@ptrebert Just wondering - do you use the '-E' parameter when you run hmmer for DYZ19 and DYZ3-prim? I know for a fact that the DYZ19 region is present in at least some of the samples (from the old hifiasm assemblies) where hmmer now doesn't identify any, so maybe it's a question of using less stringent parameters?

ptrebert commented 2 years ago

Well, of course I am using it as per your instructions for the HMMER run :-) Since these two motifs you mentioned are not relevant for contig identification, changing the run parameters is not much of an issue, so which parameters do you suggest here (19 and 3-prim)?

pilleh commented 2 years ago

@ptrebert For DYZ19 and DYZ3-prim I've been using '-E 1.60E-15'. This seems to pick up (most) of the correct chrY copies. In some ways using DYZ3-prim might be a bit pointless - it picks up huge amounts of stuff from other chromosomes. But I guess since it's already in the pipeline, you might as well leave it in as it might still be useful at some point.

ptrebert commented 2 years ago

Ok, to have that info centrally available, this is set in the pipeline:

SCORE_THRESHOLDS_MOTIF = {
    'DYZ1_Yq': 2500,
    'DYZ18_Yq': 2100,
    'DYZ2_Yq': 1700,
    'DYZ3-sec_Ycentro': 1700
}
# plus HMMER E-value of: 1.60E-150

And now the E-value for the motifs DYZ19 and DYZ3-prim should be increased to -E 1.60E-15, correct? @pilleh

pilleh commented 2 years ago

@ptrebert Yes, it all sounds good to me. You can also add a score threshold for DYZ3-prim of 90.

I also wonder - are you keeping the original hmmer output files? Would you be able to add the parse output to Globus as well, for the future runs at least?

ptrebert commented 2 years ago

Ok, I adapted the thresholds and let this rerun now.

I take the "parse" output to be this one

  --tblout <f>       : save parseable table of hits to file <f>

in which case, yes, this will be copied for all runs (past and future) from now on.

pilleh commented 2 years ago

Yes, this is great, thanks!

ptrebert commented 2 years ago

You find the tabular output in the .norm.tsv files

ptrebert commented 2 years ago

I received feedback from one of the HMMER devs, and they have now confirmed and identified the problem. They want to work on this tomorrow. Right now, no need to look for an alternative tool.

pilleh commented 2 years ago

@ptrebert sounds good. Fingers crossed they will do as they say. I've been trying to run hmmer with the Y sequence classes + repeats against hg02666 assembly to see if it would help with the annotation of Y regions. But even with super low -E threshold it does not complete in 2 days which is a bit disappointing. Anyway, just felt like complaining.

ptrebert commented 2 years ago

@pilleh what's actually with the "TSPY" that is still unchecked in the list above? If that has been dropped, this issue could be closed

pilleh commented 2 years ago

@ptrebert Yes, sorry, this is still undefined. I have a repeat unit for it, but I want to double-check it to make sure it is fine as is or should be slightly changed. I'll try to have this done today. If it's easy for you to add this to the pipeline once I have an acceptable repeat unit, then we can do that. Or I can just run HMMER with it separately on the assemblies - not a problem either.

pilleh commented 2 years ago

I don't think I will make it today :(

ptrebert commented 2 years ago

if it is just one more motif to add, it's no problem adding it later

pilleh commented 2 years ago

@ptrebert Okay, I finally have some more repeat units for HMMER. Do you prefer to add these to the pipeline, or I can run them myself?

Yqhet_2.7kb - 2.7kb repeat unit at the eu-Yqhet boundary, threshold hits >1500
Yqhet_3.1kb - 3.1kb repeat unit at the eu-Yqhet boundary, threshold hits >1500
TSPY - 20.2kb repeat unit, threshold hits >1000 [the thresholds are very generous just to be on the safe side]

plus HMMER E-value of: 1.60E-200 [hopefully this will reduce the running time a bit]

I've added the fasta sequences to HHU/references: Yqhet_2.7kb.fasta Yqhet_3.1kb.fasta TSPY.fasta

ptrebert commented 2 years ago

I'll add these to the pipeline, unless it's so urgent that you need the results today.

pilleh commented 2 years ago

Thanks :) Nope, I don't need them today. If you think the results could be ready some time let's say mid next week, then that would be awesome, as it would be nice to add something from these results to the Y update. I'll spend most of today on the grant renewal :S

ptrebert commented 2 years ago

the workflow engine was annoying me with name matching issues because of the "dots" (2.7kb and 3.1kb) in the motif names, and the easiest solution was to just rename your files:

Yqhet_2k7bp
Yqhet_3k1bp

ptrebert commented 2 years ago

for the record, the TSPY is so big that the HMMER runs have failed so far, so I need to submit those with much more resources. The other two are done and the results on the Globus share.

pilleh commented 2 years ago

@ptrebert How about if you decrease the E-value to: 5.00E-300 This should help but not sure how much.

Another option which you might like less is running only against chrY.

pilleh commented 2 years ago

@ptrebert I always feel bad to ask additional things but is it a lot of pain to run HMMER with all our repeat units and RepeatMasker also on the T2T HG002 Y sequence? I have run most of these separately on T2T Y, but probably would be better if it is done consistently, right?

ptrebert commented 2 years ago

I added the respective tasks to the pipeline for all HMMER motifs and RepeatMasker with T2T-HG002Y as target.

marschall-lab / project-male-assembly

HMMER: additional seq motifs #5