Open ptrebert opened 2 years ago
config adapted; now distinguish between all motifs and those used for chrY contig identification:
- DYZ1_Yq
- DYZ18_Yq
- DYZ3-sec_Ycentro
- DYZ2_Yq
done c05672c
@ptrebert Just wondering - do you use the '-E' parameter when you run hmmer for DYZ19 and DYZ3-prim? I know for a fact that the DYZ19 region is present in at least some of the samples (from the old hifiasm assemblies) where hmmer now doesn't identify any, so maybe it's a question of using less stringent parameters?
Well, of course I am using it as per your instructions for the HMMER run :-) Since these two motifs you mentioned are not relevant for contig identification, changing the run parameters is not much of an issue, so which parameters do you suggest here (19 and 3-prim)?
@ptrebert For DYZ19 and DYZ3-prim I've been using '-E 1.60E-15'. This seems to pick up (most) of the correct chrY copies. In some ways using DYZ3-prim might be a bit pointless - it picks up huge amounts of stuff from other chromosomes. But I guess since it's already in the pipeline, you might as well leave it in as it might still be useful at some point.
Ok, to have that info centrally available, this is set in the pipeline:
SCORE_THRESHOLDS_MOTIF = {
'DYZ1_Yq': 2500,
'DYZ18_Yq': 2100,
'DYZ2_Yq': 1700,
'DYZ3-sec_Ycentro': 1700
}
# plus HMMER E-value of: 1.60E-150
And now the E-value for the motifs DYZ19
and DYZ3-prim
should be increased to -E 1.60E-15
, correct?
@pilleh
@ptrebert Yes, it all sounds good to me. You can also add a score threshold for DYZ3-prim of 90.
I also wonder - are you keeping the original hmmer output files? Would you be able to add the parse output to Globus as well, for the future runs at least?
Ok, I adapted the thresholds and let this rerun now.
I take the "parse" output to be this one
--tblout <f> : save parseable table of hits to file <f>
in which case, yes, this will be copied for all runs (past and future) from now on.
Yes, this is great, thanks!
You find the tabular output in the .norm.tsv
files
I received feedback from one of the HMMER devs, and they have now confirmed and identified the problem. They want to work on this tomorrow. Right now, no need to look for an alternative tool.
@ptrebert sounds good. Fingers crossed they will do as they say. I've been trying to run hmmer with the Y sequence classes + repeats against hg02666 assembly to see if it would help with the annotation of Y regions. But even with super low -E threshold it does not complete in 2 days which is a bit disappointing. Anyway, just felt like complaining.
@pilleh what's actually with the "TSPY" that is still unchecked in the list above? If that has been dropped, this issue could be closed
@ptrebert Yes, sorry, this is still undefined. I have a repeat unit for it, but I want to double-check it to make sure it is fine as is or should be slightly changed. I'll try to have this done today. If it's easy for you to add this to the pipeline once I have an acceptable repeat unit, then we can do that. Or I can just run HMMER with it separately on the assemblies - not a problem either.
I don't think I will make it today :(
if it is just one more motif to add, it's no problem adding it later
@ptrebert Okay, I finally have some more repeat units for HMMER. Do you prefer to add these to the pipeline, or I can run them myself?
plus HMMER E-value of: 1.60E-200 [hopefully this will reduce the running time a bit]
I've added the fasta sequences to HHU/references: Yqhet_2.7kb.fasta Yqhet_3.1kb.fasta TSPY.fasta
I'll add these to the pipeline, unless it's so urgent that you need the results today.
Thanks :) Nope, I don't need them today. If you think the results could be ready some time let's say mid next week, then that would be awesome, as it would be nice to add something from these results to the Y update. I'll spend most of today on the grant renewal :S
the workflow engine was annoying me with name matching issues because of the "dots" (2.7kb
and 3.1kb
) in the motif names, and the easiest solution was to just rename your files:
Yqhet_2k7bp
Yqhet_3k1bp
for the record, the TSPY is so big that the HMMER runs have failed so far, so I need to submit those with much more resources. The other two are done and the results on the Globus share.
@ptrebert How about if you decrease the E-value to: 5.00E-300 This should help but not sure how much.
Another option which you might like less is running only against chrY.
@ptrebert I always feel bad to ask additional things but is it a lot of pain to run HMMER with all our repeat units and RepeatMasker also on the T2T HG002 Y sequence? I have run most of these separately on T2T Y, but probably would be better if it is done consistently, right?
I added the respective tasks to the pipeline for all HMMER motifs and RepeatMasker with T2T-HG002Y as target.
Add the following motifs to HMMER runs (various purposes) FASTA files located under
references