ufcg train: .hmm files empty, markers missing from assemblies

alisqq commented 1 year ago

Hello,

I created a phylogenetic tree from 68 genome assemblies using ufcg profile & ufcg tree and the pipeline worked perfectly! Thank you! However, I wanted to add few markers that I know would help with resolving the phylogeny of the group I am working on.

To be exact, I wanted to use ufcg to create hmm profiles of 6 genes. For each marker I have around 250 fasta sequences. I tried to run different combinations of the following command ufcg train -i LOCI4TRAINING -g ../assemblies_renamed/ -o LOCI_HMM -t 8 -v and every genome from "assemblies renamed" is missing every marker and the .hmm files are empty.

The markers are extracted from the fungal genomes from the same family I want to resolve. I also tried to use one of the "core genes" that was extracted by ufcg profile from the assemblies and it also says that it's missing from all of the assemblies when I use it as the argument for ufcg train -i.

Could you point me to where I am making a mistake? Even if the markers are not present in any of the genome (which I doubt, especially with the core gene test), why are the .hmm profiles empty?

All best, Alis

endixk commented 1 year ago

Hello, thank you for using our pipeline and for your positive feedback! 😄

I apologize for the difficulty you encountered. Although I couldn't reproduce the error on my system using a toy dataset, I suspect that the module failed to build a multiple sequence alignment from the marker gene sequences you provided.

Please ensure that each FASTA file contains all the (putatively) homologous sequences of the marker gene you intend to train. Additionally, please verify that all the required binaries (fastBlockSearch, augustus, prepareAlign, msa2prfl.pl, mmseqs, and mafft) are properly listed on your environmental path.

If the issue persists, it would be greatly helpful if you could provide a subset of your data or any form of a toy dataset that can reproduce the problem via my email: endix1029@gmail.com.

Alternatively, you can run the pipeline with the --dev option, which will display all the commands the pipeline executes. This way, you can trace the steps and identify which specific step produces erroneous results. It's recommended to use a smaller dataset if you want to debug the process yourself, as the program will print out more detailed information.

Please let me know if there's anything else I can help you with.

alisqq commented 1 year ago

Thanks for the reply!

I run the command in developer mode on toy dataset as you suggested and it didn't throw any errors but .hmm profiles were empty as before so I tried running commands separately. Here are the results:

mafft seemed to work correctly mafft --auto EF1.00001.fa > EF1.00001.mafft
prepareAlign < EF1.00001.mafft > EF1.00001.pa.fa

PA_FULL_COL_WEIGHT=0.8 (default value)
PA_SKIP_COL_WEIGHT=0.2 (default value)
PA_MINSIZE=6 (default value)
PA_MIN_COL_COUNT=0 (default value)
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.75
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.7
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.65
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.6
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.55
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.5
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.45
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.4
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.35
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.3
ERROR: maximal reachable width is 0. Reduce PA_FULL_COL_WEIGHT.
Trying PA_FULL_COL_WEIGHT=0.25
ERROR: Finally giving up. Look at this file manually!

msa2prfl.pl < EF1.00001.pa.fa > EF1.00001.hmm

Qij file 'default.qij' could not be read.
Check the parameter --qij=<filename>.

Died at /home/users/alis/miniconda3/bin/msa2prfl.pl line 358.

I am not sure if there is sth wrong with the input file or somewhere along the way so I attach the input file and the ones created by the pipeline.

EF1.00001.fa.txt EF1.00001.mafft.txt EF1.00001.pa.fa.txt

endixk commented 1 year ago

Currently the module would accept and generate the models for protein sequences, however, your input contains nucleotide sequences. For train module I did not yet implement an auto-translate feature for the nucleotide inputs, so for now you may have to provide translated sequences as an input for now. Sorry for the inconvenience.

alisqq commented 1 year ago

Oh! I didn't realize this. I will translate my sequences and try again. Thank you very much for the answer!

steineggerlab / ufcg

ufcg train: .hmm files empty, markers missing from assemblies #17