Open xvazquezc opened 2 years ago
Hi Xabi,
Thanks for bringing this to my attention.
Hmm, I'm not sure exactly what's happening. The line "Unique gene sequences: 4817734" indicates to me that MMseqs2 is recognizing the input gene sequences. But the lines that look like
################ SKIPPING Identity 90 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
mean that no annotations were successfully found through MMseqs2. It's also strange that MMseqs2 is running very fast: the timestamp 16:36:34 is repeated, but there should be a few minutes between them. This tells me that MMseqs2 is crashing early on.
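The timestamp check described above can be done mechanically on a saved log. The following is a minimal sketch, assuming the log lines follow the format of the single SKIPPING line quoted above; the function names and the regular expression are illustrative and are not part of PlasX.

```python
import re

# Assumed format: "... SKIPPING Identity <N> ... (<HH:MM:SS>)", based on the
# one log line quoted in this thread. Other log formats will not match.
SKIP_PATTERN = re.compile(r"SKIPPING Identity (\d+).*\((\d{2}:\d{2}:\d{2})\)")


def skipped_identities(log_text):
    """Return (identity, timestamp) pairs for every SKIPPING line in the log."""
    pairs = []
    for line in log_text.splitlines():
        match = SKIP_PATTERN.search(line)
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


def skips_share_one_timestamp(log_text):
    """True if several identity levels were skipped at the same timestamp.

    That pattern suggests the search step ended almost immediately rather
    than taking minutes per identity level.
    """
    skips = skipped_identities(log_text)
    return len(skips) > 1 and len({stamp for _, stamp in skips}) == 1
```

If skips_share_one_timestamp returns True for a log, the search most likely stopped before doing any real work, which matches the diagnosis above.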
As a sanity check, could you please tell me the output of conda list (run in the plasx conda environment)?
Are you running on Linux? If so, what's the output of uname -a?
I could try to debug by running your files on my end. If you're comfortable sharing files, you could email me the files gene-calls.txt and de-novo-families.txt (or just a small section of them, e.g. head -100 gene-calls.txt). My email is mikeyu@ttic.edu.
Best, Mike
Hi Mike,
I realised I missed pasting the end of the error (I updated the original message). Also, I didn't mention it, but de-novo-families.txt is not generated. I tested it with a chunk of gene-calls.txt. I'll send it to you for debugging (the full file is 1.4GB...).
$ conda list
# packages in environment at /home/z3382651/miniconda3/envs/plasx:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
blas 1.0 mkl anaconda
blosc 1.21.0 h9c3ff4c_0 conda-forge
bzip2 1.0.8 h7b6447c_0 anaconda
ca-certificates 2021.10.26 h06a4308_2
certifi 2021.10.8 py38h578d9bd_1 conda-forge
gawk 5.1.0 h7b6447c_0 anaconda
intel-openmp 2020.2 254 anaconda
joblib 0.17.0 py_0 anaconda
ld_impl_linux-64 2.33.1 h53a641e_7 anaconda
libedit 3.1.20191231 h14c3975_1 anaconda
libffi 3.3 he6710b0_2 anaconda
libgcc-ng 11.2.0 h1d223b6_12 conda-forge
libgfortran-ng 7.3.0 hdf63c60_0 anaconda
libgomp 11.2.0 h1d223b6_12 conda-forge
libllvm10 10.0.1 hbcb73fb_5 anaconda
libstdcxx-ng 11.2.0 he4da1e4_12 conda-forge
llvm-openmp 8.0.1 hc9558a2_0 conda-forge
llvmlite 0.34.0 py38h269e1b5_4 anaconda
lz4-c 1.9.2 heb0550a_3 anaconda
mkl 2019.4 243 anaconda
mkl-service 2.3.0 py38he904b0f_0 anaconda
mkl_fft 1.2.0 py38h23d657b_0 anaconda
mkl_random 1.1.0 py38h962f231_0 anaconda
mmseqs2 10.6d92c h2d02072_0 bioconda
ncurses 6.2 he6710b0_1 anaconda
numba 0.51.2 py38h0573a6f_1 anaconda
numpy 1.19.1 py38hbc911f0_0 anaconda
numpy-base 1.19.1 py38hfa32c7d_0 anaconda
openmp 8.0.1 0 conda-forge
openssl 1.1.1m h7f8727e_0
pandas 1.1.3 py38he6710b0_0 anaconda
pip 20.2.4 py38_0 anaconda
plasx 0.0.0 pypi_0 pypi
python 3.8.5 h7579374_1 anaconda
python-blosc 1.7.0 py38h7b6447c_0
python-dateutil 2.8.1 py_0 anaconda
python_abi 3.8 2_cp38 conda-forge
pytz 2020.1 py_0 anaconda
readline 8.0 h7b6447c_0 anaconda
scikit-learn 0.23.2 py38h0573a6f_0 anaconda
scipy 1.5.2 py38h0b6359f_0 anaconda
setuptools 50.3.0 py38hb0f4dca_1 anaconda
six 1.15.0 py_0 anaconda
sqlite 3.33.0 h62c20be_0 anaconda
tbb 2020.3 hfd86e86_0 anaconda
threadpoolctl 2.1.0 pyh5ca1d4c_0 anaconda
tk 8.6.10 hbc83047_0 anaconda
wheel 0.35.1 py_0 anaconda
xz 5.2.5 h7b6447c_0 anaconda
zlib 1.2.11 h7b6447c_3 anaconda
zstd 1.4.5 h9ceee32_0 anaconda
Yes, I'm using CentOS on our HPC (release 7.9.2009):
$ uname -a
Linux clive.ramaciotti.unsw.edu.au 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Hi Xabi,
I was able to run plasx search_de_novo_families ... on your file successfully. Based on your log, my suspicion is that the MMseqs2 profiles for the de novo families were not downloaded correctly.
In the install location of PlasX, there should be a folder called data. That should have a subfolder PlasX_mmseqs_profiles with the following files:
clu20.profile clu25.profile.dbtype clu30.profile_h clu40.profile_h.index clu50.profile.index clu60.profile clu70.profile.dbtype clu80.profile_h clu90.profile_h.index
clu20.profile.dbtype clu25.profile_h clu30.profile_h.index clu40.profile.index clu5.profile clu60.profile.dbtype clu70.profile_h clu80.profile_h.index clu90.profile.index
clu20.profile_h clu25.profile_h.index clu30.profile.index clu50.profile clu5.profile.dbtype clu60.profile_h clu70.profile_h.index clu80.profile.index rep_lengths.pkl.blp
clu20.profile_h.index clu25.profile.index clu40.profile clu50.profile.dbtype clu5.profile_h clu60.profile_h.index clu70.profile.index clu90.profile rep_lengths.txt
clu20.profile.index clu30.profile clu40.profile.dbtype clu50.profile_h clu5.profile_h.index clu60.profile.index clu80.profile clu90.profile.dbtype rep_min_align_identity.pkl.blp
clu25.profile clu30.profile.dbtype clu40.profile_h clu50.profile_h.index clu5.profile.index clu70.profile clu80.profile.dbtype clu90.profile_h rep_min_align_identity.txt
Can you check if these files are there? You can get the install location by running python -c "import os, plasx ; print(os.path.dirname(plasx.__file__))". If the files are there, please check that you get the same md5 checksum as this file:
$ md5sum clu5.profile
34729adeebc33298061f88487306cde0 clu5.profile
If the files aren't there, then something might have gone wrong when you ran plasx setup in this part of the tutorial. In that case, could you please rerun that step and then see if plasx search_de_novo_families now works?
Best, Mike
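The two checks above (all files present, checksum matches) can also be scripted. This is a minimal sketch, not part of PlasX: the identity levels, suffixes, and extra file names are read off the directory listing above, and the helper names are made up for illustration.

```python
import hashlib
from pathlib import Path

# Values taken from the directory listing in this thread.
IDENTITIES = (5, 20, 25, 30, 40, 50, 60, 70, 80, 90)
SUFFIXES = ("", ".dbtype", ".index", "_h", "_h.index")
EXTRA_FILES = (
    "rep_lengths.pkl.blp",
    "rep_lengths.txt",
    "rep_min_align_identity.pkl.blp",
    "rep_min_align_identity.txt",
)


def expected_profile_files():
    """Return the file names shown in the listing (5 per identity plus 4 extras)."""
    names = [f"clu{i}.profile{s}" for i in IDENTITIES for s in SUFFIXES]
    return names + list(EXTRA_FILES)


def missing_profile_files(profile_dir):
    """Return the expected file names that are absent from profile_dir."""
    profile_dir = Path(profile_dir)
    return [n for n in expected_profile_files() if not (profile_dir / n).is_file()]


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

To use it, pass the PlasX_mmseqs_profiles folder to missing_profile_files and compare md5_of applied to clu5.profile against the checksum quoted above (34729adeebc33298061f88487306cde0). An empty list and a matching digest mean the download looks intact.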
Thanks Mike. The MMseqs2 profiles are all right. What I didn't mention explicitly is that I set up the model files in a different location. I ran plasx setup with -o /db/PlasX. When I was running plasx search_de_novo_families I was using -db /db/PlasX, but PlasX didn't accept it, even though that is the path I gave during setup. Using -db /db/PlasX/PlasX_mmseqs_profiles solved the issue.
Similarly, with plasx predict I had to provide the full path to the coefficients file. I also tried to run it without specifying -m, as I expected that the output folder given during setup would be stored as the location for the reference files, but that doesn't happen.
For confirmation, I ran the installation/setup with the same config on a different computer and the issue is the same. Not sure if it's the expected behaviour, but I thought I'd let you know.
Ahh. Thanks for figuring this out! I will update the code so that specifying the same path for plasx setup and plasx search_de_novo_families works. I agree that this should be the expected behavior.
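One way the path handling discussed above could behave is sketched below. This is only an illustration under stated assumptions, not the actual PlasX fix: resolve_profile_dir is a hypothetical helper, and the only detail taken from the thread is the subfolder name PlasX_mmseqs_profiles.

```python
from pathlib import Path

# Subfolder name mentioned in this thread.
PROFILE_SUBDIR = "PlasX_mmseqs_profiles"


def resolve_profile_dir(db_path):
    """Hypothetical helper: accept either the setup folder or the profile folder.

    If db_path contains a PlasX_mmseqs_profiles subfolder, return that
    subfolder; otherwise assume db_path already is the profile folder.
    """
    db_path = Path(db_path)
    candidate = db_path / PROFILE_SUBDIR
    if candidate.is_dir():
        return candidate
    if db_path.is_dir():
        return db_path
    raise FileNotFoundError(f"No such directory: {db_path}")
```

With a helper like this, both /db/PlasX and /db/PlasX/PlasX_mmseqs_profiles would resolve to the same folder, so the path given to setup could be reused for the search.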
Hi, I have installed PlasX as indicated (deps via conda) but it always crashes when I run search_de_novo_families. This happens after MMseqs2 runs (no problem there). I have tested changing several parameters (number of splits/no splits, location of tmp folder...) but it always fails with the same error (see below). Any idea what it might be? Thanks, Xabi