michaelkyu / PlasX

PlasX, a machine learning classifier for identifying plasmid sequences based on genetic architecture
GNU General Public License v3.0

crash right after MMSeqs2 #3

Open xvazquezc opened 2 years ago

xvazquezc commented 2 years ago

Hi, I have installed PlasX as indicated (deps via conda) but it always crashes when I run search_de_novo_families. This happens after MMseqs2 runs (no problem there). I have tried changing several parameters (number of splits/no splits, location of the tmp folder...) but it always fails with the same error (see below). Any idea what it might be? Thanks, Xabi

$ plasx search_de_novo_families -db $PLASXDB -g gene-calls.txt -o de-novo-families.txt --threads $NCPUS --overwrite 
Created temporary directory: /scratch/pbs.15724.clive.ramaciotti.unsw.edu.au/tmp60s6birb
Using temporary directory: /scratch/pbs.15724.clive.ramaciotti.unsw.edu.au/tmp60s6birb (This will be deleted after execution)
0 (16:32:42), sequences: 4827161
createdb /scratch/pbs.15724.clive.ramaciotti.unsw.edu.au/tmp60s6birb/mmseqs/source_db.all.fa /scratch/pbs.15724.clive.ramaciotti.unsw.edu.au/tmp60s6birb/mmseqs/source_db.all 

[...]

Time for merging into /scratch/pbs.15724.clive.ramaciotti.unsw.edu.au/tmp60s6birb/mmseqs/source_db_h by mergeResults: 0h 0m 0s 580ms
Time for processing: 0h 0m 2s 56ms
Unique gene sequences: 4817734
THREADS******* 24
################ Running Identity 90 ###################### (16:36:33)
################ SKIPPING Identity 90 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 90: 0.00 minutes
THREADS******* 24
################ Running Identity 80 ###################### (16:36:34)
################ SKIPPING Identity 80 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 80: 0.00 minutes
THREADS******* 24
################ Running Identity 70 ###################### (16:36:34)
################ SKIPPING Identity 70 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 70: 0.00 minutes
THREADS******* 24
################ Running Identity 60 ###################### (16:36:34)
################ SKIPPING Identity 60 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 60: 0.00 minutes
THREADS******* 24
################ Running Identity 50 ###################### (16:36:34)
################ SKIPPING Identity 50 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 50: 0.00 minutes
THREADS******* 24
################ Running Identity 40 ###################### (16:36:34)
################ SKIPPING Identity 40 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 40: 0.00 minutes
THREADS******* 24
################ Running Identity 30 ###################### (16:36:34)
################ SKIPPING Identity 30 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 30: 0.00 minutes
THREADS******* 24
################ Running Identity 25 ###################### (16:36:34)
################ SKIPPING Identity 25 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 25: 0.00 minutes
THREADS******* 24
################ Running Identity 20 ###################### (16:36:34)
################ SKIPPING Identity 20 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 20: 0.00 minutes
THREADS******* 24
################ Running Identity 15 ###################### (16:36:34)
################ SKIPPING Identity 15 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 15: 0.00 minutes
THREADS******* 24
################ Running Identity 10 ###################### (16:36:34)
################ SKIPPING Identity 10 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 10: 0.00 minutes
THREADS******* 24
################ Running Identity 5 ###################### (16:36:34)
################ SKIPPING Identity 5 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34)
Total time to run identity 5: 0.00 minutes
Reading database info (16:36:34)
Deleting temporary directory: /scratch/pbs.15724.clive.ramaciotti.unsw.edu.au/tmp60s6birb
Traceback (most recent call last):
  File "/home/z3382651/miniconda3/envs/plasx/bin/plasx", line 8, in <module>
    sys.exit(run())
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/plasx_script.py", line 140, in run
    args.func(args)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/plasx_script.py", line 38, in search
    annotate_de_novo_families(args.gene_calls,
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/mmseqs.py", line 1948, in annotate_de_novo_families
    hits = process_mmseqs_merge_search(mmseqs_source_db, target_db_dir, mmseqs_dir, ident_list,
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/mmseqs.py", line 1750, in process_mmseqs_merge_search
    t_len = utils.unpickle(target_db_dir / 'rep_lengths.pkl.blp').set_index('representative')['length']
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/compress_utils.py", line 288, in unpickle
    ret =  blosc_decompress(path_or_buf, stream=stream, obj_type='pickle', verbose=verbose)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/compress_utils.py", line 256, in blosc_decompress
    header = f.read(16)
AttributeError: 'PosixPath' object has no attribute 'read'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/z3382651/miniconda3/envs/plasx/bin/plasx", line 8, in <module>
    sys.exit(run())
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/plasx_script.py", line 140, in run
    args.func(args)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/plasx_script.py", line 21, in predict
    model = PlasX_model.from_table(args.model)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/model.py", line 52, in from_table
    df = utils.read_table(path).set_index('accession')['PlasX_coefficient']
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/plasx/pd_utils.py", line 1030, in read_table
    C = pd.read_table(A, **read_table_kws)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/pandas/io/parsers.py", line 765, in read_table
    return read_csv(**locals())
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/pandas/io/parsers.py", line 452, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/pandas/io/parsers.py", line 946, in __init__
    self._make_engine(self.engine)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/pandas/io/parsers.py", line 1178, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/z3382651/miniconda3/envs/plasx/lib/python3.8/site-packages/pandas/io/parsers.py", line 2008, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 537, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 711, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
michaelkyu commented 2 years ago

Hi Xabi,

Thanks for bringing this to my attention.

Hmm, I'm not sure exactly what's happening. The line Unique gene sequences: 4817734 indicates to me that MMseqs2 is recognizing the input gene sequences. But the lines that look like ################ SKIPPING Identity 90 (file doesn't exist probably it had no clusters after merging clusters ###################### (16:36:34) mean that MMseqs2 did not find any annotations. It's also strange that MMseqs2 is running very fast: the timestamp 16:36:34 is repeated, but there should be a few minutes between each identity step. This tells me that MMseqs2 is crashing early on.
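To make that check repeatable, here is a minimal sketch that scans a saved copy of the log and reports which identity levels were started and which were skipped. The function name summarize_identity_steps is hypothetical (it is not part of PlasX), and the pattern only assumes the log line format pasted in this issue.

```python
import re

# Matches the per-identity lines pasted in this issue, for example:
# "################ SKIPPING Identity 90 (file doesn't exist ... (16:36:34)"
# Assumption: other PlasX versions may format these lines differently.
LINE_RE = re.compile(
    r"#+ (Running|SKIPPING) Identity (\d+).*\((\d{2}:\d{2}:\d{2})\)\s*$"
)


def summarize_identity_steps(log_text):
    """Return (ran, skipped): identity levels that were started and skipped."""
    ran, skipped = [], []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a per-identity status line
        status, identity, _timestamp = match.groups()
        if status == "Running":
            ran.append(int(identity))
        else:
            skipped.append(int(identity))
    return ran, skipped
```

If the two returned lists are identical, every identity level was skipped, which matches the symptom in the log above.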

As a sanity check, could you please tell me the output of conda list (run in the plasx conda environment)?

Are you running on Linux? If so, what's the output of uname -a?

I could try to debug by running your files on my end. If you're comfortable sharing files, you could email me the files gene-calls.txt and de-novo-families.txt (or just a small section of it, e.g. head -100 gene-calls.txt). My email is mikeyu@ttic.edu.

Best, Mike

xvazquezc commented 2 years ago

Hi Mike, I realised I missed pasting the end of the error (I have updated the original message). Also, I didn't mention it, but de-novo-families.txt is not generated. I tested it with a chunk of gene-calls.txt; I'll send that chunk to you for debugging (the full file is 1.4GB...).

$ conda list
# packages in environment at /home/z3382651/miniconda3/envs/plasx:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
blas                      1.0                         mkl    anaconda
blosc                     1.21.0               h9c3ff4c_0    conda-forge
bzip2                     1.0.8                h7b6447c_0    anaconda
ca-certificates           2021.10.26           h06a4308_2  
certifi                   2021.10.8        py38h578d9bd_1    conda-forge
gawk                      5.1.0                h7b6447c_0    anaconda
intel-openmp              2020.2                      254    anaconda
joblib                    0.17.0                     py_0    anaconda
ld_impl_linux-64          2.33.1               h53a641e_7    anaconda
libedit                   3.1.20191231         h14c3975_1    anaconda
libffi                    3.3                  he6710b0_2    anaconda
libgcc-ng                 11.2.0              h1d223b6_12    conda-forge
libgfortran-ng            7.3.0                hdf63c60_0    anaconda
libgomp                   11.2.0              h1d223b6_12    conda-forge
libllvm10                 10.0.1               hbcb73fb_5    anaconda
libstdcxx-ng              11.2.0              he4da1e4_12    conda-forge
llvm-openmp               8.0.1                hc9558a2_0    conda-forge
llvmlite                  0.34.0           py38h269e1b5_4    anaconda
lz4-c                     1.9.2                heb0550a_3    anaconda
mkl                       2019.4                      243    anaconda
mkl-service               2.3.0            py38he904b0f_0    anaconda
mkl_fft                   1.2.0            py38h23d657b_0    anaconda
mkl_random                1.1.0            py38h962f231_0    anaconda
mmseqs2                   10.6d92c             h2d02072_0    bioconda
ncurses                   6.2                  he6710b0_1    anaconda
numba                     0.51.2           py38h0573a6f_1    anaconda
numpy                     1.19.1           py38hbc911f0_0    anaconda
numpy-base                1.19.1           py38hfa32c7d_0    anaconda
openmp                    8.0.1                         0    conda-forge
openssl                   1.1.1m               h7f8727e_0  
pandas                    1.1.3            py38he6710b0_0    anaconda
pip                       20.2.4                   py38_0    anaconda
plasx                     0.0.0                    pypi_0    pypi
python                    3.8.5                h7579374_1    anaconda
python-blosc              1.7.0            py38h7b6447c_0  
python-dateutil           2.8.1                      py_0    anaconda
python_abi                3.8                      2_cp38    conda-forge
pytz                      2020.1                     py_0    anaconda
readline                  8.0                  h7b6447c_0    anaconda
scikit-learn              0.23.2           py38h0573a6f_0    anaconda
scipy                     1.5.2            py38h0b6359f_0    anaconda
setuptools                50.3.0           py38hb0f4dca_1    anaconda
six                       1.15.0                     py_0    anaconda
sqlite                    3.33.0               h62c20be_0    anaconda
tbb                       2020.3               hfd86e86_0    anaconda
threadpoolctl             2.1.0              pyh5ca1d4c_0    anaconda
tk                        8.6.10               hbc83047_0    anaconda
wheel                     0.35.1                     py_0    anaconda
xz                        5.2.5                h7b6447c_0    anaconda
zlib                      1.2.11               h7b6447c_3    anaconda
zstd                      1.4.5                h9ceee32_0    anaconda

Yes, I'm using CentOS on our HPC (release 7.9.2009):

$ uname -a
Linux clive.ramaciotti.unsw.edu.au 3.10.0-1160.36.2.el7.x86_64 #1 SMP Wed Jul 21 11:57:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
michaelkyu commented 2 years ago

Hi Xabi,

I was able to run plasx search_de_novo_families ... on your file successfully. Based on your log, my suspicion is that the MMseqs2 profiles for the de novo families were not downloaded correctly.

In the install location of PlasX, there should be a folder called data. That should have a subfolder PlasX_mmseqs_profiles with the following files

clu20.profile          clu25.profile.dbtype   clu30.profile_h        clu40.profile_h.index  clu50.profile.index   clu60.profile          clu70.profile.dbtype   clu80.profile_h        clu90.profile_h.index
clu20.profile.dbtype   clu25.profile_h        clu30.profile_h.index  clu40.profile.index    clu5.profile          clu60.profile.dbtype   clu70.profile_h        clu80.profile_h.index  clu90.profile.index
clu20.profile_h        clu25.profile_h.index  clu30.profile.index    clu50.profile          clu5.profile.dbtype   clu60.profile_h        clu70.profile_h.index  clu80.profile.index    rep_lengths.pkl.blp
clu20.profile_h.index  clu25.profile.index    clu40.profile          clu50.profile.dbtype   clu5.profile_h        clu60.profile_h.index  clu70.profile.index    clu90.profile          rep_lengths.txt
clu20.profile.index    clu30.profile          clu40.profile.dbtype   clu50.profile_h        clu5.profile_h.index  clu60.profile.index    clu80.profile          clu90.profile.dbtype   rep_min_align_identity.pkl.blp
clu25.profile          clu30.profile.dbtype   clu40.profile_h        clu50.profile_h.index  clu5.profile.index    clu70.profile          clu80.profile.dbtype   clu90.profile_h        rep_min_align_identity.txt
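For a quick check of the listing above, a small sketch like the following reports which expected files are missing from a folder. It assumes the listing is the complete expected set (10 identity levels with 5 files each, plus 4 rep_* files); the function names are hypothetical and not part of PlasX.

```python
from pathlib import Path

# Identity levels and suffixes copied from the file listing in this comment.
IDENTITY_LEVELS = [5, 20, 25, 30, 40, 50, 60, 70, 80, 90]
PROFILE_SUFFIXES = ["", ".dbtype", ".index", "_h", "_h.index"]
EXTRA_FILES = [
    "rep_lengths.pkl.blp",
    "rep_lengths.txt",
    "rep_min_align_identity.pkl.blp",
    "rep_min_align_identity.txt",
]


def expected_profile_files():
    """Build the list of file names shown in the listing."""
    names = [
        f"clu{level}.profile{suffix}"
        for level in IDENTITY_LEVELS
        for suffix in PROFILE_SUFFIXES
    ]
    return names + EXTRA_FILES


def missing_profile_files(profile_dir):
    """Return the sorted names of expected files not present in profile_dir."""
    profile_dir = Path(profile_dir)
    return sorted(
        name for name in expected_profile_files()
        if not (profile_dir / name).is_file()
    )
```

An empty list from missing_profile_files means every file in the listing is present in that folder.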

Can you check if these files are there? You can get the install location by running python -c "import os, plasx ; print(os.path.dirname(plasx.__file__))". If the files are there, please check that you get the same md5 checksum for this file:

$ md5sum clu5.profile
34729adeebc33298061f88487306cde0  clu5.profile
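If md5sum is not available, the same comparison can be sketched in Python with the standard hashlib module. The helper name md5_of_file is hypothetical; the expected checksum is the one quoted above.

```python
import hashlib

# Checksum quoted above for clu5.profile.
EXPECTED_CLU5_MD5 = "34729adeebc33298061f88487306cde0"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing md5_of_file("clu5.profile") against EXPECTED_CLU5_MD5 (run from inside the PlasX_mmseqs_profiles folder) gives the same answer as the md5sum command.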

If the files aren't there, then something might have gone wrong when you ran plasx setup in this part of the tutorial. In that case, could you please rerun that step and then see if plasx search_de_novo_families now works?

Best, Mike

xvazquezc commented 2 years ago

Thanks Mike. The MMSeqs2 profiles are all right. What I didn't mention explicitly is that I set up the model files in a different location. I ran plasx setup with -o /db/PlasX

When I was running plasx search_de_novo_families I was using -db /db/PlasX, but PlasX didn't accept it, even though that is the path I gave during setup. Using -db /db/PlasX/PlasX_mmseqs_profiles instead solved the issue.

Similarly, with plasx predict I had to provide the full path to the coefficients file. I also tried running it without specifying -m, since I expected the output folder given during setup to be stored as the location of the reference files, but that doesn't happen.

For confirmation, I ran the installation/setup with the same config on a different computer and the issue is the same. I'm not sure if this is the expected behaviour, but I thought I'd let you know.

michaelkyu commented 2 years ago

Ahh. Thanks for figuring this out! I will update the code so that specifying the same path for plasx setup and plasx search_de_novo_families works. I agree that this should be the expected behavior.
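
One possible shape for such a change, as a rough sketch only and not the actual PlasX code: accept either the setup output folder or the profiles subfolder, and fail with a clear message instead of the AttributeError seen in the traceback. The function name resolve_profile_dir is hypothetical; the subfolder and marker file names are taken from this thread.

```python
from pathlib import Path

PROFILE_SUBDIR = "PlasX_mmseqs_profiles"  # subfolder name mentioned in this thread
MARKER_FILE = "rep_lengths.pkl.blp"  # file the traceback tried to read


def resolve_profile_dir(db_path):
    """Accept either the setup output folder or the profiles subfolder itself."""
    db_path = Path(db_path)
    for candidate in (db_path, db_path / PROFILE_SUBDIR):
        if (candidate / MARKER_FILE).is_file():
            return candidate
    # Neither location has the marker file: report the path problem directly.
    raise FileNotFoundError(
        f"{MARKER_FILE} not found in {db_path} or {db_path / PROFILE_SUBDIR}; "
        "check the path passed to -db"
    )
```

With this, both -db /db/PlasX and -db /db/PlasX/PlasX_mmseqs_profiles would resolve to the same folder, and a wrong path would be reported before MMseqs2 starts.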