replikation / What_the_Phage

WtP: Phage identification via nextflow and docker or singularity
https://mult1fractal.github.io/wtp-documentation/
GNU General Public License v3.0

too slow #123

Closed. ucassee closed this issue 3 years ago

ucassee commented 3 years ago

Hi,

I use the following command to run WtP with an input fasta file (~60 MB):

nextflow run replikation/What_the_Phage --fasta wtp/all_combined.fasta --databases nextflow-autodownload-databases --cachedir singularity_images --output wtpresult --cores 20 -profile local,singularity -r v1.0.0

I noticed that the hmmsearch step runs with only one thread. Is there a configuration file I should modify to speed up this process?
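For reference, per-process CPU allocation in a Nextflow pipeline can usually be overridden with a small extra config passed via -c; the sketch below uses a hypothetical process selector, so the actual WtP process names would have to be looked up in the pipeline's own configs:

# write a small override config ('virsorter2' is a placeholder process name)
cat > more_cpus.config <<'EOF'
process {
    withName: 'virsorter2' {
        cpus = 8
    }
}
EOF

# pass it to the run alongside the usual flags
nextflow run replikation/What_the_Phage --fasta wtp/all_combined.fasta --databases nextflow-autodownload-databases --cachedir singularity_images --output wtpresult --cores 20 -profile local,singularity -r v1.0.0 -c more_cpus.config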

It finished with errors; I attached the report: execution_report.zip

When I used the same command on a cluster (PBS system), it showed the following error:

Command error:
  INFO:    Convert SIF file to sandbox...
  ERROR  : Failed to create user namespace: user namespace disabled

How can I debug?

Thanks

replikation commented 3 years ago

edit: @ucassee I think you found a bug in our CPU/RAM config :) We will check and report back. Thanks

replikation commented 3 years ago

The issue was a missing hardware configuration for VirSorter 2. We will push a hotfix release today to fix it.

mult1fractal commented 3 years ago

Hey,

We fixed the config files where the VirSorter 2 hardware configuration was missing.

ucassee commented 3 years ago

Hi @replikation, I am not sure which process it was, but the command looks like this:

hmmsearch -T 30 --tblout iter-0/all.pdg.faa.splitdir/all.pdg.faa.ss.1.split.Viruses.splithmmtbl --cpu 1 --noali -o /dev/null /db/hmm/viral/combined.hmm /tmp/vs2-K6zvoLzjXZlu/all.pdg.faa.ss.1.split

I will try the new version.

ucassee commented 3 years ago

Hi, there is still an error when I use an unprivileged account. How can I debug this?

Error executing process > 'identify_fasta_MSF:fasta_validation_wf:input_suffix_check (1)'

Caused by:
  Process `identify_fasta_MSF:fasta_validation_wf:input_suffix_check (1)` terminated with an error exit status (1)

Command executed:

  case "test.fasta" in
      *.gz) 
          zcat test.fasta > test.fa
          ;;
      *.fna)
          cp test.fasta test.fa
          ;;
      *.fasta)
          cp test.fasta test.fa
          ;;
      *.fa)
          ;;
      *)
          echo "file format not supported...what the phage...(.fa .fasta .fna .gz is supported)"
          exit 1
  esac

  # tr whitespace at the end of lines
  sed 's/[[:blank:]]*$//' -i test.fa
  # remove ' and "
  tr -d "'"  < test.fa | tr -d '"' | tr -d "[]" > tmp.file && mv tmp.file test.fa
  # replace ( ) | . , / and whitespace with _
  sed 's#[()|.,/ ]#_#g' -i test.fa
  # remove empty lines
  sed '/^$/d' -i test.fa

Command exit status:
  1

Command output:
  (empty)

Command error:
  INFO:    Convert SIF file to sandbox...
  ERROR  : Failed to create user namespace: user namespace disabled

Thanks

replikation commented 3 years ago

I think that has to be configured on the cluster-admin side of things (https://github.com/hpcng/singularity/issues/5240).
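For what it's worth, on CentOS/RHEL-type clusters the admins usually have to enable unprivileged user namespaces explicitly; a quick, read-only check you could run (or pass on to the admins) might look like this:

# 0 (or a missing key) means unprivileged user namespaces are disabled for normal users
sysctl user.max_user_namespaces
cat /proc/sys/user/max_user_namespaces

# enabling them is an admin task, roughly (run as root; the exact value and policy are up to the admins):
# echo "user.max_user_namespaces=15000" >> /etc/sysctl.d/90-userns.conf && sysctl -p /etc/sysctl.d/90-userns.conf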

ucassee commented 3 years ago

Hi, I ran into a new error. How can I debug it? Thanks.

Error executing process > 'identify_fasta_MSF:fasta_validation_wf:input_suffix_check (1)'

Caused by:
  Process `identify_fasta_MSF:fasta_validation_wf:input_suffix_check (1)` terminated with an error exit status (255)

Command executed:

  case "all_pos_phage.fa" in
      *.gz) 
          zcat all_pos_phage.fa > all_pos_phage.fa
          ;;
      *.fna)
          cp all_pos_phage.fa all_pos_phage.fa
          ;;
      *.fasta)
          cp all_pos_phage.fa all_pos_phage.fa
          ;;
      *.fa)
          ;;
      *)
          echo "file format not supported...what the phage...(.fa .fasta .fna .gz is supported)"
          exit 1
  esac

  # tr whitespace at the end of lines
  sed 's/[[:blank:]]*$//' -i all_pos_phage.fa
  # remove ' and "
  tr -d "'"  < all_pos_phage.fa | tr -d '"' | tr -d "[]" > tmp.file && mv tmp.file all_pos_phage.fa
  # replace ( ) | . , / and whitespace with _
  sed 's#[()|.,/ ]#_#g' -i all_pos_phage.fa
  # remove empty lines
  sed '/^$/d' -i all_pos_phage.fa

Command exit status:
  255

Command output:
  (empty)

Command error:
  INFO:    Convert SIF file to sandbox...
  FATAL:   while extracting /data/database/wtp/singularity_images/nanozoo-basics-1.0--962b907.img: root filesystem extraction failed: could not extract squashfs data, unsquashfs not found

Work dir:
   /data/Project/1.Mariana/4.virus/wtptempt/06/a6e46c15f82e227d71fd2c533ae0e1

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

hoelzer commented 3 years ago

This still looks like a problem with Singularity on your system:

Command error:
  INFO:    Convert SIF file to sandbox...
  FATAL:   while extracting /data/database/wtp/singularity_images/nanozoo-basics-1.0--962b907.img: root filesystem extraction failed: could not extract squashfs data, unsquashfs not found

What is your version?

singularity --version

Was Singularity installed by a system administrator and configured appropriately?

You can also test whether Singularity works outside of WtP's Nextflow framework:

singularity run /data/database/wtp/singularity_images/nanozoo-basics-1.0--962b907.img wget --version

ucassee commented 3 years ago

Hi @hoelzer, my Singularity version is 3.6.3. I installed it with conda under my own account.

When I run singularity run /data/database/wtp/singularity_images/nanozoo-basics-1.0--962b907.img wget --version, I get the same error:

INFO:    Convert SIF file to sandbox...
FATAL:   while extracting /data/database/wtp/singularity_images/nanozoo-basics-1.0--962b907.img: root filesystem extraction failed: could not extract squashfs data, unsquashfs not found 

hoelzer commented 3 years ago

@ucassee okay the version should be fine.

But I have experienced issues in the past when installing Singularity via conda on an HPC system where I'm not root. Are you running the pipeline on an (administrated) cluster machine? HPC, workstation, ... or similar?

If so, I think you should ask your system admin to install Singularity properly with root access. E.g., I did it using the following manual/notes on my local machine:

https://hackmd.io/@GqOnlbqgSdKAMwgCUU_ljQ/rklBUfXRD
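Side note on the FATAL message itself: "unsquashfs not found" means the squashfs-tools binary is missing from whatever PATH your conda-installed Singularity sees. A quick check (assuming the conda-based install) would be:

# is unsquashfs visible at all?
which unsquashfs || echo "unsquashfs is not on PATH"

# with a conda-based Singularity, installing squashfs-tools into the same environment may already help
conda install -c conda-forge squashfs-tools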

ucassee commented 3 years ago

Hi @hoelzer, I will try to contact the system admin of our cluster. But when I run WtP on our workstation, it still errors out with some virus prediction tools. I uploaded one report. Can I debug this, or is it okay to ignore them?

Thanks execution_report.zip

hoelzer commented 3 years ago

@ucassee it looks like three virus prediction tools failed.

Although that is not nice, it can happen, depending on your input, that some tools will not work. But WtP will run through anyway with the other tools. When you run the test profile on your workstation, do these three tools work in general? Then it's fine and there is no need for debugging.

replikation commented 3 years ago

We can take a look at this, but as Martin mentioned, we "autoskip" tools if they fail for various reasons, so you get actual results and are not annoyed with tons of bugs :)

We would need the "temporary dirs" to check out what's going on.

the following dirs would be of interest (located in the work dir):

40/f579e2*
8f/67a0ad*
4c/752a29*

Inside are hidden files like .command.log. An ls -lah per dir would also be great, so you don't need to send us the whole fasta input but we still know which files were present during the error.
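Something like the following (assuming the default work/ directory and the task hashes above) would collect the listings and hidden logs in one go:

# gather directory listings and the hidden Nextflow logs for the three failed tasks
for d in work/40/f579e2* work/8f/67a0ad* work/4c/752a29*; do
    echo "== $d =="
    ls -lah "$d"
    cat "$d"/.command.log
done > wtp_debug_info.txt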

thanks

ucassee commented 3 years ago

Hi @replikation @hoelzer

Thanks for your reply. If you need any other files, please let me know. virnet.zip pprmeta.zip seeker_wf.zip

ucassee commented 3 years ago

Hi, I ran the test profile, but there is still one error. I attached the report. Thanks. phigaro.zip execution_report.zip

mult1fractal commented 3 years ago

I will look into it tomorrow

mult1fractal commented 3 years ago

Hey, unfortunately I was not able to reproduce your error with:

nextflow run phage.nf --cores 16 -profile local,smalltest,singularity

I checked the .command.err in your phigaro.zip file: WARNING: underlay of /etc/localtime required more than 50 (95) bind mounts

It seems this is linked to CentOS and using singularity... I will try to find a solution for this

ucassee commented 3 years ago

Hi @mult1fractal, thanks for your effort. When you solve it, please let me know. If I find any clues, I will also report them here.

ucassee commented 3 years ago

Hi @mult1fractal, WtP seems to run all identifiers in parallel. I used a local server to run it and saw a heavy load at the beginning. When I use a bigger assembly file (>500 MB), my server sometimes crashes and restarts. Is this related to the errors I reported before? Thanks

replikation commented 3 years ago

@ucassee Could you please provide the command you used? The amount of parallel runs is basically controlled via the --cores flag in relation to --max_cores on a local run.
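For illustration, on a local profile the two flags would be combined roughly like this (the values are just examples; --cores is understood here as the per-process limit and --max_cores as the overall limit, as described above):

# give each tool up to 8 CPUs, but never schedule more than 40 CPUs in total
nextflow run replikation/What_the_Phage --fasta input.fasta --cores 8 --max_cores 40 -profile local,singularity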

ucassee commented 3 years ago

Hi @replikation, the following is the command:

nextflow run /data2017/.nextflow/assets/replikation/What_the_Phage --fasta ${i} \
             --cachedir /data2017/database/wtp/singularity_images \
             --databases /data2017/database/wtp/nextflow-autodownload-databases \
             --output wtpresult/${n} \
             --workdir wtptempt  \
             --cores 6 \
             -profile local,singularity \
             --filter 10000 --identify 

My server has a maximum of 80 threads, but the CPU load average can reach 130 at the beginning of the run.

ucassee commented 3 years ago

Hi all,

The .command.sh of the phigaro process is the following:

#!/bin/bash -ue
phigaro -f Dive121-T2_filtered.fa -o output -t 6 --wtp --config /root/.phigaro/config.yml
cat output/phigaro.txt > output/phigaro_${PWD##*/}.txt 
echo "" >> output/phigaro_${PWD##*/}.txt

But there is no config file at /root/.phigaro/config.yml. Is this also related to the error I reported before?

ucassee commented 3 years ago

The phigaro error can be worked around by running phigaro-setup inside the Singularity environment, as sketched below.
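A rough sketch of that workaround (no extra flags assumed; where the config ends up depends on how the image is bind-mounted):

# regenerate the phigaro config from inside the container; by default it is written to the current user's home
singularity exec multifractal-phigaro-0.5.2.img phigaro-setup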

But the problem of the identifiers running in parallel persists, even when I set --max_cores 60 (see the top output below and the sketch after it).

I attached the screen output from top:

top - 21:20:11 up  1:32,  5 users,  load average: 116.63, 92.43, 53.11
Tasks: 1024 total,  14 running, 909 sleeping, 101 stopped,   0 zombie
%Cpu(s): 86.7 us, 11.2 sy,  0.0 ni,  2.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13207409+total, 99502976+free, 22430808 used, 30328035+buff/cache
KiB Swap: 98302976 total, 98302976 free,        0 used. 12958008+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                                 
106892 root      20   0 6372172 625320   7704 R  1319  0.0   1:42.10 python                                                                                                                                                                                                  
106895 root      20   0 6363724 619036   7676 R  1311  0.0   1:43.63 python                                                                                                                                                                                                  
106866 root      20   0 6368076 619040   7600 R  1291  0.0   1:41.38 python                                                                                                                                                                                                  
106888 root      20   0 6372172 617180   7744 R  1190  0.0   1:40.34 python                                                                                                                                                                                                  
 71231 root      20   0   24.5g   5.2g  79216 S 795.7  0.4  14:41.23 python3                                                                                                                                                                                                 
138375 root      20   0 1649704 176840   1316 R 155.7  0.0  48:44.64 hmmsearch                                                                                                                                                                                               
132719 root      20   0 1644468 166392   1316 R 146.2  0.0  32:39.79 hmmsearch                                                                                                                                                                                               
 69196 root      20   0   99036  21540   1228 S  96.7  0.0  11:20.96 hmmsearch                                                                                                                                                                                               
 68296 root      20   0   99720  24880   1228 R  92.8  0.0  11:22.26 hmmsearch                                                                                                                                                                                               
 69249 root      20   0  103696  26348   1228 R  89.8  0.0  11:12.14 hmmsearch                                                                                                                                                                                               
 97654 root      20   0  106324  29700   1228 R  87.2  0.0   9:13.81 hmmsearch                                                                                                                                                                                               
 68221 root      20   0   90276  12448   1228 R  86.6  0.0  11:32.91 hmmsearch                                                                                                                                                                                               
 68300 root      20   0  103144  26796   1228 R  86.2  0.0  11:18.66 hmmsearch                                                                                                                                                                                               
 68348 root      20   0   95124  19284   1228 S  86.2  0.0  11:03.74 hmmsearch                                                                                                                                                                                               
 69250 root      20   0   92940  13728   1228 S  83.9  0.0  11:24.31 hmmsearch                                                                                                                                                                                               
 69251 root      20   0   89316  13316   1228 R  83.9  0.0  10:55.22 hmmsearch                                                                                                                                                                                               
104050 root      20   0   94460  16416   1228 S  83.9  0.0   8:58.32 hmmsearch                                                                                                                                                                                               
 16320 root      20   0  972040 180936  79728 S  77.0  0.0  50:34.07 blastn                                                                                                                                                                                                  
 65276 root      20   0  115900  39780   1216 S  77.0  0.0  15:12.48 hmmsearch                                                                                                                                                                                               
 47826 root      20   0  114560  38300   1216 S  74.8  0.0  15:39.90 hmmsearch                                                                                                                                                                                               
 61666 root      20   0  119376  39856   1216 S  74.8  0.0  15:20.87 hmmsearch                                                                                                                                                                                               
 91333 root      20   0  120196  42424   1216 S  74.4  0.0  14:09.76 hmmsearch                                                                                                                                                                                               
 75127 root      20   0  113324  35476   1216 S  71.5  0.0  14:50.19 hmmsearch                                                                                                                                                                                               
 66739 root      20   0  122700  36488   1216 S  67.9  0.0  15:15.71 hmmsearch                                                                                                                                                                                               
 45614 root      20   0  135300  50320   1216 S  66.6  0.0  15:36.20 hmmsearch                                                                                                                                                                                               
 72592 root      20   0  131512  55280   1216 S  65.2  0.0  14:57.75 hmmsearch                                                                                                                                                                                               
108818 root      20   0   26252  12840   4468 R   4.9  0.0   0:00.15 sourmash 
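If the load still spikes like this, a generic Nextflow-side knob (treat this as an assumption about what helps here, since WtP's --max_cores may already map to it internally) is to cap the total CPUs the local executor will schedule via an extra config file:

# throttle.config caps the total CPUs Nextflow's local executor hands out across concurrent tasks
cat > throttle.config <<'EOF'
executor {
    cpus = 40
}
EOF
# then append "-c throttle.config" to the usual nextflow run command

Note that tools which ignore their declared CPU allocation inside the container can still oversubscribe the machine, which would match the load average shown above.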

ucassee commented 3 years ago

The TensorFlow version in the pprmeta, virnet, and seeker images is 2.3, but older CPUs that don't support AVX will fail with an Illegal instruction (core dumped) error. Please see https://github.com/tensorflow/tensorflow/issues/17411

I suggest you use tensorflow==1.5 to rebuild the images for compatibility with older CPUs. Thanks
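For anyone hitting the same crash, checking whether a host CPU advertises AVX is a one-liner:

# prints the number of CPU entries listing the avx flag; 0 means no AVX, so the prebuilt TensorFlow 2.x wheels in these images will die with "Illegal instruction"
grep -wc avx /proc/cpuinfo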

mult1fractal commented 3 years ago

Okay... for virnet, pprmeta, and seeker: I can try to build the images with tensorflow==1.5, but I'm not sure if the tools will work with this version of TensorFlow.

For Phigaro: I'm not able to reproduce this error, neither with the command you posted above nor with the command I used:

nextflow run replikation/What_the_Phage --cores 16 -profile local,smalltest,singularity --dv --ma --mp --pp --sm --vf --vn --vs --vs2 --sk --vb --cachedir singularity_images/ --identify -r v1.0.1

I will try both commands with a larger input file; maybe that is what causes the issue.

ucassee commented 3 years ago

Hi @mult1fractal, for the Phigaro error, I used singularity run multifractal-phigaro-0.5.2.img and phigaro-setup to generate the config file /root/.phigaro/config.yml.

I used a new server that supports AVX, so I can get results from all wrapped tools for small input files. But for a larger input file there is still an error:

WARNING: underlay of /etc/localtime required more than 50 (93) bind mounts
Using TensorFlow backend.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
WARNING: underlay of /etc/localtime required more than 50 (93) bind mounts
Using TensorFlow backend.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2021-02-02 07:13:17.357673: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-02-02 07:13:17.416864: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1895280000 Hz
2021-02-02 07:13:17.431704: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a62670 executing computations on platform Host. Devices:
2021-02-02 07:13:17.431779: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2021-02-02 07:13:17.580242: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Starting VirNet
Loading Data TS01-B03_fragments.fasta
Loaded 8256 fragments
Loading Tokenizer
Start Predictions

1024/8256 [==>...........................] - ETA: 35s
2048/8256 [======>.......................] - ETA: 30sTraceback (most recent call last):
  File "/virnet/predict.py", line 51, in <module>
    main()
  File "/virnet/predict.py", line 47, in main
    predictions=run_pred(model,x_data)
  File "/virnet/predict.py", line 20, in run_pred
    y_prop=model.predict(input_data)
  File "/virnet/NNClassifier.py", line 100, in predict
    return self.model.predict([X],batch_size=1024, verbose=1)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1462, in predict
    callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 324, in predict_loop
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[523,599] = 3150 is not in [0, 3150)
         [[{{node embedding_1/embedding_lookup}}]]

ucassee commented 3 years ago

Hi @mult1fractal @replikation,

VirNet is designed for identifying virus reads, not assemblies. Please see https://github.com/alyosama/virnet/issues/8. I think this is the cause of the virnet error, so I suggest you remove it from WtP. I also came across a new tool with good performance in virus identification; linked here for you to consider: https://github.com/ablab/viralVerify

I am using WtP for my next project; you provide a powerful and convenient workflow. Best

mult1fractal commented 3 years ago

Hey @ucassee

Before the input fasta sequences get to virnet, we split the fasta file into 3000 bp chunks, as suggested by the virnet developers (a sketch of this kind of chunking is shown below).
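Purely as an illustration of that kind of chunking (not necessarily the exact command WtP uses internally), a sliding-window split with seqkit would look like:

# cut every contig into non-overlapping 3000 bp windows
seqkit sliding -s 3000 -W 3000 contigs.fasta > contigs_3000bp_chunks.fasta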

Okay nice, I will check it and put it on our list of tools to integrate.