Filelock safeguard error when run mixcr

KunFang93 commented 2 years ago

Hi,

After solving the patch error (issue #14) by installing patch on our compute node, the pipeline ran into a new error:

Caused by:
  Process `mixcr (Patient353_T1star : tumor_DNA)` terminated with an error exit status (24)

Command executed:

  mixcr analyze shotgun \
      --threads 40 \
      --species hs \
      --starting-material dna \
      --only-productive \
      Patient353_T1star_tumor_DNA_trimmed_R1.fastq.gz Patient353_T1star_tumor_DNA_trimmed_R2.fastq.gz \
      Patient353_T1star_tumor_DNA_mixcr

Command exit status:
  24

Command output:
  ERROR: File lock safeguard was triggered. Please report this error to support@milaboratories.com.

Command wrapper:
  ERROR: File lock safeguard was triggered. Please report this error to support@milaboratories.com.

Work dir:
  /scratch/u/kfang/ChenHZ_lab/Neoantigen/test2/work/c3/5a4b303cc544c8f790079d0754d082

Do you have any idea how to solve this problem?

Best, Kun

riederd commented 2 years ago

Hi, we have never run into this problem. It seems to be an issue with mixcr that we cannot do much about. Which kind of filesystem is /scratch/u/kfang/ChenHZ_lab/Neoantigen/test2/work using?

KunFang93 commented 2 years ago

Hi

Our cluster admin doesn't know how to solve this issue either. The filesystem is NFS. Thanks!

riederd commented 2 years ago

Is the NFS lock daemon running on the system? Usually it should be, but maybe you can check this as well.
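One rough way to probe whether file locking works in a given directory is to try taking a lock on a scratch file there with flock from util-linux. This is only a sketch: the helper name can_lock_in is made up for illustration, and it is an assumption that this probe exercises the same locking path that mixcr uses.

```shell
# Hypothetical helper: returns 0 if an exclusive, non-blocking lock can be
# taken on a scratch file inside the given directory, non-zero otherwise.
# The 10 second timeout guards against a lock call that hangs.
can_lock_in() {
    local dir="$1" probe status
    probe="$(mktemp "${dir}/.lockprobe.XXXXXX")" || return 2
    timeout 10 flock -x -n "$probe" true
    status=$?
    rm -f "$probe"
    return "$status"
}
```

For example, `can_lock_in /scratch/u/kfang/ChenHZ_lab/Neoantigen/test2/work && echo "locking works"`. On NFSv3, `rpcinfo -p | grep nlockmgr` should also list the lock manager if it is registered.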

You can also manually try to run the mixcr process to see if it was only a transient problem by doing the following:

cd /scratch/u/kfang/ChenHZ_lab/Neoantigen/test2/work/c3/5a4b303cc544c8f790079d0754d082
bash .command.run

If that works, you may resume nextNEOpi with -resume.

If you cannot fix the issue, you can also skip the TCR stuff by using --TCR false.

HTH

KunFang93 commented 2 years ago

Thanks for your suggestion! Will try.

KunFang93 commented 2 years ago

Hi,

I tried skipping the TCR stuff by using --TCR false. The pipeline works fine initially but then gets stuck in MarkDuplicates, just like in issue #17. Is there anything I could do to solve the problem? Thanks for your help!

Best, Kun

riederd commented 2 years ago

Hi, this is strange.

What happens when you cd to the work directory of the MarkDuplicates process and run the .command.run script manually?

First use ctrl+c to stop the pipeline, then look into the .nextflow.log file and get the work dir for the MarkDuplicates process. You might want to look for something like TaskHandler[id: 70; name: MarkDuplicates and note down the directory listed after workDir:

Then cd into this directory and run bash .command.run. You can monitor the activity with top.

Can you also send the output of ls -la in that workDir?
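The log lookup described above can be sketched as a small shell function. The function name find_workdir is hypothetical, and the sed pattern assumes that each matching log line contains "workDir: " followed by the path, as in the TaskHandler[...] entries mentioned above; the exact log layout may differ between Nextflow versions.

```shell
# Hypothetical helper: print the work directories that the Nextflow log
# lists for a given process name. The log file defaults to .nextflow.log.
find_workdir() {
    local process="$1" log="${2:-.nextflow.log}"
    # Keep only lines for this process, then cut out the text after
    # "workDir: " up to the next space, semicolon or closing bracket.
    grep -F "name: ${process}" "$log" \
        | sed -n 's/.*workDir: \([^] ;]*\).*/\1/p' \
        | sort -u
}
```

Running `find_workdir MarkDuplicates` in the directory that holds .nextflow.log prints one path per line; cd into the relevant one and run bash .command.run there.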

KunFang93 commented 2 years ago

Hi,

Thanks for your reply! This is the output of ls -al in the workDir

(base) [kun@g1400png-ap01lp 1f1771d1843dfa04c9ab2159038b5a]$ ls -la
total 48
drwxrwxr-x 2 kun kun  4096 Nov  8 14:35 .
drwxrwxr-x 3 kun kun  4096 Oct 26 12:00 ..
-rw-rw-r-- 1 kun kun     0 Nov  8 14:35 .command.begin
-rw-rw-r-- 1 kun kun   946 Nov  8 14:35 .command.err
-rw-rw-r-- 1 kun kun  1490 Nov  8 14:32 .command.log
-rw-rw-r-- 1 kun kun     0 Nov  8 14:35 .command.out
-rw-rw-r-- 1 kun kun 11019 Oct 26 12:13 .command.run
-rw-rw-r-- 1 kun kun   650 Oct 26 12:13 .command.sh
-rw-rw-r-- 1 kun kun     0 Nov  8 14:35 .command.trace
lrwxrwxrwx 1 kun kun    97 Nov  8 14:35 GRCh38.d1.vd1.dict -> /data/kun/software/nextNEOpi/resources/references/hg38/gdc/GRCh38.d1.vd1/fasta/GRCh38.d1.vd1.dict
lrwxrwxrwx 1 kun kun    95 Nov  8 14:35 GRCh38.d1.vd1.fa -> /data/kun/software/nextNEOpi/resources/references/hg38/gdc/GRCh38.d1.vd1/fasta/GRCh38.d1.vd1.fa
lrwxrwxrwx 1 kun kun    99 Nov  8 14:35 GRCh38.d1.vd1.fa.fai -> /data/kun/software/nextNEOpi/resources/references/hg38/gdc/GRCh38.d1.vd1/fasta/GRCh38.d1.vd1.fa.fai
lrwxrwxrwx 1 kun kun   138 Nov  8 14:35 Patient353_T1star_normal_DNA_aligned_uBAM_merged.bam -> /data/kun/ChenHZ_lab/Neoantigens/patient353/T1/work/bb/6282fe6f5845e1a2dc962465ab05c4/Patient353_T1star_normal_DNA_aligned_uBAM_merged.bam

When I try to run bash .command.run, the screen freezes after this output:

(base) [kun@g1400png-ap01lp 1f1771d1843dfa04c9ab2159038b5a]$ bash .command.run

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
22:35:42.344 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Nov 08 22:35:42 UTC 2022] SetNmMdAndUqTags --INPUT /dev/stdin --OUTPUT Patient353_T1star_normal_DNA_aligned_sort_mkdp.bam --TMP_DIR /tmp/Kun/nextNEOpi --VALIDATION_STRINGENCY LENIENT --MAX_RECORDS_IN_RAM 4194304 --CREATE_INDEX true --REFERENCE_SEQUENCE GRCh38.d1.vd1.fa --IS_BISULFITE_SEQUENCE false --SET_ONLY_UQ false --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

When I use top, I only see a java process:

top - 14:58:45 up 326 days,  5:51,  3 users,  load average: 5.13, 4.99, 4.96
Tasks: 601 total,   1 running, 600 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.8 us,  0.8 sy,  0.0 ni, 89.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 39465558+total,  1187048 free, 14884032 used, 37858451+buff/cache
KiB Swap:  2094076 total,  1415656 free,   678420 used. 37851708+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 67928 yufan     20   0 5604544   5.2g   1176 S 399.3  1.4   2578:19 bwa
240190 yufan     20   0   36.2g   3.3g  19228 S 107.6  0.9   1085:30 java
  5215 gdm       20   0  806208  87512    704 S   2.6  0.0   6112:34 gsd-color
272272 kun       20   0  162548   2816   1588 R   0.7  0.0   0:00.48 top
248733 kun       20   0   70.8g 373588  15280 S   0.3  0.1   0:12.51 java
     1 root      20   0  192024   3364   1632 S   0.0  0.0  11:28.58 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:39.49 kthreadd

However, I checked with ps -ax, and it looks like several commands have been started:

248492 pts/0    S+     0:00 bash .command.run
248513 pts/0    S+     0:00 tee .command.out
248514 pts/0    S+     0:00 tee .command.err
248515 pts/0    S+     0:00 bash .command.run
248518 pts/0    Sl+    0:00 Singularity runtime parent
248539 pts/0    S+     0:00 /bin/bash /data/kun/ChenHZ_lab/Neoantigens/patient353/T1/work/89/1f1771d1843dfa04c9ab2159038b5a/.command.run nxf_trace
248551 ?        S<     0:00 [loop0]
248575 pts/0    S+     0:00 /bin/bash -ue /data/kun/ChenHZ_lab/Neoantigens/patient353/T1/work/89/1f1771d1843dfa04c9ab2159038b5a/.command.sh
248577 pts/0    S+     0:06 /bin/bash /data/kun/ChenHZ_lab/Neoantigens/patient353/T1/work/89/1f1771d1843dfa04c9ab2159038b5a/.command.run nxf_trace
248584 pts/0    Sl+    6:39 sambamba markdup -t 20 --tmpdir /tmp/Kun/nextNEOpi --hash-table-size=1048576 --overflow-list-size=1000000 --io-buffer-size=1024 Patient353_T1s
248585 pts/0    S+     0:00 samtools sort -@20 -m 8G -O BAM -l 0 /dev/stdin
248586 pts/0    S+     0:00 python /opt/conda/bin/gatk --java-options -Xmx64G SetNmMdAndUqTags --TMP_DIR /tmp/Kun/nextNEOpi -R GRCh38.d1.vd1.fa -I /dev/stdin -O Patient35
248733 pts/0    Sl+    0:12 java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.co

I then used the following code to check whether the other PIDs are running:

if ps -p "$1" > /dev/null 2>&1
then
   echo "$1 is running"
   # The process with the PID passed as $1 exists
fi

and found that 248584, 248585, and 248586 are running.
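Note that a successful ps -p only shows that a PID exists, not that the process is making progress. The STAT column gives a little more detail; in the ps -ax listing above, these three PIDs show S+ or Sl+, meaning they are sleeping. A minimal sketch for checking several PIDs at once (the function name pid_states is made up for illustration):

```shell
# Print "PID STATE" for each PID given. STATE is the ps STAT column
# (for example S = sleeping, R = running, D = uninterruptible I/O wait),
# or "gone" if ps reports no such process.
pid_states() {
    local pid stat
    for pid in "$@"; do
        stat="$(ps -o stat= -p "$pid" 2>/dev/null | tr -d ' ')"
        echo "$pid ${stat:-gone}"
    done
}
```

For example, `pid_states 248584 248585 248586` prints one line per PID.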

Weird... Please let me know if you need any more information. Thanks for your help!

riederd commented 2 years ago

Hmm... can you check if /tmp is running out of space while the MarkDuplicates process is running?
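One simple way to watch this is to poll df while the process runs. The sketch below assumes the temporary directory is /tmp/Kun/nextNEOpi, as shown in the --tmpdir and --TMP_DIR options earlier in the thread; the function name free_kb is hypothetical.

```shell
# Print the space available (in KiB) on the filesystem that holds the
# given path, using the POSIX output format of df.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example polling loop, to be run in a second terminal (stop with ctrl+c):
#   while sleep 30; do
#       echo "$(date +%T) $(free_kb /tmp/Kun/nextNEOpi) KiB free"
#   done
```

If the reported value drops towards zero while sambamba markdup and samtools sort are running, the temporary directory is the likely bottleneck.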

riederd commented 2 years ago

I think the problem could be related to a memory limit. Can you please post the contents of /data/kun/ChenHZ_lab/Neoantigens/patient353/T1/work/89/1f1771d1843dfa04c9ab2159038b5a/.error.log?

Try to reserve more memory in slurm for the process by setting something like:

withName:MarkDuplicates {
    cpus = 4
    memory = "96 GB"
}

in conf/process.config
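If the task does terminate, its exit status can hint at a memory limit. Nextflow records the status in a .exitcode file in the task's work directory, and by shell convention a status above 128 means the process was terminated by signal number (status - 128). A status of 137 therefore corresponds to SIGKILL, which is what an out-of-memory kill typically looks like. A minimal sketch, with a made-up function name:

```shell
# Hypothetical helper: turn a numeric exit status into a short description.
describe_exit() {
    local code="$1"
    if [ "$code" -gt 128 ]; then
        # kill -l N prints the signal name for signal number N
        echo "killed by SIG$(kill -l "$((code - 128))")"
    else
        echo "exited with status $code"
    fi
}
```

For example, `describe_exit "$(cat .exitcode)"` run inside the work directory. A SIGKILL is only a hint, not proof, that a memory limit was hit.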

KunFang93 commented 2 years ago

Sorry for the late reply. I don't see .error.log in the folder:

(base) [kun@g1400png-ap01lp 1f1771d1843dfa04c9ab2159038b5a]$ less .
./              ../             .command.begin  .command.err    .command.log    .command.out    .command.run    .command.sh     .command.trace  .exitcode

OK, I will try it with the modified config file. Since we have found an alternative way to predict neoantigens for now, I will try your suggestion and report the results later in case others run into the same problem. Thanks again for your time and help!

icbi-lab / nextNEOpi

Filelock safeguard error when run mixcr #16