Multiple sample mode Error featureCounts, problem with samtools sort output

kevinpryan commented 2 years ago

Hi there, I'm looking forward to running NeoFuse on my samples. However, I’m having several issues running it on an HPC using SLURM. I am getting the following output when I run NeoFuse with paired-end reads in multiple-sample mode:

chmod: cannot access ‘/mnt/samples_neofuse2_temp.tsv’: No such file or directory
[-------------------------------- [NeoFuse] --------------------------------]

[14:38:43]  Paired End (PE) Reads detected: commencing processing
[14:38:43]  Processing files sample1_R1_001.fastq.gz - sample1_R2_001.fastq.gz
[14:38:43]  STAR Run started
[14:38:43]  Arriba Run started
[E::hts_open_format] Failed to open file /mnt/out/multisample_2sample_test/sample1/STAR/sample1.Aligned.sortedByCoord.out.bam
samtools index: failed to open "/mnt/out/multisample_2sample_test/sample1/STAR/sample1.Aligned.sortedByCoord.out.bam": No such file or directory
[15:31:39]  YARA Run started
[15:42:41]  OptiType Run started
[15:43:00]  featureCounts Run started
An error occured during featureCounts run, check /data/kryan/rna_seq_bc/results/neofuse/multisample_2sample_test/sample1/LOGS/sample1.featureCounts.log for more details
rm: cannot remove '/mnt/samples_neofuse2_temp.tsv': Read-only file system
Elapsed time: 0 days 01 hr 04 min 24 sec

When I look at sample1.featureCounts.log, I get:

ERROR: invalid parameter: '/mnt/out/multisample_2sample_test/sample1/STAR/sample1.Aligned.sortedByCoord.out.bam'

It looks like the issue is with samtools sort: the file sample1.Aligned.sortedByCoord.out.bam has not been generated, although line 247 of NeoFuse_multi.sh should have created it.
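To confirm which samples are missing the sorted BAM before looking at the downstream errors, a check along these lines could help. This is only a sketch: the function name `check_sorted_bam` is hypothetical, and the path layout (`<outdir>/<sample>/STAR/<sample>.Aligned.sortedByCoord.out.bam`) is inferred from the log messages above rather than taken from the NeoFuse documentation.

```bash
#!/bin/bash
# Hypothetical helper: report whether the coordinate-sorted BAM exists
# (and is non-empty) for each sample ID under a NeoFuse output directory.
# The path layout is an assumption inferred from the error messages.
check_sorted_bam() {
    local outdir=$1; shift
    local missing=0 sample bam
    for sample in "$@"; do
        bam="${outdir%/}/${sample}/STAR/${sample}.Aligned.sortedByCoord.out.bam"
        if [[ -s "$bam" ]]; then
            echo "OK      $bam"
        else
            echo "MISSING $bam"
            missing=1
        fi
    done
    # Return 1 if at least one BAM was missing, 0 otherwise
    return "$missing"
}

# Example call (host-side output directory from the batch script):
# check_sorted_bam /data/kryan/rna_seq_bc/results/neofuse/multisample_2sample_test sample1 sample2
```

The check runs on the host side, so it uses the directory passed to `-o` rather than the `/mnt/out` path seen inside the container.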

Here’s the batch script I have been using:

#!/bin/bash
#SBATCH --time=3-16:00:00
#SBATCH --job-name="neofuse"
#SBATCH --output=neofuse_test_2sample_4thgo.out
#SBATCH --mail-type=ALL
#SBATCH --mem-per-cpu=60G   # memory per cpu-core
#SBATCH -n 1    # no of tasks to run
#SBATCH -N 1    # no of nodes to use
#SBATCH -p normal
start_time=$SECONDS

# date started = 17/02/2022

cd /data/kryan/sw/NeoFuse

# run neofuse on 2 tumour samples, using default parameters, try to get 60G memory and run on 1 node

./NeoFuse -i /data/kryan/rna_seq_bc/samples_neofuse2.tsv \
    -s /data/kryan/reference/STAR_idx/ \
    -g /data/kryan/reference/GRCh38.primary_assembly.genome.fa \
    -a /data/kryan/reference/gencode.v31.annotation.gtf \
    -n 8 \
    -o /data/kryan/rna_seq_bc/results/neofuse/multisample_2sample_test/ \
    --singularity

elapsed=$(( SECONDS - start_time ))
eval "echo Elapsed time: $(date -ud "@$elapsed" +'$((%s/3600/24)) days %H hr %M min %S sec')"

Here’s what samples_neofuse2.tsv looks like:

#ID Read1   Read2
sample1 /data/kryan/rna_seq_bc/raw_reads2/sample1_R1_001.fastq.gz   /data/kryan/rna_seq_bc/raw_reads2/sample1_R2_001.fastq.gz
sample2 /data/kryan/rna_seq_bc/raw_reads2/sample2_R1_001.fastq.gz   /data/kryan/rna_seq_bc/raw_reads2/sample2_R2_001.fastq.gz

Singularity version: 3.4.1-4.2.ohpc.1.3.9

abyssum commented 2 years ago

Hello @kevinpryan,

Thank you for pointing this out. I will try to reproduce the error and will probably issue a hotfix promptly. In the meantime, you can try to run your analysis in single-sample mode. Since you are working on an HPC infrastructure, this would be optimal for your set-up, as you can parallelize your jobs (multi-sample processing does not allow parallelization).

Something like the command below should do the trick:

./NeoFuse -1 /data/kryan/rna_seq_bc/raw_reads2/sample1_R1_001.fastq.gz \
    -2 /data/kryan/rna_seq_bc/raw_reads2/sample1_R2_001.fastq.gz \
    -s /data/kryan/reference/STAR_idx/ \
    -g /data/kryan/reference/GRCh38.primary_assembly.genome.fa \
    -a /data/kryan/reference/gencode.v31.annotation.gtf \
    -n 8 \
    -o /data/kryan/rna_seq_bc/results/neofuse/singleSample_sample1_test/ \
    --singularity

Let me know if that works for you.
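For the parallelization mentioned above, one possible approach is a SLURM job array with one task per line of the sample sheet. The following is a sketch under assumptions, not a tested NeoFuse recipe: it assumes the sheet is tab-separated with the `#ID Read1 Read2` layout shown earlier, the helper `nth_sample` is hypothetical, the resource values are placeholders, and the per-sample output directory is made up for illustration.

```bash
#!/bin/bash
#SBATCH --job-name="neofuse"
#SBATCH --array=1-2                  # one array task per sample line (2 samples assumed)
#SBATCH --cpus-per-task=8            # matches -n 8 below
#SBATCH --mem=60G                    # placeholder; adjust to your cluster
#SBATCH --output=neofuse_%A_%a.out   # %A = array job ID, %a = array task index

# Hypothetical helper: print the N-th non-comment line of the sample sheet.
nth_sample() {
    local sheet=$1 n=$2
    grep -v '^#' "$sheet" | sed -n "${n}p"
}

# Only run the pipeline when SLURM has started this script as an array task.
if [[ -n "${SLURM_ARRAY_TASK_ID:-}" ]]; then
    sheet=/data/kryan/rna_seq_bc/samples_neofuse2.tsv
    IFS=$'\t' read -r id read1 read2 < <(nth_sample "$sheet" "$SLURM_ARRAY_TASK_ID")

    cd /data/kryan/sw/NeoFuse || exit 1
    # The output path below is an illustrative assumption.
    ./NeoFuse -1 "$read1" \
        -2 "$read2" \
        -s /data/kryan/reference/STAR_idx/ \
        -g /data/kryan/reference/GRCh38.primary_assembly.genome.fa \
        -a /data/kryan/reference/gencode.v31.annotation.gtf \
        -n 8 \
        -o "/data/kryan/rna_seq_bc/results/neofuse/${id}/" \
        --singularity
fi
```

Giving each task its own output directory keeps concurrent runs from writing into the same place.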

kevinpryan commented 2 years ago

Hi @abyssum,

Thanks for your prompt response. I have also been trying to run this in single-sample mode, as mentioned in #9 (although not in parallel). However, it fails to create the final output for certain samples (the first time I ran it, I got this error on 8 out of 24 samples).

Here is the output message I get for a sample that fails:

[-------------------------------- [NeoFuse] --------------------------------]

[23:25:27]  Paired End (PE) Reads detected: commencing processing
[23:25:27]  Processing files sample1_R1_001.fastq.gz - sample1_R2_001.fastq.gz
[23:25:27]  STAR Run started
[23:25:27]  Arriba Run started
[00:52:54]  YARA Run started
[01:02:39]  OptiType Run started
[01:03:09]  featureCounts Run started
[01:04:09]  Converting Raw Counts to TPM and FPKM
[01:04:12]  Searching for peptides of length 8 
[01:04:12]  MHCFlurry Run started
[01:04:42]  Creating Final Ouptut
An error occured while creating the final output files, check /data/kryan/sw/NeoFuse/./sample1/LOGS/sample1.final.log for more details

Here is what sample1.final.log looks like:

Traceback (most recent call last):
  File "/usr/local/bin/source/build_temp.py", line 114, in <module>
    final_out(inFile, outFile)
  File "/usr/local/bin/source/build_temp.py", line 52, in final_out
    with open(file) as csv_file:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/out/./sample1/NeoFuse/tmp/sample1_8_gene1_gene2_1_8.tsv'

In this case, when I go to the output directory, this file does seem to be present under /data/kryan/sw/NeoFuse/sample1/NeoFuse/tmp/sample1_8_gene1_gene2_1_8.tsv
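The two paths differ because the traceback reports the path as seen inside the container. Judging from the messages above, `/mnt/out` appears to correspond to the output directory on the host (here the working directory, since no `-o` was given); that mapping is an inference from the logs, not documented behaviour. A small hypothetical helper, `to_host_path`, can translate a path from a log into the host path to inspect:

```bash
#!/bin/bash
# Hypothetical helper: translate a container path under /mnt/out into the
# corresponding host path. Assumes /mnt/out maps to the host output directory,
# which is inferred from the log messages, not confirmed by NeoFuse docs.
to_host_path() {
    local container_path=$1 host_outdir=$2
    case "$container_path" in
        /mnt/out/*) ;;
        *) echo "not under /mnt/out: $container_path" >&2; return 1 ;;
    esac
    local rel=${container_path#/mnt/out/}   # strip the container prefix
    rel=${rel#./}                           # drop a leading "./" if present
    echo "${host_outdir%/}/${rel}"
}

# Example:
# to_host_path '/mnt/out/./sample1/NeoFuse/tmp/sample1_8_gene1_gene2_1_8.tsv' /data/kryan/sw/NeoFuse
```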

I am using the following batch script:

#!/bin/bash
#SBATCH --time=3-16:00:00
#SBATCH --job-name="neofuse"
#SBATCH --output=neofuse_allsamples.out
#SBATCH --mail-type=ALL
#SBATCH --mem-per-cpu=60G   # memory per cpu-core

cd /data/kryan/sw/NeoFuse

files="/data/kryan/rna_seq_bc/samples_neofuse_all.tsv"

# run neofuse on all samples

while IFS=$'\t' read -r -a myArray
do
 echo "reading file..."
 echo "${myArray[0]}"
 var=$(date)
 echo "$var"
 ./NeoFuse -1 "${myArray[0]}" \
    -2 "${myArray[1]}" \
    -s /data/kryan/reference/STAR_idx/ \
    -g /data/kryan/reference/GRCh38.primary_assembly.genome.fa \
    -a /data/kryan/reference/gencode.v31.annotation.gtf \
    -n 8 \
    --singularity
done < "$files"

The first 2 lines of samples_neofuse_all.tsv:

/data/kryan/raw_reads/sample1_R1_001.fastq.gz   /data/kryan/raw_reads/sample1_R2_001.fastq.gz
/data/kryan/raw_reads/sample2_R1_001.fastq.gz   /data/kryan/raw_reads/sample2_R2_001.fastq.gz

Do you know why it might be failing on some samples and not others?

Another thing to keep in mind while debugging this is that the Docker image may fail to build. We have found that line 103 of the Dockerfile (installing the tables package) seems to be the issue: when this line is removed, the image builds to completion (but then the pipeline fails at the OptiType step, as far as I can remember).

Should I be opening 2 new issues here?

abyssum commented 2 years ago

Hello @kevinpryan,

Once again, thank you for the detailed feedback; it helps a lot. This behavior is a bit strange tbh, and I have never seen this error before (nor can I reproduce it locally). If the file is produced and is present in the /NeoFuse/tmp/ directory, then it should be processed with no issues. Did you try to re-run the failing samples? (I suggest you rm -rf the output directory before re-running, as temp files might mess with the new run.) If yes, did you get the same error (I mean for this specific gene fusion)? Also, is there any possibility that there are filename collisions in your input files/samples?
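One quick way to look for filename collisions is to list FASTQ basenames that occur more than once in the sample sheet. This is a sketch that assumes the two-column, tab-separated layout of samples_neofuse_all.tsv (Read1 path in the first column); the function name `duplicate_basenames` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print every Read1 basename that appears more than once
# in a tab-separated sample sheet. Comment lines (#...) and blank lines are skipped.
duplicate_basenames() {
    local sheet=$1
    awk -F'\t' '!/^#/ && NF { n = split($1, parts, "/"); print parts[n] }' "$sheet" \
        | sort \
        | uniq -d
}

# Example:
# duplicate_basenames /data/kryan/rna_seq_bc/samples_neofuse_all.tsv
```

No output means every Read1 file name is unique. Any name that is printed belongs to files in different directories that share a name.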

Can you try to do the following:

Create a new TSV file with some/all of the failing samples, e.g. failing_samples_neofuse.tsv:

sample1 /data/kryan/raw_reads/sample1_R1_001.fastq.gz   /data/kryan/raw_reads/sample1_R2_001.fastq.gz
sample2 /data/kryan/raw_reads/sample2_R1_001.fastq.gz   /data/kryan/raw_reads/sample2_R2_001.fastq.gz

Change your batch file to something like:

files="/data/kryan/rna_seq_bc/failing_samples_neofuse.tsv"

while IFS=$'\t' read -r -a myArray
do
 echo "reading file..."
 echo "${myArray[1]}"
 var=$(date)
 echo "$var"
 ./NeoFuse -1 "${myArray[1]}" \
    -2 "${myArray[2]}" \
    -d "${myArray[0]}" \
    -o /data/kryan/test_out/ \
    -s /data/kryan/reference/STAR_idx/ \
    -g /data/kryan/reference/GRCh38.primary_assembly.genome.fa \
    -a /data/kryan/reference/gencode.v31.annotation.gtf \
    -n 8 \
    --singularity
done < "$files"

Where /test_out/ is a directory where you have read/write permissions.

Then try to queue your job and tell me if you get the same error, and I will also try to reproduce this locally.

Thank you for bringing the Docker image issue to my attention, I will look into it as well (I'll open the issue, don't worry about that).

abyssum commented 2 years ago

Hello @kevinpryan,

The Docker image build and the multi-sample mode issues are now resolved with the newest release #13. As for the single-sample issue, if the error with the missing file persists, please open a new issue.

I am closing this issue, feel free to reopen it if something is not working for you with the newest release.

kevinpryan commented 2 years ago

Hi @abyssum,

Thanks for all your help. Hope you had a good weekend.

In terms of the single-sample issue, I re-ran the samples the way you suggested a couple of days ago and got the same error. With the Docker image working, I tried increasing the sleep time from 30 to 120 seconds on line 457 of NeoFuse_single.sh, and it looks like it is now working on the samples that were failing (still in progress, but has completed successfully for the first 6 of the 8 samples). I think that it needed a bit more time for the inputs to build_temp.py to appear for certain samples if that makes sense.
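A longer fixed sleep works, but an alternative would be to poll for the expected file with a timeout, so fast samples are not delayed and slow ones get enough time. The sketch below is not the actual NeoFuse code (the surrounding logic in NeoFuse_single.sh is not shown in this thread); `wait_for_file` and its defaults are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: wait until a file exists and is non-empty,
# or give up after a timeout.
#   $1 = file to wait for
#   $2 = timeout in seconds (default 120)
#   $3 = polling interval in seconds (default 5)
wait_for_file() {
    local file=$1 timeout=${2:-120} interval=${3:-5}
    local waited=0
    until [[ -s "$file" ]]; do
        if (( waited >= timeout )); then
            echo "Timed out after ${timeout}s waiting for $file" >&2
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
    return 0
}

# Example: wait up to 10 minutes, checking every 10 seconds
# wait_for_file /path/to/expected_input.tsv 600 10 || exit 1
```

A non-empty file is not necessarily a completely written one, so this only narrows the race and does not remove it. Waiting on the producing process itself would be more robust where that is possible.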

abyssum commented 2 years ago

Oh, ok! That makes a lot of sense actually. Let me know when/if all of the samples you are running go through without any issues and I'll release a hotfix asap.

Thank you so much for testing and reporting!

kevinpryan commented 2 years ago

The rest of the samples worked thankfully!

I have a fork in which I made the change to both NeoFuse_multi.sh and NeoFuse_single.sh, so I can open a pull request. You might want to implement the fix differently, but the PR could be useful for reference anyway.

icbi-lab / NeoFuse

Multiple sample mode Error featureCounts, problem with samtools sort output #11