s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0

"no such file or directory" error #254

Open Emmashipman opened 1 year ago

Emmashipman commented 1 year ago

Hello,

I am trying to use HTStream. I have based my code off the tutorial here: https://ucdavis-bioinformatics-training.github.io/2020-mRNA_Seq_Workshop/data_reduction/01-preproc_htstream_mm

and I referred to the sbatch example script heavily:

#!/bin/bash

#SBATCH --job-name=htstream # Job name
#SBATCH --nodes=1
#SBATCH --ntasks=9
#SBATCH --time=60
#SBATCH --mem=3000 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --partition=production
#SBATCH --reservation=mrnaseq_workshop
#SBATCH --account=mrnaseq_workshop
#SBATCH --array=1-22
#SBATCH --output=slurmout/htstream_%A_%a.out # File to which STDOUT will be written
#SBATCH --error=slurmout/htstream_%A_%a.err # File to which STDERR will be written
#SBATCH --mail-type=ALL
#SBATCH --mail-user=myemail@email.com

start=`date +%s`
echo $HOSTNAME
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

# pull the sample name from line number $SLURM_ARRAY_TASK_ID of samples.txt (sed's "Nq;d" prints only line N)
sample=`sed "${SLURM_ARRAY_TASK_ID}q;d" samples.txt`

inpath="00-RawData"
outpath="01-HTS_Preproc"
[[ -d ${outpath} ]] || mkdir ${outpath}
[[ -d ${outpath}/${sample} ]] || mkdir ${outpath}/${sample}

echo "SAMPLE: ${sample}"

module load htstream/1.3.2

call="hts_Stats -L ${outpath}/${sample}/${sample}.json -N 'initial stats' \
          -1 ${inpath}/${sample}/*R1.fastq.gz \
          -2 ${inpath}/${sample}/*R2.fastq.gz | \
      hts_SeqScreener -A ${outpath}/${sample}/${sample}.json -N 'screen phix' | \
      hts_SeqScreener -A ${outpath}/${sample}/${sample}.json -N 'count the number of rRNA reads' \
          -r -s References/mouse_rrna.fasta | \
      hts_SuperDeduper -A ${outpath}/${sample}/${sample}.json -N 'remove PCR duplicates' | \
      hts_AdapterTrimmer -A ${outpath}/${sample}/${sample}.json -N 'trim adapters' | \
      hts_PolyATTrim --no-left --skip_polyT -A ${outpath}/${sample}/${sample}.json -N 'remove polyAT tails' | \
      hts_NTrimmer -A ${outpath}/${sample}/${sample}.json -N 'remove any remaining N characters' | \
      hts_QWindowTrim -A ${outpath}/${sample}/${sample}.json -N 'quality trim the ends of reads' | \
      hts_LengthFilter -A ${outpath}/${sample}/${sample}.json -N 'remove reads < 50bp' \
          -n -m 50 | \
      hts_Stats -A ${outpath}/${sample}/${sample}.json -N 'final stats' \
          -f ${outpath}/${sample}/${sample}"

echo $call   # print the assembled pipeline into the job log for the record
eval $call   # then execute it

end=`date +%s`
runtime=$((end-start))
echo $runtime

When I run my script, I get a "no such file or directory" error for the line with the
 hts_Stats -L
command. I have tried running the program both with and without pre-creating a blank file of that name, but the error is the same. It seems the program cannot create or see the log.

Here is how my script looks:

#!/bin/bash

#SBATCH --job-name=htstream_N
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --time=12:00:00
#SBATCH --mem=32000
#SBATCH --output=slurmout/htstreamN%A_%a.out # File to which STDOUT will be written
#SBATCH --error=slurmout/htstreamN%A_%a.err # File to which STDERR will be written
#SBATCH --mail-user=
#SBATCH --mail-type=ALL

start=`date +%s`
echo $HOSTNAME
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

module load deprecated/HTStream/1.3.3

call="hts_Stats -L /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "initial stats" \
          -1 /home/user/RNASEQDATADUMP/NWTsalt8/NWTsalt8_CKDL220024596-1A_HJ2W5BBXX_L6_1.fq.gz \
          -2 /home/user/RNASEQDATADUMP/NWTsalt8/NWTsalt8_CKDL220024596-1A_HJ2W5BBXX_L6_1.fq.gz | \
      hts_SeqScreener -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "screen phix" | \
      hts_SeqScreener -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "count rRNAs" \
          --record --seq /home/user/RNASEQDATADUMP/wheat_rrna.fasta | \
      hts_Overlapper -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "overlap PE reads" | \
      hts_AdapterTrimmer -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "trim adapters" | \
      hts_PolyATTrim -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "trim polyAT" | \
      hts_NTrimmer -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "remove N chars" | \
      hts_QWindowTrim -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "quality trim ends" | \
      hts_LengthFilter -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "remove reads < 50 bp" \
          -n -m 50 | \
      hts_Stats -A /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "final stats" \
          -f /home/user/RNASEQDATADUMP/HTS_outputs/N_hts_paired.htstream"

echo $call
eval $call

end=`date +%s`
runtime=$((end-start))
echo $runtime



My file path is correct, and it fails on the line with hts_Stats -L /home/user/RNASEQDATADUMP/HTS_outputs/N_htsStats.json -N "initial stats"

I would really appreciate any help. My IT department suggested something could be being sent incorrectly to my slurmout folder, but the .out and .err files are writing normally.

The error is like this:
/var/spool/slurmd/job66287601/slurm_script: line 23: 50: No such file or directory
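
One hedged reading of an error with this shape (it may or may not be the fault here): the whole pipeline is stored in a single double-quoted string, so the inner double quotes around each -N label close that string early, and the < in "remove reads < 50 bp" ends up outside any quotes, where the shell parses < 50 as input redirection from a file named 50, which produces exactly a "50: No such file or directory". A minimal sketch of the pitfall and the fix (stats.json is a placeholder path):

# broken: the " after -N ends the outer string, so the shell sees a bare < 50
# and tries to redirect input from a file named 50
call="hts_LengthFilter -A stats.json -N "remove reads < 50 bp" -n -m 50"

# safer: single-quote the labels inside the double-quoted string, as the tutorial script does
call="hts_LengthFilter -A stats.json -N 'remove reads < 50 bp' -n -m 50"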

I am not a programmer so I am not sure how to proceed. I also tried this with version 1.3.2 and had the same problem.

Thank you for any insights or corrections.
AstrobioMike commented 1 year ago

Heya, @Emmashipman,

Just came across this while looking at some other issues here. Are you sure the slurm compute node is connected to the location you are trying to access?

You can check that for sure by making a new slurm script where the call just checks for the directory you are trying to use. So the call line would be something like this:

call="ls /home/user/RNASEQDATADUMP/HTS_outputs/"

Running that slurm script and seeing whether it prints the contents of that location or gives you an error will at least let you know if slurm itself can see the location. (You could also probably just run srun ls /home/user/RNASEQDATADUMP/HTS_outputs/ to check, but you might as well do everything the same way in case the script itself is making a difference.)
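
For completeness, a full version of that check script could look something like this (the SBATCH header values here are just placeholders, so adjust them to whatever your cluster expects):

#!/bin/bash
#SBATCH --job-name=check_path
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=5
#SBATCH --output=slurmout/check_%j.out
#SBATCH --error=slurmout/check_%j.err

# if the compute node can see this location, its contents land in the .out file;
# if it can't, the ls error lands in the .err file
call="ls /home/user/RNASEQDATADUMP/HTS_outputs/"
echo $call
eval $call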

good luck!

Emmashipman commented 1 year ago

Hi, thanks for the reply! I fixed that (it was a small error in my shell script), but I have found what looks like a more serious bug: when I run a set of the HTS programs on a pair of PE RNA-Seq files, I get very strange output and an error message.

The run does not produce any trimmed fastq.gz files, and the JSON log only records a few of the programs (e.g., hts_NTrimmer, hts_QWindowTrim, and the final run of hts_Stats) and shows hardly any input or output reads (about 2,400, instead of the several million in the file). It looks like this:

"Program_details": { "program": "hts_NTrimmer", "version": "v1.3.3", "options": { "append-stats-file": "/home/enshipma/RNASEQDATADUMP/HTS_outputs/N_h> "exclude": false, "force": false, "notes": "remove N chars", "uncompressed": false } }, "Fragment": { "in": 2319, "out": 2319, "basepairs_in": 694532, "basepairs_out": 689580 }, "Single_end": { "in": 18, "out": 18, "basepairs_in": 4263, "basepairs_out": 4234,

As you can see, the read counts are really low, and the log is missing my first five programs entirely; the record only starts at hts_NTrimmer.
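
If it helps to see at a glance which programs actually wrote to the log, something like this should list them (a sketch assuming the file is a JSON array of per-program records shaped like the excerpt above, and that jq is available on the cluster):

jq -r '.[].Program_details.program' /home/enshipma/RNASEQDATADUMP/HTS_outputs/N_htsStats.json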

I also get this error message on some of my runs.

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 10) > this->size() (which is 9)
ERROR: There are not either 3 (SE, itab3), 4 (SE, itab with tags) 5 (PE, itab5), or 6 (PE, itab6), or 8 (PE, itab6 with tags) elements within a tab delimited file line

I found the error message in the GitHub source, where it seems to come from the i/o handler. I assume it's a problem with the tab-delimited records that get streamed from program to program.
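
To look at that stream directly: the records passed between programs are tab-delimited, one read (or pair) per line, so counting fields on the first few lines shows whether the interleaved format is intact (a sketch; test.json and the sample paths are placeholders, and the expected field counts are the 3/4/5/6/8 listed in the error message above):

# peek at the first few interleaved records and count tab-separated fields per line
hts_Stats -L test.json -N 'peek at the stream' \
    -1 sample_R1.fq.gz -2 sample_R2.fq.gz | \
  head -n 5 | awk -F'\t' '{ print NR": "NF" fields" }'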

I ran a test today using only a few of the hts programs (through the SeqScreener steps), to see if I could isolate the point at which the stream fails. I got a JSON file that looked as expected, but no fastq output files.

Here is my code:

hts_Stats -L /home/enshipma/RNASEQDATADUMP/HTS_outputs/B_htsStats.json -N 'initial stats' \
    -1 /home/enshipma/RNASEQDATADUMP/BWTsalt3/BWTsalt3_CKDL220018431-1A_HJ2W5BBXX_L6_1.fq.gz \
    -2 /home/enshipma/RNASEQDATADUMP/BWTsalt3/BWTsalt3_CKDL220018431-1A_HJ2W5BBXX_L6_1.fq.gz | \
hts_SeqScreener -A /home/enshipma/RNASEQDATADUMP/HTS_outputs/B_htsStats.json -N 'screen phix' | \
hts_SeqScreener -A /home/enshipma/RNASEQDATADUMP/HTS_outputs/B_htsStats.json -N 'count rRNAs' \
    --record --seq /home/enshipma/RNASEQDATADUMP/wheat_rrna.fasta | \
hts_Stats -A /home/enshipma/RNASEQDATADUMP/HTS_outputs/B_htsStats.json -N 'final stats' \
    -f /home/enshipma/RNASEQDATADUMP/HTS_outputs/B_hts_test1/

The slurmout file seems to have all 55 million lines dumped into it from stdout, and the error file is blank. I have no fastq output files from this run. My next step will be to add the next program after SeqScreener and see if the stream fails there...
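
One way to keep bisecting like that without flooding the slurmout file: leave -f off the last program under test and discard the read stream, so only the JSON log is kept (a sketch with shortened placeholder paths):

# reads are discarded, but each program still appends its stats to the JSON
hts_Stats -L B_htsStats.json -N 'initial stats' \
    -1 BWTsalt3_R1.fq.gz -2 BWTsalt3_R2.fq.gz | \
hts_SeqScreener -A B_htsStats.json -N 'screen phix' > /dev/null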

I am just wondering whether you have encountered this issue before, and if you know which step can end up being overzealous in trimming?