hukai916 / scATACpipe


Cutadapt timeout #5

Open priscilla-glenn opened 1 year ago

priscilla-glenn commented 1 year ago

Hi, I've tried running scATACpipe, and while this time it started without issues, cutadapt has timed out. I've included my nextflow.log for your convenience, as I'm not sure why it was taking so long to run. I did notice one discrepancy: cutadapt said it was trying to use 32 cores (pasted below) even though in my config file I said the max should be 28 cores. So my one guess would be that the computer is trying to use 32 cores when I didn't give it access to 32. Would there be a way to fix it so cutadapt isn't trying to use 32 cores?

Thanks!

Command output:

    This is cutadapt 3.4 with Python 3.9.1
    Command line parameters: --cores=0 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
    Processing reads on 32 cores in paired-end mode ...

nextflow.log_May12.txt

Command wrapper:

    This is cutadapt 3.4 with Python 3.9.1
    Command line parameters: --cores=0 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
    Processing reads on 32 cores in paired-end mode ...

hukai916 commented 1 year ago

Hi @priscilla-glenn, the --cores=0 flag directs cutadapt to auto-detect and use all available cores. There are multiple places to specify config parameters to Nextflow; the order of their priorities can be found here. Your "28 cores" setting may well be overwritten depending on where you specified it.
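For example, parameters passed on the command line have the highest priority, so one way to make sure the cap sticks is to pass it there explicitly (hypothetical invocation; adjust the profile to match yours):

    nextflow run main.nf -c Mouse_custom.config -profile docker,local --max_cpus 28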

To further debug the cutadapt timeout issue, you can go to the working directory, e.g. /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841, and check out the .command.log file. You can also share it here.
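The other .command.* files in the same work directory are Nextflow's standard task wrappers and are worth a quick look too; for example:

    cd /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841
    cat .command.sh            # the exact command Nextflow ran for this task
    tail -n 50 .command.log    # combined stdout/stderr of the task
    cat .exitcode 2>/dev/null  # exit status, only present if the task finished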

priscilla-glenn commented 1 year ago

Thank you. I will try resetting some of the config parameters to see if that helps.

In addition, here is the .command.log file for your reference.

    This is cutadapt 3.4 with Python 3.9.1
    Command line parameters: --cores=0 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz
    Processing reads on 32 cores in paired-end mode ...
    nxf-uiXqo6FQ015ZXvCk4oxA87gZ

My current config file (before adjusting more):

    params {
        preprocess = 'default'
        input_fastq = '/home/blackmonlab/big-storage/Priscilla/SingleCell/MouseSC/scATACpipe_mouse_samplesheet.csv'
        outdir = './scATAC_mouse2'
        species_latin_name = 'Mus musculus'
        ref_fasta_ensembl = 'mus_musculus'
        split_fastq = false
        barcode_correction = 'pheniqs'
        whitelist_barcode = 'assets/whitelist_barcodes'
        filter = 'both'
        doublet_removal_algorithm = 'archr'
        archr_thread = 8
        archr_blacklist = false
        archr_batch_correction_harmony = true
        filter_sample = false
        filter_seurat_ilsi = false
        custom_peaks = false
        archr_scrnaseq = false
        archr_scrnaseq_grouplist = false
        max_memory = '110.GB'
        max_cpus = 28
        max_time = '240.h'
    }

And here is the .command.run file as well. Please let me know if there are any other files that could help with the troubleshooting. Thank you!

#!/bin/bash

# NEXTFLOW TASK: SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1)

set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 1 ]] && set -x
NXF_ENTRY=${1:-nxf_main}

nxf_tree() {
    local pid=$1

declare -a ALL_CHILDREN
while read P PP;do
    ALL_CHILDREN[$PP]+=" $P"
done < <(ps -e -o pid= -o ppid=)

pstat() {
    local x_pid=$1
    local STATUS=$(2> /dev/null < /proc/$1/status grep -E 'Vm|ctxt')

    if [ $? = 0 ]; then
    local  x_vsz=$(echo "$STATUS" | grep VmSize | awk '{print $2}' || echo -n '0')
    local  x_rss=$(echo "$STATUS" | grep VmRSS | awk '{print $2}' || echo -n '0')
    local x_peak=$(echo "$STATUS" | grep -E 'VmPeak|VmHWM' | sed 's/^.*:\s*//' | sed 's/[\sa-zA-Z]*$//' | tr '\n' ' ' || echo -n '0 0')
    local x_pmem=$(awk -v rss=$x_rss -v mem_tot=$mem_tot 'BEGIN {printf "%.0f", rss/mem_tot*100*10}' || echo -n '0')
    local vol_ctxt=$(echo "$STATUS" | grep '\bvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
    local inv_ctxt=$(echo "$STATUS" | grep '\bnonvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
    cpu_stat[x_pid]="$x_pid $x_pmem $x_vsz $x_rss $x_peak $vol_ctxt $inv_ctxt"
    fi
}

pwalk() {
    pstat $1
    for i in ${ALL_CHILDREN[$1]:=}; do pwalk $i; done
}

pwalk $1

}

nxf_stat() {
    cpu_stat=()
    nxf_tree $1

declare -a sum=(0 0 0 0 0 0 0 0)
local pid
local i
for pid in "${!cpu_stat[@]}"; do
    local row=(${cpu_stat[pid]})
    [ $NXF_DEBUG = 1 ] && echo "++ stat mem=${row[*]}"
    for i in "${!row[@]}"; do
    if [ $i != 0 ]; then
        sum[i]=$((sum[i]+row[i]))
    fi
    done
done

[ $NXF_DEBUG = 1 ] && echo -e "++ stat SUM=${sum[*]}"

for i in {1..7}; do
    if [ ${sum[i]} -lt ${cpu_peak[i]} ]; then
        sum[i]=${cpu_peak[i]}
    else
        cpu_peak[i]=${sum[i]}
    fi
done

[ $NXF_DEBUG = 1 ] && echo -e "++ stat PEAK=${sum[*]}\n"
nxf_stat_ret=(${sum[*]})

}

nxf_mem_watch() {
    set -o pipefail
    local pid=$1
    local trace_file=.command.trace
    local count=0;
    declare -a cpu_stat=(0 0 0 0 0 0 0 0)
    declare -a cpu_peak=(0 0 0 0 0 0 0 0)
    local mem_tot=$(< /proc/meminfo grep MemTotal | awk '{print $2}')
    local timeout
    local DONE
    local STOP=''

[ $NXF_DEBUG = 1 ] && nxf_sleep 0.2 && ps fx

while true; do
    nxf_stat $pid
    if [ $count -lt 10 ]; then timeout=1;
    elif [ $count -lt 120 ]; then timeout=5;
    else timeout=30;
    fi
    read -t $timeout -r DONE || true
    [[ $DONE ]] && break
    if [ ! -e /proc/$pid ]; then
        [ ! $STOP ] && STOP=$(nxf_date)
        [ $(($(nxf_date)-STOP)) -gt 10000 ] && break
    fi
    count=$((count+1))
done

echo "%mem=${nxf_stat_ret[1]}"      >> $trace_file
echo "vmem=${nxf_stat_ret[2]}"      >> $trace_file
echo "rss=${nxf_stat_ret[3]}"       >> $trace_file
echo "peak_vmem=${nxf_stat_ret[4]}" >> $trace_file
echo "peak_rss=${nxf_stat_ret[5]}"  >> $trace_file
echo "vol_ctxt=${nxf_stat_ret[6]}"  >> $trace_file
echo "inv_ctxt=${nxf_stat_ret[7]}"  >> $trace_file

}

nxf_write_trace() {
    echo "nextflow.trace/v2"           > $trace_file
    echo "realtime=$wall_time"         >> $trace_file
    echo "%cpu=$ucpu"                  >> $trace_file
    echo "rchar=${io_stat1[0]}"        >> $trace_file
    echo "wchar=${io_stat1[1]}"        >> $trace_file
    echo "syscr=${io_stat1[2]}"        >> $trace_file
    echo "syscw=${io_stat1[3]}"        >> $trace_file
    echo "read_bytes=${io_stat1[4]}"   >> $trace_file
    echo "write_bytes=${io_stat1[5]}"  >> $trace_file
}

nxf_trace_mac() {
    local start_millis=$(nxf_date)

/bin/bash -euo pipefail /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.command.sh

local end_millis=$(nxf_date)
local wall_time=$((end_millis-start_millis))
local ucpu=''
local io_stat1=('' '' '' '' '' '')
nxf_write_trace

}

nxf_fd() {
    local FD=11
    while [ -e /proc/$$/fd/$FD ]; do FD=$((FD+1)); done
    echo $FD
}

nxf_trace_linux() {
    local pid=$$
    command -v ps &>/dev/null || { >&2 echo "Command 'ps' required by nextflow to collect task metrics cannot be found"; exit 1; }
    local num_cpus=$(< /proc/cpuinfo grep '^processor' -c)
    local tot_time0=$(grep '^cpu ' /proc/stat | awk '{sum=$2+$3+$4+$5+$6+$7+$8+$9; printf "%.0f",sum}')
    local cpu_time0=$(2> /dev/null < /proc/$pid/stat awk '{printf "%.0f", ($16+$17)*10 }' || echo -n 'X')
    local io_stat0=($(2> /dev/null < /proc/$pid/io sed 's/^.*:\s*//' | head -n 6 | tr '\n' ' ' || echo -n '0 0 0 0 0 0'))
    local start_millis=$(nxf_date)
    trap 'kill $mem_proc' ERR

/bin/bash -euo pipefail /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.command.sh &
local task=$!

mem_fd=$(nxf_fd)
eval "exec $mem_fd> >(nxf_mem_watch $task)"
local mem_proc=$!

wait $task

local end_millis=$(nxf_date)
local tot_time1=$(grep '^cpu ' /proc/stat | awk '{sum=$2+$3+$4+$5+$6+$7+$8+$9; printf "%.0f",sum}')
local cpu_time1=$(2> /dev/null < /proc/$pid/stat awk '{printf "%.0f", ($16+$17)*10 }' || echo -n 'X')
local ucpu=$(awk -v p1=$cpu_time1 -v p0=$cpu_time0 -v t1=$tot_time1 -v t0=$tot_time0 -v n=$num_cpus 'BEGIN { pct=(p1-p0)/(t1-t0)*100*n; printf("%.0f", pct>0 ? pct : 0) }' )

local io_stat1=($(2> /dev/null < /proc/$pid/io sed 's/^.*:\s*//' | head -n 6 | tr '\n' ' ' || echo -n '0 0 0 0 0 0'))
local i
for i in {0..5}; do
    io_stat1[i]=$((io_stat1[i]-io_stat0[i]))
done

local wall_time=$((end_millis-start_millis))
[ $NXF_DEBUG = 1 ] && echo "+++ STATS %CPU=$ucpu TIME=$wall_time I/O=${io_stat1[*]}"

echo "nextflow.trace/v2"           > $trace_file
echo "realtime=$wall_time"         >> $trace_file
echo "%cpu=$ucpu"                  >> $trace_file
echo "rchar=${io_stat1[0]}"        >> $trace_file
echo "wchar=${io_stat1[1]}"        >> $trace_file
echo "syscr=${io_stat1[2]}"        >> $trace_file
echo "syscw=${io_stat1[3]}"        >> $trace_file
echo "read_bytes=${io_stat1[4]}"   >> $trace_file
echo "write_bytes=${io_stat1[5]}"  >> $trace_file

[ -e /proc/$mem_proc ] && eval "echo 'DONE' >&$mem_fd" || true
wait $mem_proc 2>/dev/null || true
while [ -e /proc/$mem_proc ]; do nxf_sleep 0.1; done

}

nxf_trace() {
    local trace_file=.command.trace
    touch $trace_file
    if [[ $(uname) = Darwin ]]; then
        nxf_trace_mac
    else
        nxf_trace_linux
    fi
}

nxf_container_env() {
cat << EOF
export HDF5_USE_FILE_LOCKING="FALSE"
export RHDF5_USE_FILE_LOCKING="FALSE"
export NXF_SINGULARITY_CACHEDIR="./work/singularity"
export PYTHONNOUSERSITE="1"
export R_PROFILE_USER="/.Rprofile"
export R_ENVIRON_USER="/.Renviron"
export PATH="\$PATH:/mnt/md1/Priscilla/SingleCell/scATACpipe/bin"
EOF
}

nxf_sleep() { sleep $1 2>/dev/null || sleep 1; }

nxf_date() {
    local ts=$(date +%s%3N);
    if [[ ${#ts} == 10 ]]; then echo ${ts}000
    elif [[ $ts == *%3N ]]; then echo ${ts/\%3N/000}
    elif [[ $ts == *3N ]]; then echo ${ts/3N/000}
    elif [[ ${#ts} == 13 ]]; then echo $ts
    else echo "Unexpected timestamp value: $ts"; exit 1
    fi
}

nxf_env() {
    echo '============= task environment ============='
    env | sort | sed "s/\(.*\)AWS\(.*\)=\(.\{6\}\).*/\1AWS\2=\3xxxxxxxxxxxxx/"
    echo '============= task output =================='
}

nxf_kill() {
    declare -a children
    while read P PP;do
        children[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

kill_all() {
    [[ $1 != $$ ]] && kill $1 2>/dev/null || true
    for i in ${children[$1]:=}; do kill_all $i; done
}

kill_all $1

}

nxf_mktemp() {
    local base=${1:-/tmp}
    mkdir -p "$base"
    if [[ $(uname) = Darwin ]]; then mktemp -d $base/nxf.XXXXXXXXXX
    else TMPDIR="$base" mktemp -d -t nxf.XXXXXXXXXX
    fi
}

nxf_fs_copy() {
    local source=$1
    local target=$2
    local basedir=$(dirname $1)
    mkdir -p $target/$basedir
    cp -fRL $source $target/$basedir
}

nxf_fs_move() {
    local source=$1
    local target=$2
    local basedir=$(dirname $1)
    mkdir -p $target/$basedir
    mv -f $source $target/$basedir
}

nxf_fs_rsync() {
    rsync -rRl $1 $2
}

nxf_fs_rclone() {
    rclone copyto $1 $2/$1
}

nxf_fs_fcp() {
    fcp $1 $2/$1
}

on_exit() {
    exit_status=${nxf_main_ret:=$?}
    printf $exit_status > /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.exitcode
    set +u
    [[ "$tee1" ]] && kill $tee1 2>/dev/null
    [[ "$tee2" ]] && kill $tee2 2>/dev/null
    [[ "$ctmp" ]] && rm -rf $ctmp || true
    docker rm $NXF_BOXID &>/dev/null || true
    sync || true
    exit $exit_status
}

on_term() {
    set +e
    docker kill $NXF_BOXID
}

nxf_launch() {
    docker run -i --cpu-shares 12288 --memory 8192m -e "NXF_DEBUG=${NXF_DEBUG:=0}" -v /mnt/md1/Priscilla/SingleCell/scATACpipe:/mnt/md1/Priscilla/SingleCell/scATACpipe -w "$PWD" -u $(id -u):$(id -g) --rm -v /Users:/Users -v /tmp:/tmp --name $NXF_BOXID hukai916/cutadapt_xenial:0.1 /bin/bash -c "eval $(nxf_container_env); /bin/bash /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.command.run nxf_trace"
}

nxf_stage() {
    true
    # stage input files
    rm -f R1_Sample_S1_1.barcoded.fastq.gz
    rm -f R2_Sample_S1_1.barcoded.fastq.gz
    ln -s /mnt/md1/Priscilla/SingleCell/scATACpipe/work/ca/08da56da2ae1c4e0385a13edd8b6fc/R1_Sample_S1_1.barcoded.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz
    ln -s /mnt/md1/Priscilla/SingleCell/scATACpipe/work/ca/08da56da2ae1c4e0385a13edd8b6fc/R2_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz

}

nxf_unstage() {
    true
    [[ ${nxf_main_ret:=0} != 0 ]] && return
}

nxf_main() {
    trap on_exit EXIT
    trap on_term TERM INT USR2
    trap '' USR1

[[ "${NXF_CHDIR:-}" ]] && cd "$NXF_CHDIR"
export NXF_BOXID="nxf-$(dd bs=18 count=1 if=/dev/urandom 2>/dev/null | base64 | tr +/ 0A)"
NXF_SCRATCH=''
[[ $NXF_DEBUG > 0 ]] && nxf_env
touch /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841/.command.begin
set +u
set -u
[[ $NXF_SCRATCH ]] && cd $NXF_SCRATCH
nxf_stage

set +e
ctmp=$(set +u; nxf_mktemp /dev/shm 2>/dev/null || nxf_mktemp $TMPDIR)
local cout=$ctmp/.command.out; mkfifo $cout
local cerr=$ctmp/.command.err; mkfifo $cerr
tee .command.out < $cout &
tee1=$!
tee .command.err < $cerr >&2 &
tee2=$!
( nxf_launch ) >$cout 2>$cerr &
pid=$!
wait $pid || nxf_main_ret=$?
wait $tee1 $tee2
nxf_unstage

}

$NXF_ENTRY

hukai916 commented 1 year ago

Typically, cutadapt should be very fast. I think the next debugging step is to try to run the problematic command directly. Here is how:

  1. go to the working directory: cd /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841
  2. activate the container for cutadapt: hukai916/cutadapt_xenial:0.1
    • docker run -it hukai916/cutadapt_xenial:0.1 if you are using docker, or singularity shell xxx.sif if you are using Singularity; the xxx.sif file should be under the work/singularity directory. (See the sketch after this list.)
  3. copy the command from .command.sh and run it to see if you can replicate the error.
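For docker, putting the steps together looks something like this (untested sketch; note that you will likely need to mount the pipeline folder with -v, because the staged fastq files are symlinks into other work directories):

    cd /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841
    docker run -it \
        -v /mnt/md1/Priscilla/SingleCell/scATACpipe:/mnt/md1/Priscilla/SingleCell/scATACpipe \
        -w "$PWD" hukai916/cutadapt_xenial:0.1 /bin/bash
    # inside the container, run the command copied from .command.sh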

Btw, I notice another config, /mnt/md1/Priscilla/SingleCell/scATACpipe/Mouse_custom.config, which might set max_cpus = 32 and override the default?

priscilla-glenn commented 1 year ago

Thank you.

I went to the working directory cd /mnt/md1/Priscilla/SingleCell/scATACpipe/work/b5/ec60d7f751e55ff29b6016de4ba841 and activated the container for cutadapt successfully. My terminal has (base) root@b407de9f1cc3:/# on the far left.

I then tried to run the command from .command.sh:

    cutadapt --cores=0 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz

However, I get an error stating it cannot find my input file. And when I look in the directory using ls while in the docker container, the files from my original working directory are no longer listed.

    ERROR: Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/cutadapt/pipeline.py", line 552, in run
        with self._opener.xopen(self.path, 'rb') as f:
      File "/opt/conda/lib/python3.9/site-packages/cutadapt/utils.py", line 186, in xopen
        f = open_raise_limit(
      File "/opt/conda/lib/python3.9/site-packages/cutadapt/utils.py", line 51, in open_raise_limit
        f = func(*args, **kwargs)
      File "/opt/conda/lib/python3.9/site-packages/xopen/__init__.py", line 609, in xopen
        return _open_gz(filename, mode, compresslevel, threads)
      File "/opt/conda/lib/python3.9/site-packages/xopen/__init__.py", line 505, in _open_gz
        return igzip.open(filename, mode)
      File "/opt/conda/lib/python3.9/site-packages/isal/igzip.py", line 92, in open
        binary_file = IGzipFile(filename, gz_mode, compresslevel)
      File "/opt/conda/lib/python3.9/site-packages/isal/igzip.py", line 151, in __init__
        super().__init__(filename, mode, compresslevel, fileobj, mtime)
      File "/opt/conda/lib/python3.9/gzip.py", line 173, in __init__
        fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: 'R1_Sample_S1_1.barcoded.fastq.gz'

As for the Mouse_custom.config, that is the config I generated for this project, and in it I set max_cpus = 28.

priscilla-glenn commented 1 year ago

Just thought I'd give an update.

I went into the cutadapt.nf code and replaced the option variable with cores=28. However, when I resume the session, although I can see that the .command.sh file now states cores=28 instead of cores=0, it still doesn't appear to work. When I first restart the session, I can see on my system monitor that it starts working and the cores are being used; however, after about five minutes it goes into a cycle of never finishing cutadapt, and the system shows that not much processing is occurring on the cores.

In addition, I ran cutadapt 4.4 outside of scATACpipe to get an estimate of how long cutadapt should take on my data, and it finished in about 30 minutes.

At this point, I'm not sure why it won't finish cutadapt. So hopefully getting it to run in the cutadapt container will let us debug the issue.

Thanks,

    May-17 11:28:37.222 [Task submitter] INFO nextflow.Session - [a0/bc5742] Submitted process > SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1)
    May-17 11:33:36.310 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
    ~> TaskHandler[id: 12; name: SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1); status: RUNNING; exit: -; error: -; workDir: /mnt/md1/Priscilla/SingleCell/scATACpipe/work/a0/bc574254865a25e80499c88af84974]
    May-17 11:38:36.387 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
    ~> TaskHandler[id: 12; name: SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1); status: RUNNING; exit: -; error: -; workDir: /mnt/md1/Priscilla/SingleCell/scATACpipe/work/a0/bc574254865a25e80499c88af84974]
    May-17 11:43:36.451 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
    ~> TaskHandler[id: 12; name: SCATACPIPE:PREPROCESS_DEFAULT:CUTADAPT (1); status: RUNNING; exit: -; error: -; workDir: /mnt/md1/Priscilla/SingleCell/scATACpipe/work/a0/bc574254865a25e80499c88af84974]

hukai916 commented 1 year ago

Hi, I am on vacation and might be slow in response.

To use docker, you need to mount your local folder to the container, check out here. Or, you can run the docker command in non-interactive mode, without the -it flag. Let me know if you have any questions.
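A quick way to see why the inputs vanished inside the container (run on the host, in the work directory; this is exactly the symlink situation the -v mount fixes):

    ls -l R1_Sample_S1_1.barcoded.fastq.gz        # shows the '->' target in another work dir
    readlink -f R1_Sample_S1_1.barcoded.fastq.gz  # absolute path of the real file
    # unless that target path is also visible inside the container,
    # the symlink dangles there and cutadapt cannot open it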

priscilla-glenn commented 1 year ago

Thanks for your help. I was able to mount the folder which held the actual barcode files (not the symlinks) and run cutadapt. It ran correctly and finished in 25 minutes, so I'm not sure why it's not working when I run the full pipeline. At this point I will try to redownload scATACpipe and recreate the config file.

hukai916 commented 1 year ago

Maybe also try a fresh start by not using the -resume flag.

priscilla-glenn commented 1 year ago

Hi, so I did a fresh start without using -resume and no luck.

Though I did notice, both the last time I tried to resume and this time with the fresh restart, that my swap memory was suddenly filling up; as soon as the swap filled, my core usage dropped and cutadapt started cycling. This only happens in the scATACpipe container, as I was able to run the cutadapt container by itself and to run cutadapt on my own outside of any container.

I have 32 cores and 130 GB of memory on my computer, and 488,981,686 read pairs in my files. Is this a system requirement issue, or is there a parameter I could adjust for use with the scATACpipe container?

Thank you

hukai916 commented 1 year ago

The resources on your computer should be more than sufficient for cutadapt. I don't quite understand why the container would not work with the pipeline. I have created a new container using the latest version of cutadapt (v4.4); do you mind changing the following line: https://github.com/hukai916/scATACpipe/blob/8e0048985ee2e8fc2a466065ec9d96b2de6c9517/modules/local/cutadapt.nf#L12 to container "hukai916/cutadapt:0.1" and trying again?

priscilla-glenn commented 1 year ago

Hi, I will try that and let you know how it goes.

Also, I tried running the pipeline on another computer in the lab with 60 cores and 200 GB memory. Again cutadapt failed, but this time it gave a different error:

"WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."

What's interesting is that when I watch my system monitor, my memory has not filled up; it's only the swap that suddenly fills, and it's my understanding that the computer shouldn't use swap until memory is full. So I don't know why it's filling the swap instead. Is there anything in the container settings about swap?

One idea I have as I rerun it is to lower the core count to give each core more memory. I'll keep you posted. Thanks for all your help with this.

priscilla-glenn commented 1 year ago

Good morning. I've just checked the run I started this weekend, and after adjusting the cutadapt module and running with fewer cores, I received an error executing GET_WHITELIST_BARCODE. I've attached the nextflow log and my config.

23May2023_nextflow.log

Mouse_config.txt

Caused by: Process SCATACPIPE:PREPROCESS_DEFAULT:GET_WHITELIST_BARCODE (1) terminated with an error exit status (1)

Command executed:

    # step1: concatenate all barcode reads or read chunks that belong to the sample_name:
    cat barcode_Sample_S1_*.fastq.gz > barcode_Sample_S1_combined.fastq.gz

    # step2: determine the whitelist barcode, use 10,000 reads for quick test:
    (zcat barcode_Sample_S1_combined.fastq.gz || true) | awk '{ print $0 } NR==40000 {exit}' | gzip > subset.fastq.gz
    # note "|| true" is to capture and skip the SIGPIPE error

    get_whitelist_barcode.py whitelist_barcodes subset.fastq.gz whitelist_Sample_S1

Command exit status: 1

Command output: (empty)

Command error:

    WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
    Traceback (most recent call last):
      File "/home/blackmonlab/Desktop/scATACpipe/scATACpipe/bin/get_whitelist_barcode.py", line 55, in <module>
        fractions = [get_fraction(index_fastq, barcode_file) for barcode_file in barcode_files]
      File "/home/blackmonlab/Desktop/scATACpipe/scATACpipe/bin/get_whitelist_barcode.py", line 55, in <listcomp>
        fractions = [get_fraction(index_fastq, barcode_file) for barcode_file in barcode_files]
      File "/home/blackmonlab/Desktop/scATACpipe/scATACpipe/bin/get_whitelist_barcode.py", line 27, in get_fraction
        with open(barcode_file) as f:
    IsADirectoryError: [Errno 21] Is a directory: 'whitelist_barcodes/.nextflow'

Work dir: /home/blackmonlab/Desktop/scATACpipe/scATACpipe/work/0e/ff0d1341e03ec0e39bf50fe9235b8c

hukai916 commented 1 year ago

Can you try removing the whitelist_barcodes/.nextflow folder? The whitelist_barcodes folder is supposed to contain only whitelist barcode files, not any hidden folders.
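For example (adjust the path to wherever your whitelist_barcode parameter points):

    rm -rf /home/blackmonlab/Desktop/scATACpipe/scATACpipe/assets/whitelist_barcodes/.nextflow
    ls -a /home/blackmonlab/Desktop/scATACpipe/scATACpipe/assets/whitelist_barcodes  # only barcode files should remain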


priscilla-glenn commented 1 year ago

Okay, so cutadapt actually finished! But then of course a new error had to pop up, this time with bwa mem: "[e7/7d7c3b] NOTE: Process SCATACPIPE:PREPROCESS_DEFAULT:BWA_MAP (1) terminated with an error exit status (1) -- Execution is retried (2)"

I checked the .command.err and .command.log and both say:

"WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap. [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 4714624 sequences (240000007 bp)... [M::process] read 4714608 sequences (240000072 bp)..." samtools sort: failed to read header from "-"

and the .command.sh file says:

    #!/bin/bash -euo pipefail
    filename=$(basename bwa_index/*.bwt)
    index_name="${filename%.*}"

    sample_name=R1_Sample_S1_1.barcoded.trimmed.fastq.gz
    outname="${sample_name%%.*}"
    outname="${outname#R1_}"

    bwa mem -t 24 bwa_index/$index_name R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz | samtools sort -@ 24 -m 715827882 -O bam -o ${outname}.sorted.bam

hukai916 commented 1 year ago

Sorry for the late reply, I am back to work now. It seems that your bwa_index folder is empty. By default, if bwa index files are not supplied, scATACpipe will build them from the supplied genome fasta file. Can you check the SCATACPIPE:PREPROCESS_DEFAULT:BWA_INDEX output first? Let me know if you have any questions.
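A quick sanity check on the index (sketch; run in the BWA_INDEX work directory, or wherever the bwa_index folder was staged):

    ls -l bwa_index/             # expect non-empty .amb .ann .bwt .pac .sa files
    filename=$(basename bwa_index/*.bwt)
    index_name="${filename%.*}"
    echo "$index_name"           # the prefix that will be passed to bwa mem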

priscilla-glenn commented 1 year ago

Hi, welcome back. I hope you had a great vacation. I tried rerunning the pipeline without supplying any index files so everything would be the default version. I've included a lot below to try to figure out why it's not working. Let me know if you need any other files.

First, here is my config file for this run so it should have created everything itself.

    // Save the following configurations into custom.config
    params {
        preprocess = 'default'
        input_fastq = '/home/blackmonlab/Desktop/scATACpipe/scATACpipe/sample_sheet_test_data1.csv'
        outdir = './MouseSC3'
        species_latin_name = 'Mus musculus'
        ref_fasta_ensembl = 'mus_musculus'
        split_fastq = false
        barcode_correction = 'pheniqs'
        whitelist_barcode = '/home/blackmonlab/Desktop/scATACpipe/scATACpipe/assets/whitelist_barcodes'
        filter = 'both'
        doublet_removal_algorithm = 'archr'
        archr_thread = 8
        archr_blacklist = false
        archr_batch_correction_harmony = true
        filter_sample = false
        filter_seurat_ilsi = false
        custom_peaks = false
        archr_scrnaseq = false
        archr_scrnaseq_grouplist = false
        max_memory = '200.GB'
        max_cpus = 32
        max_time = '240.h'
    }

// Use this command: sudo nextflow run main.nf -c Mouse_custom.config -profile docker,local.

But here is the error again:


Caused by:
  Process `SCATACPIPE:PREPROCESS_DEFAULT:BWA_MAP (1)` terminated with an error exit status (1)

Command executed:

  filename=$(basename bwa_index/*.bwt)
  index_name="${filename%.*}"

  sample_name=R1_Sample_S1_1.barcoded.trimmed.fastq.gz
  outname="${sample_name%%.*}"
  outname="${outname#R1_}"

  bwa mem  -t 32 bwa_index/$index_name R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz | samtools sort -@ 32 -m 805306368 -O bam -o ${outname}.sorted.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  [M::bwa_idx_load_from_disk] read 0 ALT contigs
  [M::process] read 6286194 sequences (320000070 bp)...
  [M::process] read 6286100 sequences (320000029 bp)...
  samtools sort: failed to read header from "-"

Work dir:
  /home/blackmonlab/Desktop/scATACpipe/scATACpipe/work/8d/b51980964f520ea4ee9555e00a3dfc

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

When I look at the output for BWA_INDEX, the command log seems to indicate it worked:

WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
[bwa_index] Pack FASTA... 28.16 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=5456444902, availableWord=395935328
[BWTIncConstructFromPacked] 10 iterations done. 99999990 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999990 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999990 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999990 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999990 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999990 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999990 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999990 characters processed.
[BWTIncConstructFromPacked] 90 iterations done. 899999990 characters processed.
[BWTIncConstructFromPacked] 100 iterations done. 999999990 characters processed.
[BWTIncConstructFromPacked] 110 iterations done. 1099999990 characters processed.
[BWTIncConstructFromPacked] 120 iterations done. 1199999990 characters processed.
[BWTIncConstructFromPacked] 130 iterations done. 1299999990 characters processed.
[BWTIncConstructFromPacked] 140 iterations done. 1399999990 characters processed.
[BWTIncConstructFromPacked] 150 iterations done. 1499999990 characters processed.
[BWTIncConstructFromPacked] 160 iterations done. 1599999990 characters processed.
[BWTIncConstructFromPacked] 170 iterations done. 1699999990 characters processed.
[BWTIncConstructFromPacked] 180 iterations done. 1799999990 characters processed.
[BWTIncConstructFromPacked] 190 iterations done. 1899999990 characters processed.
[BWTIncConstructFromPacked] 200 iterations done. 1999999990 characters processed.
[BWTIncConstructFromPacked] 210 iterations done. 2099999990 characters processed.
[BWTIncConstructFromPacked] 220 iterations done. 2199999990 characters processed.
[BWTIncConstructFromPacked] 230 iterations done. 2299999990 characters processed.
[BWTIncConstructFromPacked] 240 iterations done. 2399999990 characters processed.
[BWTIncConstructFromPacked] 250 iterations done. 2499999990 characters processed.
[BWTIncConstructFromPacked] 260 iterations done. 2599999990 characters processed.
[BWTIncConstructFromPacked] 270 iterations done. 2699999990 characters processed.
[BWTIncConstructFromPacked] 280 iterations done. 2799999990 characters processed.
[BWTIncConstructFromPacked] 290 iterations done. 2899999990 characters processed.
[BWTIncConstructFromPacked] 300 iterations done. 2999999990 characters processed.
[BWTIncConstructFromPacked] 310 iterations done. 3099999990 characters processed.
[BWTIncConstructFromPacked] 320 iterations done. 3199999990 characters processed.
[BWTIncConstructFromPacked] 330 iterations done. 3299999990 characters processed.
[BWTIncConstructFromPacked] 340 iterations done. 3399999990 characters processed.
[BWTIncConstructFromPacked] 350 iterations done. 3499999990 characters processed.
[BWTIncConstructFromPacked] 360 iterations done. 3599999990 characters processed.
[BWTIncConstructFromPacked] 370 iterations done. 3699999990 characters processed.
[BWTIncConstructFromPacked] 380 iterations done. 3799999990 characters processed.
[BWTIncConstructFromPacked] 390 iterations done. 3899999990 characters processed.
[BWTIncConstructFromPacked] 400 iterations done. 3999999990 characters processed.
[BWTIncConstructFromPacked] 410 iterations done. 4099999990 characters processed.
[BWTIncConstructFromPacked] 420 iterations done. 4199999990 characters processed.
[BWTIncConstructFromPacked] 430 iterations done. 4299999990 characters processed.
[BWTIncConstructFromPacked] 440 iterations done. 4399999990 characters processed.
[BWTIncConstructFromPacked] 450 iterations done. 4499999990 characters processed.
[BWTIncConstructFromPacked] 460 iterations done. 4599999990 characters processed.
[BWTIncConstructFromPacked] 470 iterations done. 4699999990 characters processed.
[BWTIncConstructFromPacked] 480 iterations done. 4799789782 characters processed.
[BWTIncConstructFromPacked] 490 iterations done. 4892030822 characters processed.
[BWTIncConstructFromPacked] 500 iterations done. 4974010886 characters processed.
[BWTIncConstructFromPacked] 510 iterations done. 5046870966 characters processed.
[BWTIncConstructFromPacked] 520 iterations done. 5111625190 characters processed.
[BWTIncConstructFromPacked] 530 iterations done. 5169174934 characters processed.
[BWTIncConstructFromPacked] 540 iterations done. 5220321238 characters processed.
[BWTIncConstructFromPacked] 550 iterations done. 5265776182 characters processed.
[BWTIncConstructFromPacked] 560 iterations done. 5306172630 characters processed.
[BWTIncConstructFromPacked] 570 iterations done. 5342073078 characters processed.
[BWTIncConstructFromPacked] 580 iterations done. 5373977430 characters processed.
[BWTIncConstructFromPacked] 590 iterations done. 5402330102 characters processed.
[BWTIncConstructFromPacked] 600 iterations done. 5427525990 characters processed.
[BWTIncConstructFromPacked] 610 iterations done. 5449916134 characters processed.
[bwt_gen] Finished constructing BWT in 614 iterations.
[bwa_index] 1742.02 seconds elapse.
[bwa_index] Update BWT... 25.69 sec
[bwa_index] Pack forward-only FASTA... 22.23 sec
[bwa_index] Construct SA from BWT and Occ... 781.42 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw bwa_index/Genome.primary.chrPrefixed.fa.gz
[main] Real time: 2638.710 sec; CPU: 2599.526 sec

And in the 6c work folder, there is a bwa_index folder with files in it:

-rw-r--r-- 1 root root       5961 May 31 11:34 Genome.primary.chrPrefixed.fa.gz.amb
-rw-r--r-- 1 root root       2478 May 31 11:34 Genome.primary.chrPrefixed.fa.gz.ann
-rw-r--r-- 1 root root 2728222532 May 31 11:33 Genome.primary.chrPrefixed.fa.gz.bwt
-rw-r--r-- 1 root root  682055614 May 31 11:34 Genome.primary.chrPrefixed.fa.gz.pac
-rw-r--r-- 1 root root 1364111280 May 31 11:47 Genome.primary.chrPrefixed.fa.gz.sa

And in the cutadapt c6 folder, there are files as well.

    log_cutadapt_Sample_S1_1.txt
    R1_Sample_S1_1.barcoded.trimmed.fastq.gz
    R2_Sample_S1_1.barcoded.trimmed.fastq.gz
    R1_Sample_S1_1.barcoded.fastq.gz
    R2_Sample_S1_1.barcoded.fastq.gz

and all seem to have something in them.

Here is the beginning of the R2 sample file:

    @NTAGCTCGAGATTGGG:VH00580:38:AACMLYGM5:1:1101:65437:1284 3:N:0:TTTGGACG
    GTCCTGGCTTGTTTTATGTCAACTTGACACAAGCTAGAGTCATCTTAGAGG
    +
    CCCCCCCCCCC;CCCCCCC-C;CCCCCCCCCCCCCCC-CCCCCCCCCCCCC
    @NTACCGGACGGATGGG:VH00580:38:AACMLYGM5:1:1101:65475:1284 3:N:0:TTTGGACG
    TGGTTAAACTAGGGAATGATTTTGAATAGAGCACAATAGCATTAAGAGTCG
    +
    CCCC;CC;C-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CC
    @NGAGAGATGGCTTATC:VH00580:38:AACMLYGM5:1:1101:65627:1284 3:N:0:TTTGGACG
    GTTAAGCAAAGTGTACTTTGAGGTATCATCGTTAAAAAAATAAAATTGATG

Thoughts? And thank you!

hukai916 commented 1 year ago

Seems that the bwa mem step is to blame. I suggest the following debugging steps.

  1. go to the working dir: cd /home/blackmonlab/Desktop/scATACpipe/scATACpipe/work/8d/b51980964f520ea4ee9555e00a3dfc
  2. activate the corresponding docker container (should be hukai916/bwa_xenial:0.1)
  3. the actual commands being executed by this module are (.command.sh):

    filename=$(basename bwa_index/*.bwt)
    index_name="${filename%.*}"
    
    sample_name=R1_Sample_S1_1.barcoded.trimmed.fastq.gz
    outname="${sample_name%%.*}"
    outname="${outname#R1_}"
    
    bwa mem  -t 32 bwa_index/$index_name R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz | samtools sort -@ 32 -m 805306368 -O bam -o ${outname}.sorted.bam

    Try running each command one at a time; make sure index_name is correct and the index files are actually there. Let me know.
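If those all look right, it can also help to split the pipe, since "samtools sort: failed to read header from \"-\"" usually just means bwa mem wrote nothing to stdout (untested sketch):

    # run the two halves separately to see which one fails
    bwa mem -t 32 bwa_index/$index_name \
        R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz > test.sam
    samtools sort -@ 32 -m 805306368 -O bam -o test.sorted.bam test.sam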

priscilla-glenn commented 1 year ago

Thank you. I am actually about to head on vacation now myself for a week so I won't be able to test this out until I get back.

priscilla-glenn commented 1 year ago

Hi, I'm back in the lab and working on debugging this. When I mount my files to the container, it is unable to use files that are symlinks pointing elsewhere and gives the same file-not-found error as before. So, to get around this, I went to the original bwa_index folder at 6c/9d5c273a42ec4f4369368036147c95.

I ran the naming code and everything was correct.

I then went to the folders with the actual sample files at c6/518fd88378bc056f65ec7bf026a2be and ran the following command:

bwa mem -t 32 ../../6c/9d5c273a42ec4f4369368036147c95/bwa_index/$index_name R1_Sample_S1_1.barcoded.trimmed.fastq.gz R2_Sample_S1_1.barcoded.trimmed.fastq.gz | samtools sort -@ 32 -m 805306368 -O bam -o ${outname}.sorted.bam

And it ran! I have an output file called "Sample_S1_1.sorted.bam" in my current directory: c6/518fd88378bc056f65ec7bf026a2be

Given this, I'm not sure why, when running through the pipeline's own container, it was unable to find the index files. Thoughts?

priscilla-glenn commented 1 year ago

Hi, quick question: I'm going through the pipeline manually now, and I'm wondering how the adapter used in cutadapt was chosen?

Thanks!

hukai916 commented 1 year ago

Hi, the adapters used for cutadapt default to the first 13 bp of the standard Illumina adapter. They can be configured to use other sequences here: https://github.com/hukai916/scATACpipe/blob/8e0048985ee2e8fc2a466065ec9d96b2de6c9517/nextflow.config#L31-L32
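If you are trimming manually, you can also just pass different sequences on the command line; for example, with the Nextera/Tn5 adapter (whether that matches your library is an assumption you should verify):

    cutadapt --cores=8 -a CTGTCTCTTATACACATCT -A CTGTCTCTTATACACATCT \
        -o R1_Sample_S1_1.barcoded.trimmed.fastq.gz -p R2_Sample_S1_1.barcoded.trimmed.fastq.gz \
        R1_Sample_S1_1.barcoded.fastq.gz R2_Sample_S1_1.barcoded.fastq.gz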

priscilla-glenn commented 1 year ago

Thanks hukai916. I'm progressing through the steps well. I'm now at the DEDUP_BAM module and am curious what I should use as the $barcode_tag?

hukai916 commented 1 year ago

In scATACpipe, DEDUP_BAM is called twice. The first run is a rough de-duplication without cell barcode correction; the aim is to find the inflection point used to determine the number of valid cells. The second run deduplicates using the corrected cell barcodes saved in the "CB" tag.

Therefore, for the first run, barcode_tag is set to "N/A", while for the second run it is set to "CB". These are set automatically when you run the pipeline itself rather than stepping through it manually. For your convenience, the corresponding lines are: First run: https://github.com/hukai916/scATACpipe/blob/8e0048985ee2e8fc2a466065ec9d96b2de6c9517/workflows/preprocess_default.nf#L159 Second run: https://github.com/hukai916/scATACpipe/blob/8e0048985ee2e8fc2a466065ec9d96b2de6c9517/workflows/preprocess_default.nf#L229
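If you are running the module by hand, the two passes translate to something like this (sketch; the bam file names are placeholders):

    # first pass: rough dedup, before cell barcode correction
    python3 remove_duplicate.py --inbam sample.sorted.bam --barcode_tag "N/A" \
        --outdir ./ --outbam sample.dedup.bam --nproc 8
    # second pass: dedup using the corrected cell barcodes stored in the CB tag
    python3 remove_duplicate.py --inbam sample.corrected.bam --barcode_tag "CB" \
        --outdir ./ --outbam sample.dedup.cb.bam --nproc 8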

It is still mysterious to me why you cannot run the pipeline automatically in your environment.

priscilla-glenn commented 1 year ago

Thank you! I'm not sure either, but going through the pipeline step by step has been good for me to learn the process at least. It also means I can more easily go back and adjust any mapping steps if needed. I'll probably try to rerun the full pipeline in the near future, but for now I'll keep doing this.

As for the remove_dup step: I tried running it and ran into the error below. Do you know which version of sinto you used?

python3 ../../../remove_duplicate.py --inbam mm39.filtered.bam --barcode_tag "N/A" --outdir ./ --outbam mm39.dedup.bam --nproc 75
Traceback (most recent call last):
  File "/home/blackmonlab/Desktop/Priscilla/SingleCell/mm39/raw_data/Fastqs/../../../remove_duplicate.py", line 271, in <module>
    intervals = utils.chunk_bam_by_chr(bam_temp, nproc)
                ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'sinto.utils' has no attribute 'chunk_bam_by_chr'

priscilla-glenn commented 12 months ago

Hi, when I investigated sinto.utils (https://github.com/timoast/sinto/blob/master/sinto/utils.py), I found chunk_bam but nothing called chunk_bam_by_chr. Was this in an older version? Did you create it? Is there a way I can use this segment of the pipeline and then export the results?

I tried running chunk_bam itself and it froze my computer while creating empty temporary files whose names sometimes started on one chromosome and ended on another. So I'm not exactly sure what happened, but I'm digging into it.

Thanks for all your help.

hukai916 commented 11 months ago

Hi,

The sinto used here is a customized version. When you run the pipeline, the docker container ("hukai916/sinto_kai:0.3") that harbors this customized sinto is downloaded automatically. You can find the script here: https://github.com/hukai916/sinto/tree/master/sinto

The function chunk_bam_by_chr is used for removing duplicates (the DEDUP_BAM module). The chunk_bam function implemented in the original sinto package may group regions from multiple chromosomes into the same chunk, which is not compatible with our multi-processing strategy; therefore, we implemented chunk_bam_by_chr so that each chunk comes from a single chromosome.
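Rather than vendoring the folder next to the script, it may be cleaner to install the customized package into your environment, so that remove_duplicate.py's own sinto import resolves to it (untested sketch):

    git clone https://github.com/hukai916/sinto.git
    cd sinto
    pip install .
    # verify the customized function is the one being picked up:
    python3 -c "from sinto import utils; print(hasattr(utils, 'chunk_bam_by_chr'))"  # should print True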

priscilla-glenn commented 11 months ago

Hi, I downloaded the sinto folder and put it in the same folder as the remove_duplicate.py script. I called it sinto2 and updated the script to import utils from sinto2.

I then tried running the script as shown below, but am getting weird results, with chunks running from chromosomes into scaffolds and such. Any ideas why this would be, or how I should install your version of sinto?

The __init__.py script is still finding the metadata for "sinto".


(scATACpipe) blackmonlab@blackmonlab-M-CLASS:~/Desktop/Priscilla/SingleCell/mm39/raw_data/Fastqs$ python3 ../../../remove_duplicate.py --inbam mm39.filtered.bam --barcode_tag "N/A" --outdir ./ --outbam mm39.dedup.bam --nproc 50
Processing  chunk_chr1_0.0_to_chrJH584295.1_39.52  ...
Processing  chunk_chr1_3903085.58_to_chrJH584295.1_79.04  ...
Processing  chunk_chr1_7806171.16_to_chrJH584295.1_118.56  ...
Processing  chunk_chr1_11709256.74_to_chrJH584295.1_158.08  ...
Processing  chunk_chr1_15612342.32_to_chrJH584295.1_197.60000000000002  ...

hukai916 commented 11 months ago

Which sinto folder did you download? I assume you are running the script locally; if so, do you have 50 cores (nproc = 50)? Can you share the input bam so that I can try to replicate the issue? Thanks.

priscilla-glenn commented 11 months ago

Hi, I git cloned sinto from the link you shared and then renamed the sinto folder within it to sinto2, which I copied to my working directory. I then downloaded the remove_duplicate.py script into the same working directory and updated it to import sinto2. My computer has up to 100 CPUs available.

Here is a link to the original bam file and the filtered bam I created. https://drive.google.com/file/d/1lc0nscqrh3zWaCfkXUxTpewBKfjPJTj7/view?usp=sharing

priscilla-glenn commented 11 months ago

Hi, I just wanted to follow up and check if you were able to download the bam file correctly?

Here is a workflow document I have for what I've been doing. https://docs.google.com/document/d/1a6F1WyY4PI8PJtYyo2VBMqJt_Z_mxKELRK8wbBFavDk/edit?usp=sharing

hukai916 commented 11 months ago

Can you grant me access? Thanks!

priscilla-glenn commented 11 months ago

Shared!

hukai916 commented 11 months ago

Thanks! I confirm that I have access to both files now, let me look into it and let you know.

hukai916 commented 11 months ago

The bam files are too large to download; can you try downsampling to 10% and sharing the link? You can use samtools view -b -s 0.1 foo.bam > sampled.bam to randomly downsample 10% of the BAM. Thanks.
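Note the -s argument is SEED.FRACTION, so you can also fix the random seed for reproducibility; e.g. seed 42 keeping ~10%, then indexing the result:

    samtools view -b -s 42.1 foo.bam > sampled.bam
    samtools index sampled.bam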

priscilla-glenn commented 11 months ago

You should be receiving a notice from google with the new shared files. Thanks!

hukai916 commented 11 months ago

Hi @priscilla-glenn, I was able to download and downsample your original bam file to 5% for testing purposes. I can reproduce what you observed with the custom sinto (which you saved as sinto2). If you see info like below:

Processing  chunk_chr1_0.0_to_chrJH584295.1_329.3333333333333  ...
Processing  chunk_chr1_32525713.166666668_to_chrJH584295.1_658.6666666666666  ...
Processing  chunk_chr1_65051426.333333336_to_chrJH584295.1_988.0  ...
Processing  chunk_chr1_97577139.5_to_chrJH584295.1_1317.3333333333333  ...
Processing  chunk_chr1_130102852.66666667_to_chrJH584295.1_1646.6666666666665  ...
Processing  chunk_chr1_162628565.83333334_to_chrJH584295.1_1976.0  ...

It basically means that you are actually invoking the correct sinto version. The function chunk_bam_by_chr splits each chromosome into --nproc pieces and processes every ith piece from each chromosome on one processor. For example, if --nproc 10, each chromosome/scaffold is split into 10 pieces: processor 1 analyzes the first piece from each chromosome/scaffold; likewise, processor 2 analyzes the second piece from each chromosome/scaffold, and so on. Meanwhile, each parallel job prints information consisting of the start coordinate of the first chromosome piece and the end coordinate of the last piece that it is processing, which is what is shown above. Let me know if this is unclear.
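A toy illustration of that scheduling, just the arithmetic rather than the actual implementation:

    # with nproc=3, worker i takes the i-th piece of every chromosome/scaffold
    nproc=3
    declare -A chrom_len=( [chr1]=150 [chr2]=90 )
    for i in $(seq 1 $nproc); do
        pieces=""
        for c in chr1 chr2; do
            len=${chrom_len[$c]}
            start=$(( (i - 1) * len / nproc ))
            end=$(( i * len / nproc ))
            pieces+="$c[$start,$end) "
        done
        echo "worker $i: $pieces"
    done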

priscilla-glenn commented 11 months ago

Thank you very much. So that means the final output is still correct even though the printing in the terminal is as shown above? Or do I need to adjust the code in some way? Thank you so much for your help with this!

hukai916 commented 11 months ago

Correct; though the printing is somewhat confusing, the results should be right, no need to adjust. When the run finishes, you should see summary info like below.

Soft_clipped (reads):  631524
No corrected barcode (of all paired fragments):  0
Not properly mapped (fragments):  0
Total unique fragments:  20434638
Total duplicate fragments:  194386

priscilla-glenn commented 11 months ago

Thank you very much, it worked!

In the get_whitelist_barcode.py script, the docstring states: """ Given whitelist_barcode folder, index_fastq, outfile_prefix, select the correct whitelist with the highest overlapping fraction. If fraction less than 0.5, exits with error msg. """

Wouldn't you want the barcode_fastq instead of the index_fastq?

hukai916 commented 11 months ago

The index_fastq here refers to the fastq file containing the cell barcodes, also called the cell index.