seqscope / spatula

A C++ Tool for Spatial Transcriptomics
https://seqscope.github.io/spatula/
Apache License 2.0

Question on custom_demux_fastq #5

Closed chlee-tabin closed 5 months ago

chlee-tabin commented 5 months ago

First, minor issue.

The following documentation

https://github.com/seqscope/spatula/blob/main/docs/tools/custom_demux_fastq.md

renders somewhat differently, in both Firefox and Chrome, at https://seqscope.github.io/spatula/tools/custom_demux_fastq/


Second, when using the spatula custom_demux_fastq command, is it advised to include the Undetermined FASTQ files generated by the regular bcl2fastq pipeline?

hyunminkang commented 5 months ago

Thank you for the feedback. The formatting issue should now be resolved.

There are two expected use cases of spatula custom_demux_fastq:

(1) If you have access to the full BCL files, instead of demultiplexing individual samples, you may run bcl2fastq without demultiplexing while still creating FASTQ files for the index reads.

bcl2fastq -R ${bcldir} -o ${outdir} --create-fastq-for-index-reads

Then all reads will be written into the "Undetermined" FASTQ files. You can use these files as input to spatula custom_demux_fastq and demultiplex them with it instead.

(2) You may already have FASTQ files that were demultiplexed into individual samples. You typically do NOT need to run spatula custom_demux_fastq on FASTQ files that were successfully demultiplexed, as the results will look very similar to those of the default bcl2fastq pipeline.

However, if you have a substantial number of "Undetermined" reads remaining, you may want to use spatula custom_demux_fastq to further demultiplex those reads. Because Illumina's bcl2fastq pipeline typically performs demultiplexing in a conservative way, you may be able to rescue some of the reads with this tool.

Note that, if you have a very large number of samples demultiplexed in a single run, modifying the default parameters (e.g. using --min-diff 1 or --max-mismatch 1) may be necessary to achieve more sensible results.
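
To illustrate why these two parameters matter when many samples share a run, here is a minimal Python sketch of index matching. It is not spatula's actual implementation. It assumes that --max-mismatch caps the number of mismatching bases against the best-matching sample index, and that --min-diff is the minimum gap in mismatches between the best and second-best sample. The function names and the sample table are hypothetical.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length index sequences."""
    if len(a) != len(b):
        raise ValueError("index sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_sample(
    observed: str,
    samples: Dict[str, str],
    max_mismatch: int = 2,  # assumed meaning of --max-mismatch
    min_diff: int = 2,      # assumed meaning of --min-diff
) -> Optional[str]:
    """Return the matching sample name, or None if unmatched or ambiguous."""
    # Rank every sample by its distance to the observed index sequence.
    scored = sorted((hamming(observed, idx), name) for name, idx in samples.items())
    best_dist, best_name = scored[0]
    if best_dist > max_mismatch:
        return None  # too many mismatches even for the closest sample
    if len(scored) > 1 and scored[1][0] - best_dist < min_diff:
        return None  # runner-up is too close: treat the read as ambiguous
    return best_name


if __name__ == "__main__":
    # Two hypothetical sample indexes that differ at a single base.
    close = {"s1": "AAAAAAAA", "s2": "AAAAAAAT"}
    print(assign_sample("AAAAAAAA", close))              # ambiguous under min_diff=2
    print(assign_sample("AAAAAAAA", close, min_diff=1))  # resolved under min_diff=1
```

Under these assumptions, the more samples there are in a run, the more likely it is that two indexes sit close together, so a perfect match can still be rejected as ambiguous unless the minimum gap is lowered.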

I will modify the documentation to include this information.

chlee-tabin commented 5 months ago

Thank you. The run creates a Segfault.

/var/spool/slurmd/job34528088/slurm_script: line 28: 17109 Segmentation fault      ~/workspace/seqscope/spatula/bin/spatula custom-demux-fastq --R1 Data/Intensities/BaseCalls/Undetermined_S0_L001_R1_001.fastq.gz --R2 Data/Intensities/BaseCalls/Undetermined_S0_L001_R2_001.fastq.gz --I1 Data/Intensities/BaseCalls/Undetermined_S0_L001_I1_001.fastq.gz --I2 Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz --sample sample.tsv --out TEST1/

I assigned 10 cores. I wasn't sure about the memory requirement, so I tried 10G, then 20G. Is there an option to debug what is going on here?

hyunminkang commented 5 months ago

I don't think this uses a lot of memory. Having a segfault is not informative enough to figure out what the issue was. Do you have the log file?

chlee-tabin commented 5 months ago

Where can I find the log file? As for the STDERR output, here it is:

[/home/cl266/workspace/seqscope/spatula/bin/spatula custom-demux-fastq] -- Demultiplex FASTQ files based in a customized manner

 Copyright (c) 2022-2024 by Hyun Min Kang
 Licensed under the Apache License v2.0 http://www.apache.org/licenses/

Available Options

The following parameters are available. Ones with "[]" are in effect:
    Input options : --R1 [Data/Intensities/BaseCalls/Undetermined_S0_L001_R1_001.fastq.gz],
                    --R2 [Data/Intensities/BaseCalls/Undetermined_S0_L001_R2_001.fastq.gz],
                    --I1 [Data/Intensities/BaseCalls/Undetermined_S0_L001_I1_001.fastq.gz],
                    --I2 [Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz],
                    --sample [sample.tsv]
         Settings : --cmd [gzip -c], --consider-N-as-mismatch,
                    --max-mismatch [2], --min-diff [2]
   Output Options : --out [TEST1/], --suffix-R1 [.R1.fastq.gz],
                    --suffix-R2 [.R2.fastq.gz], --suffix-I1 [.I1.fastq.gz],
                    --suffix-I2 [.I2.fastq.gz], --ambiguous [ambiguous]

Run with --help for more detailed help messages of each argument.

NOTICE [2024/03/21 14:13:40] - Analysis started
NOTICE [2024/03/21 14:13:41] - Successfully opened Data/Intensities/BaseCalls/Undetermined_S0_L001_R1_001.fastq.gz
NOTICE [2024/03/21 14:13:41] - Successfully opened Data/Intensities/BaseCalls/Undetermined_S0_L001_R2_001.fastq.gz
NOTICE [2024/03/21 14:13:41] - Successfully opened Data/Intensities/BaseCalls/Undetermined_S0_L001_I1_001.fastq.gz
NOTICE [2024/03/21 14:13:41] - Successfully opened Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz
NOTICE [2024/03/21 14:13:41] - Reading the input FASTQ files
NOTICE [2024/03/21 14:13:41] - Successfully opened 16 pipes
/var/spool/slurmd/job34528721/slurm_script: line 28:  2228 Segmentation fault      ~/workspace/seqscope/spatula/bin/spatula custom-demux-fastq --R1 Data/Intensities/BaseCalls/Undetermined_S0_L001_R1_001.fastq.gz --R2 Data/Intensities/BaseCalls/Undetermined_S0_L001_R2_001.fastq.gz --I1 Data/Intensities/BaseCalls/Undetermined_S0_L001_I1_001.fastq.gz --I2 Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz --sample sample.tsv --out TEST1/
hyunminkang commented 5 months ago

I meant the output message you provided. It looks like the tool segfaults at the beginning, but it is unclear why.

I pushed an update to the "dev" branch to produce a little more output. Can you try that?

git pull
git checkout dev
cd build; make; cd
spatula custom-demux-fastq [options] --verbose-chunk 1

I would like to know whether the segfault happens before or after reading the input.

If you want to debug this yourself, you should be able to find which function triggers the error using gdb, provided you change CMakeLists.txt to allow debugging (e.g. remove -O3 and add -g in CMAKE_CXX_FLAGS).

You may also want to send the input files (perhaps the first 10,000 lines of each FASTQ file, plus the sample index file) if you want me to help debug your particular case.
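
A small excerpt like this could be produced with a short Python helper along the following lines. This is only a sketch: the function name and the output file names are hypothetical. Since a FASTQ record spans 4 lines, a line count that is a multiple of 4 (such as 10,000) keeps every record complete.

```python
import gzip
from itertools import islice


def head_gz(src: str, dst: str, n_lines: int = 10_000) -> int:
    """Copy the first n_lines lines of a gzipped text file into a new gzipped file.

    Returns the number of lines actually written, which is smaller than
    n_lines if the input is shorter.
    """
    written = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in islice(fin, n_lines):
            fout.write(line)
            written += 1
    return written


if __name__ == "__main__":
    # Hypothetical output names; the input names follow the ones in this thread.
    for read in ("R1", "R2", "I1", "I2"):
        src = f"Data/Intensities/BaseCalls/Undetermined_S0_L001_{read}_001.fastq.gz"
        head_gz(src, f"subset_{read}.fastq.gz")
```

Applying the same line count to R1, R2, I1, and I2 keeps the reads in sync across the four excerpt files.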

chlee-tabin commented 5 months ago
~/workspace/seqscope/spatula/bin/spatula custom-demux-fastq \
>   --R1 Data/Intensities/BaseCalls/Undetermined_S0_L001_R1_001.fastq.gz \
>   --R2 Data/Intensities/BaseCalls/Undetermined_S0_L001_R2_001.fastq.gz \
>   --I1 Data/Intensities/BaseCalls/Undetermined_S0_L001_I1_001.fastq.gz \
>   --I2 Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz \
>   --sample sample.tsv \
>   --out TEST1/ --verbose-chunk 1
[/home/cl266/workspace/seqscope/spatula/bin/spatula custom-demux-fastq] -- Demultiplex FASTQ files based in a customized manner

 Copyright (c) 2022-2024 by Hyun Min Kang
 Licensed under the Apache License v2.0 http://www.apache.org/licenses/

Available Options

The following parameters are available. Ones with "[]" are in effect:
    Input options : --R1 [Data/Intensities/BaseCalls/Undetermined_S0_L001_R1_001.fastq.gz],
                    --R2 [Data/Intensities/BaseCalls/Undetermined_S0_L001_R2_001.fastq.gz],
                    --I1 [Data/Intensities/BaseCalls/Undetermined_S0_L001_I1_001.fastq.gz],
                    --I2 [Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz],
                    --sample [sample.tsv]
         Settings : --cmd [gzip -c], --consider-N-as-mismatch,
                    --max-mismatch [2], --min-diff [2], --verbose-chunk [1]
   Output Options : --out [TEST1/], --suffix-R1 [.R1.fastq.gz],
                    --suffix-R2 [.R2.fastq.gz], --suffix-I1 [.I1.fastq.gz],
                    --suffix-I2 [.I2.fastq.gz], --ambiguous [ambiguous]

Run with --help for more detailed help messages of each argument.

NOTICE [2024/03/21 14:41:51] - Analysis started
NOTICE [2024/03/21 14:41:51] - Successfully opened Data/Intensities/BaseCalls/Undetermined_S0_L001_R1_001.fastq.gz
NOTICE [2024/03/21 14:41:51] - Successfully opened Data/Intensities/BaseCalls/Undetermined_S0_L001_R2_001.fastq.gz
NOTICE [2024/03/21 14:41:51] - Successfully opened Data/Intensities/BaseCalls/Undetermined_S0_L001_I1_001.fastq.gz
NOTICE [2024/03/21 14:41:51] - Successfully opened Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz
NOTICE [2024/03/21 14:41:51] - Reading the input FASTQ files
NOTICE [2024/03/21 14:41:51] - Opening the output files for writing
NOTICE [2024/03/21 14:41:51] - Successfully opened 16 pipes
NOTICE [2024/03/21 14:41:51] - Started reading the input FASTQ files...
Segmentation fault

It looks like it segfaults immediately after it starts reading the input FASTQ files.

hyunminkang commented 5 months ago

@chlee-tabin The sample.tsv file has to have a header line, and a missing header causes a silent error. I will update the documentation later to make this clearer, but I just pushed an update that prints an error message when the expected headers are not present.
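
For older builds without that check, a quick pre-flight test can flag a sample sheet whose first line is probably data rather than a header. The Python sketch below is a heuristic, not part of spatula: it assumes the file is tab-separated and only asks whether any field on the first line looks like an index sequence. It does not know or verify the header names that spatula expects, so those should be taken from the tool's documentation. The function names are hypothetical.

```python
import re

# A field made only of A/C/G/T/N and at least 6 characters long is treated as
# "index-like". The length threshold is an arbitrary assumption.
_INDEX_LIKE = re.compile(r"^[ACGTN]{6,}$", re.IGNORECASE)


def first_line_looks_like_data(first_line: str) -> bool:
    """Return True if any tab-separated field looks like an index sequence."""
    fields = first_line.rstrip("\r\n").split("\t")
    return any(_INDEX_LIKE.match(field.strip()) for field in fields)


def check_sample_sheet(path: str) -> None:
    """Raise ValueError if the sample sheet appears to be missing its header."""
    with open(path, "r") as handle:
        first_line = handle.readline()
    if first_line_looks_like_data(first_line):
        raise ValueError(
            f"{path}: the first line contains an index-like sequence; "
            "a header line appears to be missing"
        )
```

Calling check_sample_sheet("sample.tsv") before submitting the job would turn this kind of silent failure into an explicit error.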

chlee-tabin commented 5 months ago

Thank you, that solved this issue!

chlee-tabin commented 5 months ago

P.S. https://seqscope.github.io/spatula/tools/reformat_fastqs/ also has the same formatting issues as custom... (which has been fixed).

hyunminkang commented 5 months ago

Thank you for the report @chlee-tabin. This is now fixed, too.