Error in running BLAZE through virtualenv

anshulbudhraja commented 8 months ago

Hi! I am trying to run BLAZE, installed in a virtualenv, on a cluster where I submit jobs through SLURM. The error I'm facing is:

(02/03/2024 12:22:14) Getting putative barcodes from 1 FASTQ files...
Processed: 93168000Read [31:14, 48666.91Read/s]Traceback (most recent call last):
  File "/home/anshul1/virtual_envs/blaze/lib/python3.11/site-packages/blaze/helper.py", line 285, in fastq_parser
    next(file_handle) # skip  '+'
    ^^^^^^^^^^^^^^^^^
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/anshul1/virtual_envs/blaze/bin/blaze", line 8, in <module>
    sys.exit(_pipeline())
             ^^^^^^^^^^^
  File "/home/anshul1/virtual_envs/blaze/lib/python3.11/site-packages/blaze/main.py", line 613, in main
    for idx, f in enumerate(rst_futures):
  File "/home/anshul1/virtual_envs/blaze/lib/python3.11/site-packages/blaze/helper.py", line 192, in multiprocessing_submit
    i = next(iterator, None)
        ^^^^^^^^^^^^^^^^^^^^
  File "/home/anshul1/virtual_envs/blaze/lib/python3.11/site-packages/blaze/main.py", line 575, in read_batch_generator
    for batch in read_batch:
  File "/home/anshul1/virtual_envs/blaze/lib/python3.11/site-packages/blaze/helper.py", line 266, in batch_iterator
    for entry in iterator:
RuntimeError: generator raised StopIteration
Processed: 93171000Read [31:15, 49690.74Read/s]

However, blaze --help runs fine in the virtualenv:

(blaze) [anshul1@narval4 BLAZE_10H174]$ blaze --help

Description:
    BLAZE2 is a tool for demultiplexing 10X single cell long-read RNA-seq data.
    It takes fastq files as input and output a whitelist of barcodes and a fastq
    with demultiplexed reads.

Usage: blaze  --expect-cells <INT> [OPTIONS] <fastq directory>

Required argument:
    One of the following two options is required unless whitelisting step is turned off:
        --expect-cells <INT>
                Expected number of cells.
        --count-threshold <INT>
                Count threshold of high-quality putative barcodes used to determine the whitelist.
    Note that the --count-threshold option is ignored if --expect-cells is specified.

Options:
    -h, --help
        Print this help message.

    --output-prefix <prefix>
        Filename of output files. Default: --output-prefix
...

The total number of reads in the input file is 108,842,294.

My command was:

Dir="/home/anshul1/scratch/single_cell/BLAZE_10H174"
InDir="/home/anshul1/scratch/single_cell"
sample="10H174"

# using virtualenv
source /home/anshul1/virtual_envs/blaze/bin/activate

## cd to required Dir ; allows launching code from anywhere
mkdir ${Dir}/${sample}_out
outDir="${Dir}/${sample}_out"
cd $outDir

# search for putative barcode in each read and obtain the whitelist
blaze --expect-cells=7000 --output-prefix blazeOut10H174 --threads=24  ${InDir}/10H174_pass_final.fastq

My virtualenv contains the following packages:

(blaze) [anshul1@narval4 BLAZE_10H174]$ pip list
Package                            Version
---------------------------------- -------------------------
anyio                              3.7.1+computecanada
arff                               0.9+computecanada
argon2_cffi                        23.1.0+computecanada
argon2_cffi_bindings               21.2.0+computecanada
asttokens                          2.2.1+computecanada
async_generator                    1.10+computecanada
attrs                              23.1.0+computecanada
backcall                           0.2.0+computecanada
backports-abc                      0.5+computecanada
backports.shutil_get_terminal_size 1.0.0+computecanada
bcrypt                             4.0.1+computecanada
beautifulsoup4                     4.12.2+computecanada
bitarray                           2.8.1+computecanada
bitstring                          4.1.1+computecanada
blaze2                             2.1.4
bleach                             6.0.0+computecanada
certifi                            2023.7.22+computecanada
cffi                               1.15.1+computecanada
chardet                            5.2.0+computecanada
charset_normalizer                 3.2.0+computecanada
comm                               0.1.4+computecanada
contourpy                          1.1.0+computecanada
cryptography                       39.0.1+computecanada
cycler                             0.11.0+computecanada
Cython                             0.29.36+computecanada
deap                               1.4.1+computecanada
debugpy                            1.6.7.post1+computecanada
decorator                          5.1.1+computecanada
defusedxml                         0.7.1+computecanada
dnspython                          2.4.2+computecanada
ecdsa                              0.18.0+computecanada
entrypoints                        0.4+computecanada
executing                          1.2.0+computecanada
fast_edit_distance                 1.2.1+computecanada
fastjsonschema                     2.18.0+computecanada
fonttools                          4.42.1+computecanada
funcsigs                           1.0.2+computecanada
idna                               3.4+computecanada
importlib_metadata                 6.8.0+computecanada
importlib_resources                6.0.1+computecanada
ipykernel                          6.25.1+computecanada
ipython                            8.15.0+computecanada
ipython_genutils                   0.2.0+computecanada
jedi                               0.19.0+computecanada
Jinja2                             3.1.2+computecanada
jsonschema                         4.19.0+computecanada
jsonschema_specifications          2023.7.1+computecanada
jupyter_client                     8.3.1+computecanada
jupyter_core                       5.3.1+computecanada
kiwisolver                         1.4.5+computecanada
lockfile                           0.12.2+computecanada
MarkupSafe                         2.1.3+computecanada
matplotlib                         3.7.2+computecanada
matplotlib_inline                  0.1.6+computecanada
mistune                            3.0.1+computecanada
mock                               5.1.0+computecanada
mpmath                             1.3.0+computecanada
nest_asyncio                       1.5.7+computecanada
netaddr                            0.8.0+computecanada
netifaces                          0.11.0+computecanada
nose                               1.3.7+computecanada
numpy                              1.25.2+computecanada
packaging                          23.1+computecanada
pandas                             2.1.0+computecanada
pandocfilters                      1.5.0+computecanada
paramiko                           3.3.1+computecanada
parso                              0.8.3+computecanada
path                               16.7.1+computecanada
path.py                            12.5.0+computecanada
pathlib2                           2.3.7.post1+computecanada
paycheck                           1.0.2+computecanada
pbr                                5.11.1+computecanada
pexpect                            4.8.0+computecanada
pickleshare                        0.7.5+computecanada
Pillow                             10.0.0+computecanada
pip                                24.0+computecanada
pkgutil_resolve_name               1.3.10+computecanada
platformdirs                       3.9.1+computecanada
prometheus_client                  0.17.1+computecanada
prompt_toolkit                     3.0.39+computecanada
psutil                             5.9.5+computecanada
ptyprocess                         0.7.0+computecanada
pure_eval                          0.2.2+computecanada
pycparser                          2.21+computecanada
Pygments                           2.16.1+computecanada
PyNaCl                             1.5.0+computecanada
pyparsing                          3.0.9+computecanada
pyrsistent                         0.19.3+computecanada
python-dateutil                    2.8.2+computecanada
python_json_logger                 2.0.7+computecanada
pytz                               2023.3+computecanada
PyYAML                             6.0.1+computecanada
pyzmq                              25.1.1+computecanada
referencing                        0.30.2+computecanada
requests                           2.31.0+computecanada
rfc3339_validator                  0.1.4+computecanada
rfc3986_validator                  0.1.1+computecanada
rpds_py                            0.10.0+computecanada
scipy                              1.11.2+computecanada
Send2Trash                         1.8.2+computecanada
setuptools                         69.0.3
simplegeneric                      0.8.1+computecanada
singledispatch                     4.1.0+computecanada
six                                1.16.0+computecanada
sniffio                            1.3.0+computecanada
soupsieve                          2.4.1+computecanada
stack_data                         0.6.2+computecanada
sympy                              1.12+computecanada
terminado                          0.17.1+computecanada
testpath                           0.6.0+computecanada
tinycss2                           1.2.1+computecanada
tornado                            6.3.3+computecanada
tqdm                               4.66.2+computecanada
traitlets                          5.9.0+computecanada
typing_extensions                  4.7.1+computecanada
tzdata                             2023.3+computecanada
urllib3                            2.0.4+computecanada
wcwidth                            0.2.6+computecanada
webencodings                       0.5.1+computecanada
websocket_client                   1.6.2+computecanada
wheel                              0.41.3
zipp                               3.16.2+computecanada

Edit: I have successfully run the test (from the BLAZE/test folder) using the blaze command in the virtualenv. I launched that job through SLURM in the same way as the failing one, so the same setup succeeds on the test data but fails on my file.

(blaze) [anshul1@narval4 test]$ ls -lh test_out/
total 100M
-rw-r-----. 1 anshul1 anshul1  34K Mar  4 13:24 test_emtpy_bc_list.csv
-rw-r-----. 1 anshul1 anshul1  31K Mar  4 13:24 test_knee_plot.png
-rw-r-----. 1 anshul1 anshul1  96M Mar  4 13:24 test_matched_reads.fastq.gz
-rw-r-----. 1 anshul1 anshul1 9.2M Mar  4 13:24 test_putative_bc.csv
-rw-r-----. 1 anshul1 anshul1  759 Mar  4 13:24 test_summary.txt
-rw-r-----. 1 anshul1 anshul1  14K Mar  4 13:24 test_whitelist.csv
(blaze) [anshul1@narval4 test]$ ls -lh expect_output/
total 100M
-rw-r-----. 1 anshul1 anshul1  34K Mar  4 11:59 test_emtpy_bc_list.csv
-rw-r-----. 1 anshul1 anshul1  31K Mar  4 11:59 test_knee_plot.png
-rw-r-----. 1 anshul1 anshul1  96M Mar  4 11:59 test_matched_reads.fastq.gz
-rw-r-----. 1 anshul1 anshul1 9.2M Mar  4 11:59 test_putative_bc.csv
-rw-r-----. 1 anshul1 anshul1  759 Mar  4 11:59 test_summary.txt
-rw-r-----. 1 anshul1 anshul1  14K Mar  4 11:59 test_whitelist.csv

Also, the job on my data had a time allocation of 2 days but failed within a few hours, so it did not run out of time. Any help would be much appreciated!

youyupei commented 8 months ago

Hi @anshulbudhraja ,

Thank you for your interest in BLAZE. Since you've successfully run BLAZE on the test dataset, the problem is unlikely to come from the configuration of your environment. It is more likely related to your FASTQ file. Is it possible that the file is incomplete, perhaps because some lines were accidentally deleted? A good starting point for troubleshooting would be to check whether the number of lines in your FASTQ file is a multiple of 4.
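
A minimal sketch of that check in Python, for reference. A line count can be a multiple of 4 even when individual records are damaged, so the sketch also checks each record's layout. It assumes an uncompressed FASTQ with exactly four lines per record; check_fastq and max_reports are hypothetical names used for illustration and are not part of BLAZE.

```python
def check_fastq(path, max_reports=10):
    """Scan an uncompressed 4-line-per-record FASTQ file.

    Returns (n_lines, problems). `problems` holds at most `max_reports`
    per-line messages, plus one message if the total line count is not
    a multiple of 4.
    """
    problems = []
    n_lines = 0
    seq_len = 0

    def report(message):
        # Cap the per-line messages: one shifted line breaks every later record.
        if len(problems) < max_reports:
            problems.append(message)

    with open(path) as handle:
        for n_lines, line in enumerate(handle, start=1):
            text = line.rstrip("\r\n")
            pos = (n_lines - 1) % 4  # 0: header, 1: sequence, 2: '+', 3: quality
            if pos == 0 and not text.startswith("@"):
                report(f"line {n_lines}: header does not start with '@'")
            elif pos == 1:
                seq_len = len(text)
            elif pos == 2 and not text.startswith("+"):
                report(f"line {n_lines}: separator does not start with '+'")
            elif pos == 3 and len(text) != seq_len:
                report(f"line {n_lines}: quality length differs from sequence length")

    if n_lines % 4 != 0:
        problems.append(f"{n_lines} lines in total, not a multiple of 4")
    return n_lines, problems
```

Calling check_fastq on the path of the input file returns the line count and a list of problems; an empty list means every record has the expected four-line layout. The first reported line number is usually the place to look, since everything after a missing or extra line will also be flagged.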

anshulbudhraja commented 8 months ago

Hi @youyupei, thanks for the reply! My file shows:

$ wc -l 10H174_pass_final.fastq
435369176

I'm now trying a different FASTQ file with fewer reads:

$ wc -l 10H174_pass_sub50M.fastq
200000000

and the progress is:

(05/03/2024 11:45:09) Getting putative barcodes from 1 FASTQ files...
Processed: 50000000Read [20:26, 40753.60Read/s]
Counting high-quality putative BC: 50it [01:11,  1.44s/it]
(05/03/2024 12:06:49) Getting barcode whitelist and empty droplet barcode list...

(05/03/2024 12:06:54) Creating emtpy droplets barocde list...
(05/03/2024 12:11:46) Assigning reads to whitelist.

Processed: 50000000Read [1:10:02, 11897.51Read/s]
(05/03/2024 13:21:48) Reads assignment completed. Demultiplexed read saved in blazeOutsub50Mmatched_reads.fastq.gz!

I believe you're right that the issue must be with the original fastq I was working with. Thank you for your time!
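
For anyone who lands here with the same traceback: "RuntimeError: generator raised StopIteration" is what Python 3.7 and later produce when a bare next() call inside a generator runs out of input, which is consistent with a FASTQ record that ends before its '+' line. The sketch below reproduces the message with a toy reader. fastq_records and try_parse are hypothetical names and this is not BLAZE's parser. Note also that the 435,369,176 lines reported above are exactly 4 × 108,842,294, so the line count alone would not have flagged this file.

```python
def fastq_records(lines):
    """Toy 4-line FASTQ reader (hypothetical; not BLAZE's actual code)."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # skip '+'; raises StopIteration if the input ends here
        qual = next(it)
        yield header, seq, qual


def try_parse(lines):
    """Return the parsed records, or the error message if parsing fails."""
    try:
        return list(fastq_records(lines))
    except RuntimeError as err:
        # Python converts a StopIteration that escapes a generator body
        # into RuntimeError (PEP 479).
        return str(err)


complete = ["@r1", "ACGT", "+", "IIII"]
cut_short = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT"]

print(try_parse(complete))
print(try_parse(cut_short))
```

The second call fails at the same step as the report, the next() that skips the '+' line, because the last record stops after its sequence line.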

shimlab / BLAZE

Error in running BLAZE through virtualenv #14