pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

Issue validating .mtx files produced by kb count #269

Closed kaushik-roy-physics closed 1 month ago

kaushik-roy-physics commented 2 months ago

Hello,

I am facing a rather strange issue while running kb count. I am using --workflow=nac, and it runs perfectly on my other set of FASTQ files. For this set, however, it produced the same kind of error on two different .mtx files during the final validation phase.

In the first instance, the error was raised for the cells_x_genes.nascent.mtx file. When I opened the file with nano and looked at the line reported as having an invalid integer value, I saw a large stretch filled with @@@@@@@@@@@@@@@@@@. I have never seen anything like this before. When I deleted those lines and reran, I got an error saying that some lines were missing.

I did a second run on a different machine, and this time the same issue appeared in the cells_x_genes.ambiguous.mtx file. Line 27491870, with the @@@@@@@@@@, is shown in the attached image.

Any idea what is causing this issue and how it might be resolved?

Best, Kaushik

The script I used was:

kb count --h5ad --verbose --overwrite -i ~/rdfmount/Kaushik/kallisto/index.idx -g ~/rdfmount/Kaushik/kallisto/t2g.txt -x 10xv3 -o ~/rdfmount/Kaushik/kallisto/48h/rerun/ -c1 ~/rdfmount/Kaushik/kallisto/cdna.txt -c2 ~/rdfmount/Kaushik/kallisto/nascent.txt --sum=total --workflow=nac ~/rdfmount/Kaushik/kallisto/48h/48h_R1_001.fastq.gz ~/rdfmount/Kaushik/kallisto/48h/48h_R2_001.fastq.gz

The indices used are the pre-built indices available in your repository.

The relevant portion of the error is: [image: Screenshot from 2024-09-12 19-20-11]

[2024-09-12 18:51:56,301] ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/kb_python/validate.py", line 55, in validate_mtx
    scipy.io.mmread(path)
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/scipy/io/_fast_matrix_market/__init__.py", line 363, in mmread
    triplet, shape = _read_body_coo(cursor, generalize_symmetry=True)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/scipy/io/_fast_matrix_market/__init__.py", line 149, in _read_body_coo
    _fmm_core.read_body_coo(cursor, i, j, data)
ValueError: Line 27491870: Invalid integer value.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/kb_python/main.py", line 1618, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/kb_python/main.py", line 592, in parse_count
    count_nac(
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/kb_python/count.py", line 1966, in count_nac
    count_result = bustools_count(
                   ^^^^^^^^^^^^^^^
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/kb_python/validate.py", line 121, in inner
    validate(path)
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/kb_python/validate.py", line 88, in validate
    VALIDATORS[ext](path)
  File "/home/kaushik/anaconda3/lib/python3.11/site-packages/kb_python/validate.py", line 57, in validate_mtx
    raise ValidateError(f'{path} is not a valid matrix market file')
kb_python.validate.ValidateError: /home/kaushik/rdfmount/Kaushik/kallisto/48h/rerun/counts_unfiltered/cells_x_genes.ambiguous.mtx is not a valid matrix market file

Yenaled commented 2 months ago

Can you run tail on those files? nano can sometimes cause some display issues.
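If tail is inconvenient for a line deep inside a large file, the raw bytes can also be inspected from Python. The following is only a minimal sketch: the function names find_nul_lines and raw_line are made up for illustration, and the idea that the @ characters might be NUL bytes (which nano displays as ^@) is an assumption, not something confirmed in this thread.

```python
def find_nul_lines(path, limit=10):
    """Return up to `limit` (line_number, nul_count) pairs for lines containing NUL bytes."""
    hits = []
    # Binary mode avoids any decoding, so unprintable bytes are seen as-is.
    with open(path, "rb") as fh:
        for line_no, raw in enumerate(fh, start=1):
            nul_count = raw.count(b"\x00")
            if nul_count:
                hits.append((line_no, nul_count))
                if len(hits) >= limit:
                    break
    return hits


def raw_line(path, line_no, max_bytes=200):
    """Return the first `max_bytes` bytes of the 1-based line `line_no`, or None if absent."""
    with open(path, "rb") as fh:
        for current, raw in enumerate(fh, start=1):
            if current == line_no:
                return raw[:max_bytes]
    return None
```

For example, print(repr(raw_line("cells_x_genes.ambiguous.mtx", 27491870))) would show the start of the reported line with escapes such as \x00 made visible, assuming the line number in the error message counts lines from the top of the file.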

kaushik-roy-physics commented 2 months ago

Update: I reran the same command a third time after posting the issue, and it worked this time.

I should also mention that I tried opening the file mentioned in my earlier post separately with scipy.io.mmread, and the same error pops up: ValueError: Line 27491870: Invalid integer value.

So a corrupt .mtx file was somehow produced in the earlier runs. The FASTQ files contain data from multiplexed scRNA sequencing and have 2871961 barcodes. It may be connected to the size of the dataset, but the failure seems to be random and goes away after rerunning. Still, it would be useful to know what causes the corrupt .mtx files.
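To locate every damaged line rather than only the first one that scipy.io.mmread reports, a line-by-line check can help. The sketch below is an illustration only and is not part of kb_python or SciPy. The name check_mtx is hypothetical, and it assumes a coordinate-format Matrix Market file in which each entry line is "row column value". It also compares the number of entries found with the count declared on the size line, which may be why deleting the damaged lines by hand led to a "lines are missing" error.

```python
def check_mtx(path, max_problems=20):
    """Return a list of (line_number, message) problems for a coordinate .mtx file.

    A final entry with line_number 0 is added if the entry count does not
    match the count declared on the size line.
    """
    problems = []
    declared = None  # number of entries declared on the size line
    entries = 0      # number of well-formed entry lines seen
    with open(path, "rb") as fh:
        for line_no, raw in enumerate(fh, start=1):
            if line_no == 1 and not raw.startswith(b"%%MatrixMarket"):
                problems.append((line_no, "missing %%MatrixMarket banner"))
                continue
            if raw.startswith(b"%") or not raw.strip():
                continue  # banner, comment, or blank line
            fields = raw.split()
            if declared is None:
                # First non-comment line: "rows cols entries".
                try:
                    _rows, _cols, declared = (int(f) for f in fields)
                except ValueError:
                    problems.append((line_no, "bad size line"))
                    return problems
                continue
            try:
                row, col, value = fields
                int(row)
                int(col)
                float(value)
            except ValueError:
                # Wrong field count or unparseable numbers (e.g. NUL bytes).
                if len(problems) < max_problems:
                    problems.append((line_no, "malformed entry"))
                continue
            entries += 1
    if declared is not None and entries != declared:
        problems.append((0, f"header declares {declared} entries, found {entries}"))
    return problems
```

An empty list means the file passed these basic checks. It does not guarantee that scipy.io.mmread will accept the file, since this sketch does not validate index ranges or the banner fields.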

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days