pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

Error during summing matrices step #268

Closed: palatinate closed this issue 1 month ago

palatinate commented 2 months ago

When using --sum cell, I get the following error: ValueError: invalid literal for int() with base 10: '1.55798e+06'

Command:

kb count --h5ad --verbose --bootstraps=100 -t 45 --strand=reverse --parity=paired -m 200G -x BULK --workflow=nac -c1 /data/projects/rna_host_respons/nextflow_kb_python/grch38/nascent/cdna.txt -c2 /data/projects/rna_host_respons/nextflow_kb_python/grch38/nascent/nascent.txt -o nascent -i /data/projects/rna_host_respons/nextflow_kb_python/grch38/nascent/index.idx -g /data/projects/rna_host_respons/nextflow_kb_python/grch38/nascent/t2g.txt --sum cell --sum total --matrix-to-directories batch.txt

Output:

[2024-08-21 00:53:05,865] INFO [count_nac] Inspecting BUS file nascent/tmp/output.s.bus
[2024-08-21 00:53:05,865] DEBUG [count_nac] bustools inspect -o nascent/inspect.json nascent/tmp/output.s.bus
[2024-08-21 00:53:06,968] INFO [count_nac] Generating count matrix nascent/counts_unfiltered/cells_x_genes from BUS file nascent/tmp/output.s.bus
[2024-08-21 00:53:06,968] DEBUG [count_nac] bustools count -o nascent/counts_unfiltered/cells_x_genes -g /data/projects/rna_host_respons/nextflow_kb_python/grch38/nascent/t2g.txt -e nascent/matrix.ec -t nascent/transcripts.txt -s /data/projects/rna_host_respons/nextflow_kb_python/grch38/nascent/nascent.txt --genecounts --cm nascent/tmp/output.s.bus
[2024-08-21 00:55:05,190] DEBUG [count_nac] nascent/counts_unfiltered/cells_x_genes.mature.mtx passed validation
[2024-08-21 00:55:05,205] DEBUG [count_nac] nascent/counts_unfiltered/cells_x_genes.nascent.mtx passed validation
[2024-08-21 00:55:05,220] DEBUG [count_nac] nascent/counts_unfiltered/cells_x_genes.ambiguous.mtx passed validation
[2024-08-21 00:55:05,220] INFO [count_nac] Writing gene names to file nascent/counts_unfiltered/cells_x_genes.genes.names.txt
[2024-08-21 00:55:05,481] WARNING [count_nac] 14053 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2024-08-21 00:55:05,515] INFO [count_nac] Summing matrices into nascent/counts_unfiltered/cells_x_genes.cell.mtx
[2024-08-21 00:55:05,550] ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/kb_python/main.py", line 1618, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/usr/local/lib/python3.10/dist-packages/kb_python/main.py", line 592, in parse_count
    count_nac(
  File "/usr/local/lib/python3.10/dist-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/kb_python/count.py", line 2020, in count_nac
    sums['cell'] = do_sum_matrices(
  File "/usr/local/lib/python3.10/dist-packages/kb_python/utils.py", line 811, in do_sum_matrices
    _nums2[2] = int(_nums2[2])
ValueError: invalid literal for int() with base 10: '1.55798e+06'
[2024-08-21 00:55:05,557] DEBUG [main] Removing nascent/tmp directory

I assume _nums2[2] = int(_nums2[2]) needs to change to _nums2[2] = float(_nums2[2]) or _nums2[2] = int(float(_nums2[2]))
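To illustrate the int(float(...)) idea, here is a minimal sketch of a tolerant parser. The helper name parse_count is hypothetical and not part of kb-python; it only shows how a token such as '1.55798e+06' could be converted without raising, assuming the value is meant to be a whole number.

```python
def parse_count(token: str) -> int:
    """Parse a whole-number count that may be written in scientific notation.

    Hypothetical helper for illustration only; not kb-python's own code.
    """
    try:
        # Fast path: plain integer text such as "42".
        return int(token)
    except ValueError:
        # Fallback: text such as "1.55798e+06" parses as a float first.
        value = float(token)
        if not value.is_integer():
            raise ValueError(f"not a whole number: {token!r}")
        return int(value)


print(parse_count("1.55798e+06"))
```

Note that a value printed with six significant digits may already have been rounded when it was written, so this recovers the number as it appears in the file, which is not necessarily the exact original count.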

Yenaled commented 2 months ago

Interesting… I think when numbers are too big, scientific notation gets written out and int() doesn’t work on that format. Looks like something that I’ll have to fix in the next version of kb-python (planning a new release soon and I’ll incorporate this update then).

In the meantime, don’t worry about using --sum; you can just load the three separate matrices into Python and sum them yourself.
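A rough sketch of that workaround, assuming SciPy is installed and that the matrices are the .mature.mtx, .nascent.mtx and .ambiguous.mtx files named in the log above. The function name sum_count_matrices is made up for this example. Which parts make up each sum is also an assumption here (cell = mature + ambiguous, total = all three), so check it against kb count --help before relying on it.

```python
from scipy.io import mmread


def sum_count_matrices(prefix, parts):
    """Sum the Matrix Market files named <prefix>.<part>.mtx, one per part.

    Hypothetical helper for illustration; all matrices must share one shape.
    """
    total = None
    for part in parts:
        # mmread returns a sparse matrix in COO format; CSR is easier to add.
        matrix = mmread(f"{prefix}.{part}.mtx").tocsr()
        total = matrix if total is None else total + matrix
    return total


# Example usage with the paths from the log (assumed part-to-sum mapping):
# prefix = "nascent/counts_unfiltered/cells_x_genes"
# cell = sum_count_matrices(prefix, ["mature", "ambiguous"])
# total = sum_count_matrices(prefix, ["mature", "nascent", "ambiguous"])
```

The result stays sparse, so it can be passed on to downstream tools or written back out with scipy.io.mmwrite.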

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days