pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

0 Reads Pseudoaligned #153

Closed nhutchins627 closed 2 years ago

nhutchins627 commented 2 years ago

Hi! I was just trying to use the kb tools to get the spliced/unspliced counts for RNA velocity. I was trying to align reads from the FASTQ files:

https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-6967/22109_1_TCTTAGGC_S8_L001_I1_001.fastq.gz
https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-6967/22109_1_TCTTAGGC_S8_L001_R1_001.fastq.gz
https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-6967/22109_1_TCTTAGGC_S8_L001_R2_001.fastq.gz
https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-6967/22109_1_TCTTAGGC_S8_L001_R3_001.fastq.gz

The technology is 10xv1. I made a mouse reference index (against GRCm38.92) with the following command:

kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow lamanno -n 8 Mus_musculus.GRCm38.dna.primary_assembly.fa.gz Mus_musculus.GRCm38.92.gtf.gz --overwrite

Afterward, I tried to align the files to that reference with the following command:

kb count --h5ad -i index.idx_cdna,index.idx_intron.0,index.idx_intron.1,index.idx_intron.2,index.idx_intron.3,index.idx_intron.4,index.idx_intron.5,index.idx_intron.6 -g t2g.txt -x 10xv2 -o SRR6470907 -c1 spliced_t2c.txt -c2 unspliced_t2c.txt --workflow lamanno --filter bustools 22109_1_TCTTAGGC_S8_L001_R*

But I receive the following error:

`[2021-11-30 11:03:47,242] WARNING [main] Multiple indices were provided. Aligning to split indices is currently EXPERIMENTAL and results in loss of reads. It is recommended to use a single index until this feature is fully supported. Use at your own risk!
[2021-11-30 11:03:49,369] INFO [count_lamanno] Generating BUS file using 8 indices
[2021-11-30 11:03:49,371] INFO [count_lamanno] Using index index.idx_cdna to generate BUS file to SRR6470907/tmp/bus_part0 from
[2021-11-30 11:03:49,371] INFO [count_lamanno] 22109_1_TCTTAGGC_S8_L001_R1_001.fastq.gz
[2021-11-30 11:03:49,371] INFO [count_lamanno] 22109_1_TCTTAGGC_S8_L001_R2_001.fastq.gz
[2021-11-30 11:03:49,371] INFO [count_lamanno] 22109_1_TCTTAGGC_S8_L001_R3_001.fastq.gz
[2021-11-30 11:03:50,482] ERROR [count_lamanno] bus: unrecognized option '--kmer'

[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
Error: Number of files (3) does not match number of input files required by technology 10XV2 (2)
kallisto 0.46.2
Generates BUS files for single-cell sequencing

Usage: kallisto bus [arguments] FASTQ-files

Required arguments:
-i, --index=STRING Filename for the kallisto index to be used for pseudoalignment
-o, --output-dir=STRING Directory to write output to

Optional arguments:
-x, --technology=STRING Single-cell technology used
-l, --list List all single-cell technologies supported
-B, --batch=FILE Process files listed in FILE
-t, --threads=INT Number of threads to use (default: 1)
-b, --bam Input file is a BAM file
-n, --num Output number of read in flag column (incompatible with --bam)
-T, --tag=STRING 5′ tag sequence to identify UMI reads for certain technologies
--fr-stranded Strand specific reads for UMI-tagged reads, first read forward
--rf-stranded Strand specific reads for UMI-tagged reads, first read reverse
--unstranded Treat all read as non-strand-specific
--paired Treat reads as paired
--genomebam Project pseudoalignments to genome sorted BAM file
-g, --gtf GTF file for transcriptome information (required for --genomebam)
-c, --chromosomes Tab separated file with chromosome names and lengths (optional for --genomebam, but recommended)
--verbose Print out progress information every 1M proccessed reads

[2021-11-30 11:03:50,483] ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/home/nicholas/.local/lib/python3.8/site-packages/kb_python/main.py", line 1353, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/home/nicholas/.local/lib/python3.8/site-packages/kb_python/main.py", line 454, in parse_count
    count_velocity(
  File "/home/nicholas/.local/lib/python3.8/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/home/nicholas/.local/lib/python3.8/site-packages/kb_python/count.py", line 1982, in count_velocity
    bus_result = kallisto_bus_split(
  File "/home/nicholas/.local/lib/python3.8/site-packages/kb_python/count.py", line 231, in kallisto_bus_split
    kallisto_bus(
  File "/home/nicholas/.local/lib/python3.8/site-packages/kb_python/validate.py", line 116, in inner
    results = func(*args, **kwargs)
  File "/home/nicholas/.local/lib/python3.8/site-packages/kb_python/count.py", line 191, in kallisto_bus
    run_executable(command)
  File "/home/nicholas/.local/lib/python3.8/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/home/nicholas/.local/lib/python3.8/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
subprocess.CalledProcessError: Command '/home/nicholas/.local/lib/python3.8/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i index.idx_cdna -o SRR6470907/tmp/bus_part0 -x 10xv2 -t 8 --num --kmer 22109_1_TCTTAGGC_S8_L001_R1_001.fastq.gz 22109_1_TCTTAGGC_S8_L001_R2_001.fastq.gz 22109_1_TCTTAGGC_S8_L001_R3_001.fastq.gz' returned non-zero exit status 1.`

I was wondering what to do about this, as I can't figure out what the issue might be. Additionally, there are 4 FASTQ files, but the kb tools will only take a maximum of 3 for this kind of chemistry, and I was wondering how to deal with that. There are also many files I'd like to align to this same reference (beyond the four supplied, maybe about 100), and I was wondering if there is a way, with regular expressions, to align them all in one run in order to generate one large count matrix.

Thank you!

Lioscro commented 2 years ago

Hi, @nhutchins627, There are a couple of things. First, I briefly skimmed through the publication that generated this data (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6522369/), and it seems that they are using 10X version 1 (so you would need to provide -x 10xv1 instead of -x 10xv2), which explains why there are 3 files. (The index file, the file containing I1 in the name, does not need to be provided as input).

Additionally, using split indices is broken in the current release because the kallisto release that is shipped with kb does not include the --kmer option, which is required. Would it be possible to use a full index? How much memory do you have on your machine? The machine you are running the command on will require at least 32GB of memory.

If not, I can provide you instructions to manually compile a compatible (development) kallisto version.
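For readers following along, here is a minimal sketch of how the two corrections above (switching to -x 10xv1 and passing a single, unsplit index) might be combined into one kb count invocation. It only assembles the argument list. All flags are copied from the original post; the function names are illustrative, and the single index name index.idx and the cdna_t2c.txt / intron_t2c.txt file names are assumptions taken from the kb ref command in that post. The read files are kept in sorted order, as the shell glob would give them; whether that is the order 10xv1 expects is not confirmed in this thread.

```python
import shlex


def select_read_files(fastqs):
    """Drop the index read (the file with I1 in its name) and sort the rest."""
    return sorted(name for name in fastqs if "_I1_" not in name)


def build_count_command(fastqs, index="index.idx", out_dir="SRR6470907"):
    """Assemble a kb count argument list using flags from the original post.

    Assumptions: `index` is a single (unsplit) index, and the -c1/-c2 files
    are the ones written by the kb ref command quoted in the first post.
    """
    return [
        "kb", "count", "--h5ad",
        "-i", index,
        "-g", "t2g.txt",
        "-x", "10xv1",  # corrected from 10xv2
        "-o", out_dir,
        "-c1", "cdna_t2c.txt",
        "-c2", "intron_t2c.txt",
        "--workflow", "lamanno",
        "--filter", "bustools",
        *select_read_files(fastqs),
    ]


if __name__ == "__main__":
    files = [
        "22109_1_TCTTAGGC_S8_L001_I1_001.fastq.gz",
        "22109_1_TCTTAGGC_S8_L001_R1_001.fastq.gz",
        "22109_1_TCTTAGGC_S8_L001_R2_001.fastq.gz",
        "22109_1_TCTTAGGC_S8_L001_R3_001.fastq.gz",
    ]
    # Print the command for inspection instead of running it.
    print(shlex.join(build_count_command(files)))
```

Printing the command first makes it easy to check the file order and flags before handing the list to subprocess.run.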

nhutchins627 commented 2 years ago

Thank you! I will change the 10xv1 thing.

Oh I'm sorry I read the tutorial again and will just change n.

Which is from the tutorial but using the 38.92 annotation.

Lioscro commented 2 years ago

The -n 8 option specifies how many files to split the index into. You can just remove that part.
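As a rough illustration of that suggestion, the sketch below rebuilds the kb ref command from the first post with the -n argument left out by default, so a single index file is requested. Every flag and file name is copied from that post; the function name and the optional n_splits parameter are illustrative only.

```python
def build_ref_command(n_splits=None):
    """Assemble the kb ref argument list from the original post.

    With n_splits=None (the default) the -n option is omitted entirely,
    so the index is not split into multiple files.
    """
    command = [
        "kb", "ref",
        "-i", "index.idx",
        "-g", "t2g.txt",
        "-f1", "cdna.fa",
        "-f2", "intron.fa",
        "-c1", "cdna_t2c.txt",
        "-c2", "intron_t2c.txt",
        "--workflow", "lamanno",
    ]
    if n_splits is not None:
        # Only add -n when a split index is explicitly requested.
        command += ["-n", str(n_splits)]
    command += [
        "Mus_musculus.GRCm38.dna.primary_assembly.fa.gz",
        "Mus_musculus.GRCm38.92.gtf.gz",
        "--overwrite",
    ]
    return command
```

Calling build_ref_command() gives the unsplit form, while build_ref_command(8) reproduces the command as originally posted.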

Out of curiosity, how much free memory does your machine have?

nhutchins627 commented 2 years ago

I'm using the Whitehead cluster, so I assume that it has quite a bit of memory?

Also is there a way to align many fastq files to the same reference all at once? I have maybe about 100 that I would like to align all to the mouse genome and have one count matrix.

Lioscro commented 2 years ago

That is possible by simply providing multiple FASTQs. For instance, if you have A_R1.fastq, A_R2.fastq, B_R1.fastq, B_R2.fastq where A was generated from one run and B was generated from another, you can provide kb count ... A_R1.fastq A_R2.fastq B_R1.fastq B_R2.fastq. This, however, assumes that any overlapping barcodes correspond to the same exact cell (i.e. the UMIs from barcode 1 in sample A will be combined with the UMIs from this same barcode in sample B). If you're just trying to process multiple (different) samples, the best way is to simply run kb once for each sample.
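The "run kb once for each sample" approach can be sketched as follows. This is only an illustration under stated assumptions: it assumes bare file names that follow the pattern seen in this thread (`<sample>_R1_001.fastq.gz`, with index reads named `_I1_`), uses the sample prefix as the output directory, and takes the shared kb count options as a list supplied by the caller. The function names and the pattern are hypothetical helpers, not part of kb_python.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <sample>_<R1|R2|R3|I1>_<digits>.fastq or .fastq.gz
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<read>[IR]\d)_\d+\.fastq(?:\.gz)?$"
)


def group_by_sample(fastqs):
    """Map each sample prefix to its sorted read files, skipping index reads."""
    groups = defaultdict(list)
    for name in sorted(fastqs):
        match = FASTQ_PATTERN.match(name)
        if match is None or match.group("read").startswith("I"):
            continue  # not a recognised FASTQ name, or an index read
        groups[match.group("sample")].append(name)
    return dict(groups)


def per_sample_commands(fastqs, shared_args):
    """Build one kb count argument list per sample.

    `shared_args` holds the options common to every run (index, t2g,
    technology, workflow, ...); each sample gets its own -o directory.
    """
    return {
        sample: ["kb", "count", *shared_args, "-o", sample, *reads]
        for sample, reads in group_by_sample(fastqs).items()
    }
```

Each resulting list could then be passed to subprocess.run in a loop. Because every sample is written to its own output directory, barcodes from different samples are never merged; combining the per-sample matrices afterwards is a separate step not covered here.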
