pachterlab / kallistobustools

kallisto | bustools workflow for pre-processing single-cell RNA-seq data
https://kallistobus.tools/
MIT License
114 stars 30 forks source link

Error: Number of files (1) does not match number of input files required by technology 10XV1 (3) #60

Open Gabrisfon opened 3 months ago

Gabrisfon commented 3 months ago

Hello everyone. First of all, I'd like to congratulate you on this incredible tool.

In my current project, I need to create cell x transcript matrices for my single nucleus human sample data generated using 10x technology. I have been trying to process this data using kallisto bus following this tutorial but unfortunately none of the indexes I have created have been compatible with 10xv1 technology and so I have been getting the following error:

**[gabrielfonseca@login 02_kallisto_bus]$ srun kb count --tcc -i index_HS_98.idx -g t2g.txt -c1 cDNA_t2c.txt -c2 introns_t2c.txt -x 10xv1 -o /sn_ovary/results/gabrielfonseca/02_kallisto_bus/01_Kbc_results/ --filter bustools -t 28 --workflow nucleus /sn_ovary/data/00_raw_data/input/raw_data/ [2024-03-09 12:28:07,487] INFO [count_lamanno] Using index index_HS_98.idx to generate BUS file to /sn_ovary/results/gabrielfonseca/02_kallisto_bus/01_Kbc_results/ from [2024-03-09 12:28:07,488] INFO [count_lamanno] /sn_ovary/data/00_raw_data/input/raw_data/ [2024-03-09 12:28:08,599] ERROR [count_lamanno] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology Error: Number of files (1) does not match number of input files required by technology 10XV1 (3) kallisto 0.48.0 Generates BUS files for single-cell sequencing Usage: kallisto bus [arguments] FASTQ-files Required arguments: -i, --index=STRING Filename for the kallisto index to be used for pseudoalignment -o, --output-dir=STRING Directory to write output to Optional arguments: -x, --technology=STRING Single-cell technology used -l, --list List all single-cell technologies supported -B, --batch=FILE Process files listed in FILE -t, --threads=INT Number of threads to use (default: 1) -b, --bam Input file is a BAM file -n, --num Output number of read in flag column (incompatible with --bam) -T, --tag=STRING 5′ tag sequence to identify UMI reads for certain technologies --fr-stranded Strand specific reads for UMI-tagged reads, first read forward --rf-stranded Strand specific reads for UMI-tagged reads, first read reverse --unstranded Treat all read as non-strand-specific

--unstranded Treat all read as non-strand-specific --paired Treat reads as paired --genomebam Project pseudoalignments to genome sorted BAM file -g, --gtf GTF file for transcriptome information (required for --genomebam) -c, --chromosomes Tab separated file with chromosome names and lengths (optional for --genomebam, but recommended) --verbose Print out progress information every 1M proccessed reads [2024-03-09 12:28:08,599] ERROR [main] An exception occurred Traceback (most recent call last): File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/main.py", line 1305, in main COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir) File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/main.py", line 491, in parse_count count_velocity( File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-ngs-tools-1.8.1-olf5dpwkrl54xgpw6icmsugxnwpponf6/lib/python3.10/site-packages/ngs_tools/logging.py" , line 62, in inner return func(*args, kwargs) File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/count.py", line 1593, in count_velocity bus_result = kallisto_bus( File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/validate.p y", line 116, in inner results = func(*args, *kwargs) File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/count.py", line 150, in kallisto_bus run_executable(command) File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/dry/init .py", line 25, in inner return func(args, kwargs) File "/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/utils.py", line 203, in run_executable raise sp.CalledProcessError(p.returncode, ' '.join(command)) subprocess.CalledProcessError: Command '/opt/spack/opt/spack/linux-rocky8-x86_64/gcc-12.2.0/py-kb-python-0.27.3-wdlrh6n6qos4wtlxjqcdgtp7ys2z7gon/lib/python3.10/site-packages/kb_python/bins/linux/kallisto/kallisto bus -i index_HS_98.idx -o /sn_ovary/results/gabrielfonseca/02_kallisto_bus/01_Kbc_results/ -x 10xv1 -t 28 /sn_ovary/data/00_raw_data/input/raw_data/' returned non-zero exit status 1.**

I have created several indexes with different versions of gtf and genomes human files. The last version I tried was according to this one from 10x (References - 2020-A (July 7, 2020)) but it doesn't worked. I really need to process this data as soon as possible. This is the command I've been using to generate the indexes:

kb ref -i index_HS_98.idx -g t2g.txt -f1 ./cdna.fa -f2 ./introns.fa -c1 cDNA_t2c.txt -c2 introns_t2c.txt --workflow=nucleus /sn_ovary/data/gabrielfonseca/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.1 /sn_ovary/data/gabrielfonseca/gencode.v32.primary_assembly.annotation.gtf.gz --overwrite

Yenaled commented 3 months ago

Hi! Sorry for the semi-late reply (was traveling for the past week). 10XV1 requires three files: First file contains 14-bp barcode, second file contains the 10-bp UMI, and third file contains the read sequence to be mapped against the reference.

Additionally, you can't specify a directory (i.e. /raw_data/) as your input -- you must supply the actual FASTQ file names.

Gabrisfon commented 3 months ago

Hi Delaney, don't worry and thanks for replying. Unfortunately I haven't succeeded yet. I searched here but couldn't find a tutorial on how to get these files (First file contains 14-bp barcode, second file contains the 10-bp UMI), could you indicate one out please? the third file are my FASTQ files using their names, right?

Yenaled commented 3 months ago

Can you show me the contents of your /sn_ovary/data/00_raw_data/input/raw_data/ folder? All the FASTQ files should be there.

Gabrisfon commented 3 months ago

Yes of course.

2024-03-13

Yenaled commented 3 months ago

Hmm, ok, so you only have an R1 and an R2 file (not three files). I'm not sure how to process data in this format because I have no ideas where your barcodes are, where your UMIs are, and where your actual RNA sequence is.

You need to find someone who can tell you "My barcode is at location in file , my UMI is at location in file , my RNA sequence is at location in file ". Only then can you use kallisto (or any other RNAseq pipeline). I don't even know if your data is 10xv1 (it doesn't look like it...).

Gabrisfon commented 3 months ago

Hello Delaney, I apologize for the delay in replying as I was looking for answers to these questions. According to the protocol of the kit that was used to generate the data, the chemistry used in the gel bead is v1. The position of the barcodes and UMI is according to the photo below. The kit used was a singleplex kit, so the probe sequence is the same for all the nuclei, as shown in the link. The indices used were according to these Dual Index Kit TS, Set A (PN-1000251): CSV | JSON

Thanks in advance for your help.

image_19234862031710509871414

Yenaled commented 3 months ago

OK, based on the protocol provided, the technology appears to be 10xv3 (16 bp barcode, 12 bp UMIs -- R1 read length is 28 bp).

So supply -x 10xv3 to kb count.

Gabrisfon commented 3 months ago

I ran the command as you instructed and it's working! I was waiting for the result, but I think it may take a while. Thank you very much for your help!

[gabrielfonseca@login raw_data]$ srun kb count --tcc -i /data/02_kallisto_bus/index_HS2_98.idx -g /data/02_kallisto_bus/t2g.txt -c1 /data/02_kallisto_bus/cDNA_t2c.txt -c2 /data/02_kallisto_bus/introns_t2c.txt -x 10xv3 -o /results/02_kallisto_bus/01_Kbc_results --filter bustools -t 28 --workflow nucleus SN-1_S1_L001_R1_001.fastq.gz SN-1_S1_L001_R2_001.fastq.gz SN-1_S1_L002_R1_001.fastq.gz SN-1_S1_L002_R2_001.fastq.gz SN-2_S2_L001_R1_001.fastq.gz SN-2_S2_L001_R2_001.fastq.gz SN-2_S2_L002_R1_001.fastq.gz SN-2_S2_L002_R2_001.fastq.gz SN-3_S3_L001_R1_001.fastq.gz SN-3_S3_L001_R2_001.fastq.gz SN-3_S3_L002_R1_001.fastq.gz SN-3_S3_L002_R2_001.fastq.gz SN-4_S4_L001_R1_001.fastq.gz SN-4_S4_L001_R2_001.fastq.gz SN-4_S4_L002_R1_001.fastq.gz SN-4_S4_L002_R2_001.fastq.gz SN-5_S5_L001_R1_001.fastq.gz SN-5_S5_L001_R2_001.fastq.gz SN-5_S5_L002_R1_001.fastq.gz SN-5_S5_L002_R2_001.fastq.gz SN-6_S6_L001_R1_001.fastq.gz SN-6_S6_L001_R2_001.fastq.gz SN-6_S6_L002_R1_001.fastq.gz SN-6_S6_L002_R2_001.fastq.gz SN-7_S7_L001_R1_001.fastq.gz SN-7_S7_L001_R2_001.fastq.gz SN-7_S7_L002_R1_001.fastq.gz SN-7_S7_L002_R2_001.fastq.gz SN-8_S8_L001_R1_001.fastq.gz SN-8_S8_L001_R2_001.fastq.gz SN-8_S8_L002_R1_001.fastq.gz SN-8_S8_L002_R2_001.fastq.gz [2024-03-15 21:55:32,538] INFO [count_lamanno] Using index /data/02_kallisto_bus/index_HS2_98.idx to generate BUS file to /results/02_kallisto_bus/01_Kbc_results from [2024-03-15 21:55:32,539] INFO [count_lamanno] SN-1_S1_L001_R1_001.fastq.gz [2024-03-15 21:55:32,539] INFO [count_lamanno] SN-1_S1_L001_R2_001.fastq.gz [2024-03-15 21:55:32,539] INFO [count_lamanno] SN-1_S1_L002_R1_001.fastq.gz [2024-03-15 21:55:32,539] INFO [count_lamanno] SN-1_S1_L002_R2_001.fastq.gz [2024-03-15 21:55:32,539] INFO [count_lamanno] SN-2_S2_L001_R1_001.fastq.gz ...

I'd like to ask you a few more questions. Is it only necessary to use the --tcc parameter to generate Transcript x Cell matrices and are the parameters I used on the command line adequate for this? With this version of the tool is it still necessary to use bustools or is all the processing already done by Kb count? And finally, is it possible to pre-process the data with this tool or is it necessary to use another tool, for example Seurat?

Yenaled commented 3 months ago

Yes, you can just use —tcc for generating your TCC matrices — kb count does everything you need.

Once you have generated your matrices, that’s when you start using Seurat (knee plots, violin plots, clustering, etc.)

Gabrisfon commented 3 months ago

Thank you very much for your help and all the information!

Yours sincerely, Gabriel.

Gabrisfon commented 3 months ago

Hello Delaney, I had to reactivate this topic for another question if you can help me. After generating the quantification files with kb count using this parameter --tcc , because I want to identify the isoforms of the genes present in the cell populations later, I received as output the following files within one of the output folders counts_unfiltered: cells_x_tcc.barcodes.txt, cells_x_tcc.ec.txt and cells_x_tcc.mtx. I have tried to follow these recommendations in these links 1 and 2 to import these files with seurat, but in all of them I am failing, I have also tried to use as features the transcript.txt file that was generated outside this folder, but unfortunately it is not working either due to the number of lines being inconsistent.

Is there any other way of importing this data using seurat that is different due to its format?

expression_matrix <- ReadMtx(mtx="/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.mtx", features = "/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.ec.txt", cells = "/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.barcodes.txt", feature.column=1, mtx.transpose = TRUE)

Error in make.unique(names = feature.names) : 'names' must be a character vector

expression_matrix <- ReadMtx(mtx="/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.mtx", features = "/results/02_kallisto_bus/02_Kbc_resultsK/transcripts.txt", cells = "/results/02_kallisto_bus/02_Kbc_resultsK/counts_unfiltered/cells_x_tcc.barcodes.txt", feature.column=1, mtx.transpose = TRUE)

Error: Matrix has 13062953 rows but found 1547350 features.

Thanks in advance.

Yenaled commented 3 months ago

This is a bit tricky. You're getting the TCCs which is NOT the same as isoforms.

In any case, the first way would be the way to go if you do want to import TCCs -- I think it's getting confused about cells_x_tcc.ec.txt because the first column is a list of numbers. You might have to edit the list of numbers to be a list of strings (e.g. prepend an alphabetical character in front of each number).

Gabrisfon commented 3 months ago

These are the output files. The ones in cells_x_tcc.ec.txt do not contain the feature information that would be needed to import with Seurat. I believe that for the analysis I need to do I would need a feature file with information on the count of transcripts per cell, right? similar to the count of genes per cell generated by Cell Ranger. Is there any parameter I should use in my code to generate this information?

[gabrielfonseca@login counts_unfiltered]$ head cells_x_tcc.barcodes.txt AAACCCAGTAATGGCG AAACCCATCACCATAA AAACCCATCGAACGCC AAACCCATCTGACCTT AAACCCATCTGCTAGG AAACGAAAGAAACCCG AAACGAAAGAAACTCA AAACGAAAGAAAGACA AAACGAAAGAACCTCC AAACGAAAGACAACTA

[gabrielfonseca@login counts_unfiltered]$ head cells_x_tcc.ec.txt 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9

[gabrielfonseca@login counts_unfiltered]$ tail cells_x_tcc.ec.txt 13062943 292286,292289,982741,982746,982774,982776 13062944 16518,955620,955634 13062945 83836,83839,83840,687187,955620,955634 13062946 1285587,1323410,1323412,1323414,1323417,1323419,1323422,1323425,1323427,1323431 13062947 58704,598549,598551,598552 13062948 1124772,1124775,1344511 13062949 747983,1297369 13062950 64342,285425,428088,589909,589927,589936,589941,589951,589962,589970,589986,589996,590004,590007,590016,590025,590031,590040,590048,590058,590070,590071,590080,590087,590095,590105,590114,590120,590127,590136,590142,590148,590155,590163,590172,590182,590205,590212,590220,590228,590239 13062951 63415,63416,585449,585453,585455,585457,1015962,1015963 13062952 866308,1282566,1282572,1282575,1282577,1282581,1282587,1282590,1282592,1282594,1282598,1282601,1285241

[gabrielfonseca@login counts_unfiltered]$ head cells_x_tcc.mtx %%MatrixMarket matrix coordinate real general % 54595 13062953 1245195
1 1583467 1 2 649261 1 3 1554508 1 4 55953 1 5 1577131 1 6 40130 1 6 231585 2

Yenaled commented 3 months ago

You would need to use kallisto, not just kb-python.

kallisto quant-tcc is what you would need to use. Unfortunately, I don't have a straightforward workflow for this written up but you'll welcome to play around with it. Essentially, kallisto quant-tcc estimates transcript counts using an EM algorithm, but this algorithm a while to run so, with thousands of cells, it can run for days. Second, with 10x data, you're sequencing only the 3' end of reads so there may not be enough diversity to get reliable transcript estimates. Those are the reasons I haven't made a workflow for it (but all the pieces are there if you want to give it a whirl).