When trying to use technology STORMSEQ - error

yeroslaviz commented 1 month ago

Describe the issue

I'm running the kb count command on a data set from the pico kit (Takara). Read 2 has the following structure: an 8 bp UMI, followed by a 6 bp linker, followed by the cDNA.

What is the exact command that was run?

I tried both this command

    kb count --h5ad --workflow standard --sum total -w None\
         -i ${INDEX_PATH}/index.idx -g ${INDEX_PATH}/t2g.txt \
         -x STORMSEQ \
         -o ${OUTPUT_DIR}/$base/ \
         --filter bustools -t 12 \
         ${FASTQ_DIR}/$base\_L001_R1_001.fastq.gz ${FASTQ_DIR}/$base\_L001_R2_001.fastq.gz

as well as the command with a manually given -x parameter:

    kb count --h5ad --workflow standard --sum total -w None\
         -i ${INDEX_PATH}/index.idx -g ${INDEX_PATH}/t2g.txt \
         -x 0,0,0:1,0,8,1,14,20:0,0,0 \
         -o ${OUTPUT_DIR}/$base/ \
         --filter bustools -t 12 \
         ${FASTQ_DIR}/$base\_L001_R1_001.fastq.gz ${FASTQ_DIR}/$base\_L001_R2_001.fastq.gz

When running the first command, which uses the STORMSEQ technology, I get the error:

    Error: technology string must contain two colons (:), none found: "STORMSEQ"
    Unable to create technology: STORMSEQ
    kallisto 0.50.1

This happens even though the technology appears in the kb --list output.

When executing the second command, I do manage to create a count matrix, but the filtering step throws the following error:

    ValueError: Observations annot. `obs` must have number of rows of `X` (3229665), but has 0 rows.

The verbose output of the second command is attached in the file kb_counts.txt

So I have two questions:

1. Why can't I use the STORMSEQ technology?
2. Is the -x parameter in the second command a correct replacement for the STORMSEQ technology?

thanks

Yenaled commented 1 month ago

1. You can't supply -x STORMSEQ because of a bug in one of kb-python's dependencies. I haven't prioritized fixing this bug, since you can just as easily supply the technology string directly.
2. The technology string is incorrect. I think STORMSEQ has the reads you want to map in the R1 file as well as after the linker (from position 14) in the R2 file, and there are no barcodes. The technology string should then be supplied as -x " -1,0,0:1,0,8:0,0,0,1,14,0", and you should also supply --parity=paired because the data is paired-end (I think?).
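To make the layout of that string easier to see, here is a small shell sketch that splits it into its three colon-separated fields. It assumes the usual kallisto convention of barcode:UMI:cDNA, with each location written as file,start,stop, files numbered from 0 in the order given on the command line, and a stop of 0 meaning "to the end of the read". The variable names are illustrative only.

```shell
# Technology string from the reply above (note the leading space)
tech=" -1,0,0:1,0,8:0,0,0,1,14,0"

# Drop the leading space, then split on ':' into the three fields
trimmed="${tech# }"
barcode="${trimmed%%:*}"   # text before the first ':'
rest="${trimmed#*:}"       # text after the first ':'
umi="${rest%%:*}"          # text between the first and second ':'
cdna="${rest#*:}"          # text after the second ':'

# barcode: file index -1, used in the reply because there are no barcodes
echo "barcode: $barcode"
# UMI: file 1 (R2), positions 0 to 8
echo "UMI:     $umi"
# cDNA: all of file 0 (R1), plus file 1 (R2) from position 14 onward
echo "cDNA:    $cdna"
```

Read this way, the earlier attempt 0,0,0:1,0,8,1,14,20:0,0,0 would have put the R2 cDNA range into the UMI field rather than the cDNA field.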

yeroslaviz commented 1 month ago

Thanks for the clarification. My data is not really STORMSEQ, but in one of the issues here I found someone suggesting this technology for data from the Takara pico kit.

In this kit, the UMI (8 nt long) sits at the start of R2, followed by a 6 nt linker that I want to discard. So I want to quantify R1 as well as R2 without its first 14 nt. The image below shows the two reads and how they are constructed.

However, when I run your suggestion with -x "-1,0,0:1,0,8:0,0,0,1,14,0", I get the error kb count: error: argument -x: expected one argument. Am I doing something wrong?

Yenaled commented 1 month ago

Notice the space I had immediately after the opening quotation mark? You need that.
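A likely explanation, which is an assumption and not stated in the thread, is that the argument parser treats a value beginning with - as another option, so -x appears to have no argument; the leading space avoids that. The sketch below only illustrates the shell side: the space survives when the string is quoted and is lost when a variable holding it is expanded unquoted. The helper function show_first_arg is hypothetical.

```shell
tech=" -1,0,0:1,0,8:0,0,0,1,14,0"

# Hypothetical helper: print the first argument it receives, in brackets
show_first_arg() { printf '[%s]\n' "$1"; }

# Quoted: the argument is passed as-is, leading space included
quoted=$(show_first_arg "$tech")

# Unquoted: word splitting strips the leading space,
# so the argument starts with '-' again
unquoted=$(show_first_arg $tech)

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

So when the technology string is stored in a variable, it should be passed as -x "$tech" and not -x $tech.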

yeroslaviz commented 1 month ago

Now it seems to run.

Thanks a lot.
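For reference, combining the original command with the suggested technology string and --parity=paired might look like the dry run below. This is a sketch under assumptions: the thread does not confirm which of the original flags were kept in the run that worked, and the path and sample values are hypothetical placeholders.

```shell
# Hypothetical placeholder values; substitute your own paths and sample name
INDEX_PATH="/path/to/index"
OUTPUT_DIR="/path/to/output"
FASTQ_DIR="/path/to/fastq"
base="sample1"

# Technology string from the reply, including the leading space
tech=" -1,0,0:1,0,8:0,0,0,1,14,0"

# Build the argument list; "$tech" stays quoted so the leading space survives
set -- kb count --h5ad --workflow standard --sum total -w None \
    -i "${INDEX_PATH}/index.idx" -g "${INDEX_PATH}/t2g.txt" \
    -x "$tech" --parity=paired \
    -o "${OUTPUT_DIR}/${base}/" \
    --filter bustools -t 12 \
    "${FASTQ_DIR}/${base}_L001_R1_001.fastq.gz" \
    "${FASTQ_DIR}/${base}_L001_R2_001.fastq.gz"

# Dry run: print one argument per line; replace this line with "$@" to execute
printf '[%s]\n' "$@"
```

Printing the arguments first makes it easy to check that the value after -x still begins with a space before anything is actually run.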

pachterlab / kb_python

When trying to use technology STORMSEQ - error #261