[Request] Using smart-seq2-vdj without "cell" substring as cell tag

krkathuria commented 10 months ago

Checklist before submitting the issue:

[x] The issue is strongly related to the MiXCR software
[x] The issue can be reproduced with the most recent version of MiXCR
[x] There is no answer to the question in the official documentation and there is no duplicate issue in the bug tracker
[ ] Inspection of raw alignments with exportAlignmentsPretty shows that data has the expected architecture, and sample preparation artefacts are not the reason of the problem (if this is the matter of the issue)

App version: 4.6.0; built=Sat Dec 09 11:48:42 PST 2023; rev=c9fafa41fe; lib=repseqio.v4.0

Expected Result

I have generated data using Smart-Seq-2 and am trying to run "mixcr analyze" with the "smart-seq2-vdj" preset.

Actual Result

The command returned the following error: Must contain at least one Cell tag (determined as tag name starting from "cell" (like "CELL1", "Cell", "CellId", etc..)).

This is occurring because I do not have the actual substring "cell" in my FASTQ file names, as is required. Instead, I use a combination of DNA barcode sequence and other identifying information to label each cell.

Request: Would it be possible to edit the preset so it can be used without having the substring "cell" in the fastq name?

Exact MiXCR commands

mixcr analyze smart-seq2-vdj --species hsa /usr/directory/B-HA_Sp40-44_p05c01r01_8_01_N19_TGGACTGGAACA_ATAAAGAACCCG-trimmed-pair1.fastq.gz /usr/directory/B-HA_Sp40-44_p05c01r01_8_01_N19_TGGACTGGAACA_ATAAAGAACCCG-trimmed-pair2.fastq.gz B-HA_Sp40-44_p05c01r01

MiXCR report files

None

mizraelson commented 10 months ago

Do you have a pair of FASTQ files for each cell?

krkathuria commented 10 months ago

Yes, I do.

mizraelson commented 10 months ago

Then, in the command, you should replace the part of the input file name that marks each well with {{CELL:a}}.

E.g.:

mixcr analyze smart-seq2-vdj \
    --species hsa \
    /usr/directory/{{CELL:a}}-trimmed-pair1.fastq.gz \
    /usr/directory/{{CELL:a}}-trimmed-pair2.fastq.gz \
    B-HA_Sp40-44_p05c01r01

If you run the command above, 44_p05c01r01_8_01_N19_TGGACTGGAACA_ATAAAGAACCCG will be treated as a cell ID. Does that make sense?

krkathuria commented 10 months ago

Hi, yes I understand. But unfortunately, since the FASTQ files for all cells are in a single directory, this may not be possible. The input between the different cells would be indistinguishable for mixcr.

I ran mixcr analyze rna-seq instead, followed by exportAirr to get one AIRR-formatted TSV per cell, which I concatenated into a single file for downstream analysis in scanpy. Do you see any reason this would be suboptimal compared to running the smart-seq2-vdj preset?
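
For reference, the concatenation step described above could look roughly like the sketch below. It assumes one AIRR TSV per cell, named `<cell>.airr.tsv`; that naming, the `concat_airr` helper, and the choice of `cell_id` as the added column name are illustrative assumptions and not something MiXCR produces or requires.

```python
import csv
from pathlib import Path


def concat_airr(tsv_paths, out_path, suffix=".airr.tsv"):
    """Merge per-cell AIRR TSV files into one table with a leading cell_id column.

    The cell ID is taken from each file name (hypothetical naming: <cell>.airr.tsv).
    Returns the number of data rows written.
    """
    fieldnames = ["cell_id"]
    rows = []
    for path in map(Path, tsv_paths):
        name = path.name
        # Strip the assumed suffix; fall back to the stem if it is absent.
        cell_id = name[: -len(suffix)] if name.endswith(suffix) else path.stem
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            # Collect the union of columns, preserving first-seen order.
            for column in reader.fieldnames or []:
                if column not in fieldnames:
                    fieldnames.append(column)
            for row in reader:
                row["cell_id"] = cell_id
                rows.append(row)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=fieldnames,
            delimiter="\t",
            restval="",             # columns missing from a file are left empty
            extrasaction="ignore",  # drop stray fields from malformed rows
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `concat_airr(sorted(Path("airr_out").glob("*.airr.tsv")), "all_cells.airr.tsv")` (directory name hypothetical) would write a single table that can be loaded for downstream analysis.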

mizraelson commented 10 months ago

All files should be in the same directory. That is the whole point. MiXCR will aggregate all files for all cells and process them all together, assigning a part of file name marked by `{{CELL:a}}` to each cell. In the end you will have a clonotype table where you will see a cell id for every clone. E.g.:

Cell ID	Clone
44_p05c01r01_8_01_N19_ACGTACGTACGT_ACGTACGTACGT	CloneA
46_p05c01r01_8_01_N19_GTCAGTCAGTCA_GTCAGTCAGTCA	CloneB
47_p05c01r01_8_01_N19_TGACTGACTGAC_TGACTGACTGAC	CloneC
48_p05c01r01_8_01_N19_CAGTCAGTCAGT_CAGTCAGTCAGT	CloneD

Just run the command below and check the output:

mixcr analyze smart-seq2-vdj \
    --species hsa \
    /usr/directory/{{CELL:a}}-trimmed-pair1.fastq.gz \
    /usr/directory/{{CELL:a}}-trimmed-pair2.fastq.gz \
    B-HA_Sp40-44_p05c01r01
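
To see in advance which part of each file name would end up as a cell ID, the matching can be approximated outside MiXCR. The sketch below is a minimal approximation that assumes `{{CELL:a}}` covers everything before the fixed `-trimmed-pair1.fastq.gz` suffix. It uses an ordinary regular expression rather than MiXCR's own pattern parser, so the IDs MiXCR reports may differ. The function names are hypothetical.

```python
import re


def cell_id_from_name(file_name, suffix="-trimmed-pair1.fastq.gz"):
    """Return the text before `suffix`, or None if the name does not end with it."""
    match = re.fullmatch("(?P<cell>.+)" + re.escape(suffix), file_name)
    return match.group("cell") if match else None


def preview_cell_ids(file_names, suffix="-trimmed-pair1.fastq.gz"):
    """Map each candidate cell ID to the file name it was taken from."""
    ids = {}
    for name in file_names:
        cell = cell_id_from_name(name, suffix)
        if cell is not None:
            ids[cell] = name
    return ids
```

For example, `preview_cell_ids(os.listdir("/usr/directory"))` (after `import os`) would list one entry per pair-1 file. If two cells map to the same key, or the dictionary is empty, the suffix used in the pattern needs adjusting.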
