Error when running `mixcr analyze` with `--infer-sample-table`

mrbarbitoff commented 1 year ago

Dear mixcr developers,

I think I'm missing something obvious here, so let me apologize for it in advance. I ran into a problem when trying to analyze a set of several patient sample using mixcr v. 4.3.1. I want to run the analysis of these data as one set of sequences and then split by sample upon export, as mentioned in the documentation and some of the recent patchnotes.

I have a pair of FASTQ files for each sample, the file names look like this:

hTRA10_1.fq.gz
hTRA10_2.fq.gz
hTRA12_1.fq.gz
hTRA12_2.fq.gz

The data were generated using the MiLaboratories Human TCR RNA RACE kit.

I am trying to use file name expansion as shown in the documentation, as follows:

mixcr analyze --threads 24 \
              milab-human-tcr-rna-race-cdr3 \
              --infer-sample-table  \
              ./TRA_test/{{SAMPLE:a}}_{{R}}.fq.gz \
              TRA_results/output

The align process finishes with no errors. The log even shows some statistics:

Matched with sample table: 17055364 (95.66%)
Sample stats:
  hTRA10: 9618171 (56.39%)
  hTRA12: 7437193 (43.61%)

However, refineTagsAndSort throws the following error after UMI processing:

picocli.CommandLine$ExecutionException: Error while running command refineTagsAndSort java.lang.IllegalArgumentException: Unexpected filter tag request. Requested UMI but next level is SAMPLE.
    at com.milaboratory.mixcr.cli.Main.registerExceptionHandlers$lambda-13(SourceFile:320)
    at picocli.CommandLine.execute(CommandLine.java:2088)
    at com.milaboratory.mixcr.cli.CommandAnalyze$Cmd.run0(SourceFile:365)
    at com.milaboratory.mixcr.cli.MiXCRCommand.run(SourceFile:37)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at com.milaboratory.mixcr.cli.Main.registerLogger$lambda-27(SourceFile:409)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at com.milaboratory.mixcr.cli.Main.main(SourceFile:100)
Caused by: java.lang.IllegalArgumentException: Unexpected filter tag request. Requested UMI but next level is SAMPLE.
    at com.milaboratory.o.hg$b.getKeyExtractor(SourceFile:952)
    at com.milaboratory.mitool.refinement.gfilter.KeyedFilterContext.getKeyExtractor(SourceFile:97)
    at com.milaboratory.mitool.refinement.gfilter.GroupFilter.filter(SourceFile:297)
    at com.milaboratory.o.hg.c(SourceFile:655)
    at com.milaboratory.mixcr.cli.CommandRefineTagsAndSort$Cmd.run1(SourceFile:350)
    at com.milaboratory.mixcr.cli.MiXCRCommandWithOutputs.run0(SourceFile:69)
    at com.milaboratory.mixcr.cli.MiXCRCommand.run(SourceFile:37)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at com.milaboratory.mixcr.cli.Main.registerLogger$lambda-27(SourceFile:409)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    ... 12 more

I suppose the issue has something to do with the tagPattern which does not include any SAMPLE tag for the preset I use. I tried to fix it by switching to generic preset and attempting to set a custom tag pattern with the SAMPLE group in it but, obviously, this was not successful without changes to the read data itself.

I should also add that the analyze command finishes correctly and produces adequate results when I run it using each pair of files separately without --infer-sample-table and filename expansion.

I would be very grateful if you could tell me how I should run the analyze command in my case (if it is truly possible to analyze all samples as a single pool).

Thank you in advance for your help!

Kind regards, Yury

mizraelson commented 1 year ago

Hi, whats the idea behind analyzing all samples as a single set of sequences?

mrbarbitoff commented 1 year ago

Hi!

My idea was that a joint analysis of a set of samples may help to increase the statistical power and improve the overlap between the samples (similarly to what pooling is for in, for example, microbiome analysis pipelines). I assumed that the line "In many cases it may be useful or even required to analyze samples from different patients at once." in the documentation indicates that such an analysis strategy is possible. If this is not the case, than we'll stick with independent analysis of individual samples.

Thank you for your response!

mizraelson commented 1 year ago

Hi, Statistical analysis of TCR/BCR data is rather tricky. In general pooling samples decreases statistical significance, as the occurrence of a clone in two samples has more power then increasing the reads/UMI count for that clone if the samples are pooled together. If you can tell more about the experiment setting you have (what is the origin of different samples and what do you want to compare) I can guide you with the best approach. Usually pooling the data from separate patients is not something you wanna do.

Nevertheless: The issue with your command is a bit more complicated then just a tagPattern. MiXCR has a set of filters based on the tags provided, and both milab-human-tcr-rna-race-cdr3 and generic-tcr-amplicon-umi rely on the fact that only UMI tag is present. It is possible to do what you want using a custom preset (which we can provide as a .yaml file), where we will specify all the tags for filtering, and I can help you with that if needed, but again, it would be helpful if you can tell more about the actual case you have.

mrbarbitoff commented 1 year ago

Hi!

Thank you for the clarification. Our design is rather simple - we have two groups of samples (the same patients before and after the treatment), and we want to identify clones that are differentially abundant between the groups. However, the idea to pool the samples together actually appeared due to another strange observation which I have to explain below.

Some time ago, a colleague of mine ran the analysis of the data using the Mixcr Pro v.1.0 GUI. The results were quite nice, and we even managed to identify the clones we were looking for.

However, we wanted to perform some follow-up analysis of the data, and (as I don't have access to the GUI) I ran mixcr v.4.3 from the command line with the same time. We were badly surprised to see that the results were sibstantially different in terms of the number of clones. At the same time, nearly all clones identifiedd by mixcr v.4.3 were present in the output of the mixcr Pro. Below is the a comparison of the results for several samples:

Sample  Pro v4  Common  Overlap
hTRA10  101503  82815   82042   0.990666
hTRA11  54955   36118   35739   0.989507
hTRA12  69709   49156   48833   0.993429

The numbers shown are unique productive CDR3 counts. Both tools were ran with the default settings without any additional parameter optimization.

The numbers by itself are not that important, but the clones which showed differential abundance (and were present in almost all samples) in the mixcr Pro output were identified in only a subset of samples by the mixcr v.4.3, and no differentially abundant clones were present in the data from the newer version. This observation led us to hypothesize that some modification of the analysis might increase the sensitivity of the method and help to increase the number of clones to match the ones obtained with mixcr pro tool.

We'd be grateful if you could clarify the possible source of the difference between the two versions and, if the discrepancies can not be eliminated, which of the two outputs should be trusted.

P.S. We noticed that the tag pattern used during mixcr Pro analysis is slightly different compared to the one set in the milab-human-tcr-rna-race-cdr3 present. I tried running the tool with the generic-tcr-amplicon-umi preset and the tag pattern shown in the mixcr pro logs - with no positive changes in the number of identified clones. The tag pattern from preset: '^(R1:*) \ ^tggtatcaacgcagagt(UMI:N{14})N{10}(R2:*)' The tag pattern from mixcr Pro (modified to match the syntax): '^(R1:*) \ ^tggtatcaacgcagagt(UMI:NNNNtNNNNtNNNN)tc(R2:*)'

Yury

mrbarbitoff commented 1 year ago

Hi!

Do you need any additional information regarding the problem I described in my last comment?

Thank you in advance for your help!

Yury

mizraelson commented 1 year ago

Hi Yury, Sorry for such a late reply it's been a very busy time for us. First of all I don't recommend pooling patients data in your case as it will indeed decrease statistical power (finding similar clones abundance increase in two different samples is more promising). Mixcr Pro v.1.0 GUI uses MiXCR v.3.0.13 under the hood which is a relatively outdated version and a lot of algorithms have been upgraded since to achieve more robust results. Also the barcode correction and artificial diversity elimination works in a completely different way now. In general, I would say that the latest version is more trustworthy. What was the count of those clones? How did you calculate the differential expression? Did you normalize your data by the number of UMIs to correct for the biases ?

milaboratory / mixcr

Error when running `mixcr analyze` with `--infer-sample-table` #1081