There are duplicates of found alleles when executing `findAlleles`

omegahh commented 1 month ago

As we discussed before, I have many samples that were run with the generic-amplicon-with-umi preset. I then want to run findAlleles on all the assembled '.clna' files, but I get a 'There are duplicates of found alleles' error, which you can see in the log file. How can I resolve this problem? SlurmJob_POSTIMMU.383868.log

mizraelson commented 1 month ago

Is it the regular built-in human library?

omegahh commented 1 month ago

I always use the built-in library, as I described in #1790

omegahh commented 1 month ago

@mizraelson I have about 200 BCR samples from a large cohort. I run align first, then refineTagsAndSort, then assemble. After these steps, I run findAlleles on all of the '*.clna' files to call all potential alleles in the cohort, but I get a "duplicates of found alleles" error. The relevant commands are shown below:

Commands for each sample:

mixcr align -f -t 24 -Xmx160g -p generic-amplicon-with-umi -b default --species hsa --rna --tag-pattern "^cagtggtatcaacgcagagt(UMI:NNNNtNNNNtNNNN)tN{7:8}(R1:*)\^N{17}(R2:*)" --tag-max-budget 20 --rigid-left-alignment-boundary --floating-right-alignment-boundary C --assemble-clonotypes-by "[{FR1Begin:CDR1End},{CDR2Begin:FR4End}]" --json-report logs/Library.00_S4A08-UQ03-UT05.mixcr_align.json trim_demux/S4A08-UQ03-UT05_R1.fastq.gz trim_demux/S4A08-UQ03-UT05_R2.fastq.gz tmp/S4A08-UQ03-UT05.vdjca &>> logs/Library.00_S4A08.log
mixcr refineTagsAndSort -f -Xmx160g --json-report logs/Library.00_S4A08-UQ03-UT05.mixcr_refineTags.json tmp/S4A08-UQ03-UT05.vdjca trim_mixcr/S4A08-UQ03-UT05.vdjca &>> logs/Library.00_S4A08.log
mixcr assemble -f -Xmx160g --write-alignments --split-clones-by C --json-report logs/Library.00_S4A08-UQ03-UT05.mixcr_assemble.json trim_mixcr/S4A08-UQ03-UT05.vdjca trim_mixcr/S4A08-UQ03-UT05.clna &>> logs/Library.00_S4A08.log

Command for all ".clna" files:

mixcr findAlleles -Xmx512G -t 96 --force-overwrite --export-library MJBIO_HBCR_Alleles.json --export-alleles-mutations MJBIO_HBCR_Alleles.tsv --json-report MJBIO_HBCR_Alleles_log.json --output-template {file_dir_path}/{file_name}.clns *MJBIO_HBCR*/trim_mixcr/*.clna

output log:

omegahh commented 1 month ago

I actually have two questions:

I think findAlleles should be run on the data of the whole cohort, because 1) allele calling should benefit statistically from having more data, and 2) all output ".clns" files are realigned against a unified allele library, so the exported clonotypes are comparable. Am I right?
Should I run 'findAlleles' on BCR repertoires at all, considering that BCRs undergo somatic hypermutation? Is the mixcr 'findAlleles' command capable of distinguishing between SNPs and SHMs?

mizraelson commented 1 month ago

I haven't been able to replicate it yet. To answer your questions:

findAlleles should only be run on samples where you expect identical alleles—typically, this would be samples from the same donor. Mixing donors might result in incorrectly assigned alleles.
Yes, it is important to run findAlleles for BCR repertoires if you plan to investigate SHMs later. Without it, you might misidentify allelic variants as SHMs. The command is specifically designed to distinguish alleles from SHMs.
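A minimal sketch of the per-donor invocation described above. It assumes a hypothetical layout in which each donor's assembled files sit in their own directory, `<root>/<donor>/trim_mixcr/*.clna`; the function name and the output file names are illustrative only. The flags are the ones already used in the command earlier in this thread, and the function only prints the commands (a dry run) rather than executing them.

```shell
# Dry-run sketch: print one `mixcr findAlleles` command per donor directory.
# Assumption (hypothetical layout): <root>/<donor>/trim_mixcr/*.clna
# The function body runs in a subshell, so its variables stay local.
print_find_alleles_cmds() (
  root="$1"
  for donor_dir in "$root"/*/; do
    donor="$(basename "$donor_dir")"
    # Skip donors without assembled .clna files
    # (an unmatched glob stays literal, so the -e test fails).
    set -- "$donor_dir"trim_mixcr/*.clna
    [ -e "$1" ] || continue
    # Output names (<donor>_alleles.json etc.) are illustrative.
    printf 'mixcr findAlleles --force-overwrite --export-library %s_alleles.json --export-alleles-mutations %s_alleles.tsv --json-report %s_findAlleles.json --output-template {file_dir_path}/{file_name}.clns %strim_mixcr/*.clna\n' \
      "$donor" "$donor" "$donor" "$donor_dir"
  done
)
```

Calling `print_find_alleles_cmds /path/to/cohort` prints one command line per donor that has `.clna` files, so the commands can be reviewed before being run or submitted as separate jobs.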

omegahh commented 1 month ago

I see, so I was wrong to run 'findAlleles' on all samples from different donors together. But do you think it would be worthwhile (perhaps as a new command) to mine all potential alleles, especially de novo alleles, from big data? I mean, if I have a large cohort with an extremely large total number of clonotypes, I would like to pool them to mine de novo alleles. What's your opinion?

mizraelson commented 1 month ago

If each sample is from a separate organism, it is essential to process them separately using findAlleles. You can then aggregate all the information from the output tables into a single dataset, depending on what you intend to do with it later. Do you still see the issue if samples are processed separately?
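One way the aggregation step could look, as a hedged sketch: it concatenates per-donor tables into a single TSV and prepends a `donor` column. It assumes each table is named `<donor>_alleles.tsv` (a hypothetical naming scheme) and that all tables share the same header line; it makes no assumption about the actual column names written by `--export-alleles-mutations`.

```shell
# Merge per-donor allele tables into one TSV with a leading "donor" column.
# Assumptions: files are named <donor>_alleles.tsv (hypothetical) and
# every file starts with the same header line.
# Usage: merge_allele_tables OUT.tsv donor1_alleles.tsv donor2_alleles.tsv ...
merge_allele_tables() (
  out="$1"; shift
  first=1
  for tsv in "$@"; do
    donor="$(basename "$tsv" _alleles.tsv)"
    awk -v donor="$donor" -v first="$first" 'BEGIN { FS = OFS = "\t" }
      # Keep the header only from the first table.
      NR == 1 { if (first == 1) print "donor", $0; next }
      { print donor, $0 }' "$tsv"
    first=0
  done > "$out"
)
```

Keeping the donor as an explicit column means alleles found independently in several donors can still be compared or counted downstream without merging the underlying samples.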

omegahh commented 1 month ago

No, processing them separately works fine. Thank you for your explanation!

milaboratory / mixcr

There are duplicates of found alleles when executing `findAlleles` #1823