milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com
Other
335 stars 79 forks source link

second round of assemblePartial very long #272

Closed mcieslik-mctp closed 5 years ago

mcieslik-mctp commented 7 years ago

I am running MiXCR on RNA-seq data from multiple myeloma patients, and noticed a very unpredictable run-time for some sample. Specifically the assemblePartial rounds can take 10s of hours (sometimes days, sometimes I just give up) for select samples. Not sure if this can be considered a bug, but I would appreciate suggestions how to work-around.

Alignment:

Analysis Date: Sat Aug 26 04:06:46 UTC 2017
Input file(s): ./fileHBW92cfq.gz,./filev3p7Hdfq.gz
Output file: ./fileroYz3V
Version: 2.1.2; built=Thu Apr 20 09:22:25 UTC 2017; rev=6abb137; lib=repseqio.v1.2
Analysis time: 10.78m
Command line arguments: align -f -t 16 -g -s hsa -p rna-seq -OallowPartialAlignments=true -r /job/out/MM_5323-capt-SI_17915-HMVY5BCXY_alig_csort_mixcr-aln.rep.txt ./fileHBW92cfq.gz ./filev3p7Hdfq.gz ./fileroYz3V
Total sequencing reads: 7234791
Successfully aligned reads: 916022 (12.66%)
Alignment failed, no hits (not TCR/IG?): 5421002 (74.93%)
Alignment failed because of absence of CDR3 parts: 798031 (11.03%)
Alignment failed because of low total score: 99736 (1.38%)
Overlapped: 926729 (12.81%)
Overlapped and aligned: 162864 (2.25%)
Overlapped and not aligned: 763865 (10.56%)
 chains: 13 (0%)
TRA chains: 124 (0.01%)
TRB chains: 166 (0.02%)
TRD chains: 19 (0%)
TRG chains: 201 (0.02%)
TRA,TRD chains: 3 (0%)
IGH chains: 360114 (41.82%)
IGK chains: 443639 (51.52%)
IGL chains: 56815 (6.6%)

assemblePartial:

Analysis Date: Sun Aug 27 06:19:39 UTC 2017
Input file(s): ./filembYYSV
Output file: ./fileZMPgzW
Version: 2.1.2; built=Thu Apr 20 09:22:25 UTC 2017; rev=6abb137; lib=repseqio.v1.2
Analysis time: 673.10m
Command line arguments: assemblePartial -f -r /job/out/MM_5323-capt-SI_17915-HMVY5BCXY_alig_csort_mixcr-fix.rep.txt ./filembYYSV ./fileZMPgzW
Total alignments analysed: 838104
Number of output alignments: 837547 (99.93%)
Alignments already with CDR3 (no overlapping is performed): 405937 (48.44%)
Successfully overlapped alignments: 557 (0.07%)
Left parts with too small N-region (failed to extract k-mer): 100162 (11.95%)
Dropped due to wildcard in k-mer: 0 (0%)
Dropped due to too short NRegion parts in overlap: 9913 (1.18%)
Dropped overlaps with empty N region due to no complete NDN coverage: 112 (0.01%)
Number of left-side alignments: 134237 (16.02%)
Number of right-side alignments: 222147 (26.51%)
Complex overlaps: 0 (0%)
Over-overlaps: 2709 (0.32%)
Partial alignments written to output: 431053 (51.43%)
dbolotin commented 7 years ago

Sorry for delayed response.

Thanks for reporting! The sample seem to be very enriched. We never observed such enrichment level, though, we never have samples with such extreme overall BCR expression level as multiple myeloma. MiXCR RNA-Seq pipeline was optimised for conventional RNA-Seq (with ≲0.1% of target molecules). Though, we know about low performance on heavily enriched data (we observed it on in-silica generated samples), and have plans to increase the performance for the case with better algorithms.

What is your overall goal? If the malignancy is not polyclonal, and you are interested only in top clonotypes, you can just analyse small portion of the sample, this should be much faster. Execution time with such many reads degrades to ~ O(N^2), so by decreasing the number of reads 10 fold, you should gain 100x fold decrease in execution time.

If you are still interested in TCR-s present in the sample, please see help page for the filterAlignments (mixcr filterAlignments -h).

Summarizing all, you can do something like this:

mixcr align -s hsa -p rna-seq -OallowPartialAlignments=true input_R1.fastq.gz input_R2.fastq.gz all_initial_alignments.vdjca

# Extracting all TCR alignments
mixcr filterAlignments -c TCR all_initial_alignments.vdjca tcr_initial_alignments.vdjca

# Extracting IG alignments form first 500 000 reads
mixcr filterAlignments -n 500000 -c IG all_initial_alignments.vdjca ig_initial_alignments.vdjca

...

And then separately, perform assemblePartial and assemble for each of these files.

mcieslik-mctp commented 7 years ago

Thank you for the info, I will follow your suggestion. We are doing CD138 selection and sometimes we end up with an almost pure population of B-cells. Can I use the mergeAlignments function to combine the tcr and ig subsets back?

dbolotin commented 7 years ago

Yes! This, exactly, was my thought, to suggest you merging those two files, right after I closed my laptop after writing previous message. This should simplify further analysis.

mcieslik-mctp commented 6 years ago

Thanks. I did some more tests and it appears that the -n / --limit switch does not do anything i.e. it does not limit the number of alignments returned (tested on the latest version)

e.g.:

 -> mixcr filterAlignments -f -n 100000 -c IG fileuwBa90  aaaa
Filtering: 0%
Filtering: 10.2%  ETA: 00:03:05
Filtering: 20.3%  ETA: 00:03:32
Filtering: 30.5%  ETA: 00:02:56
Filtering: 40.5%  ETA: 00:02:22
Filtering: 50.6%  ETA: 00:02:02
Filtering: 60.8%  ETA: 00:01:47
Filtering: 71.1%  ETA: 00:00:50
Filtering: 81.5%  ETA: 00:00:59
Filtering: 91.8%  ETA: 00:00:24
Written 8974304 alignments (8974814 alignments considered in total)

Am I doing something wrong?

dbolotin commented 6 years ago

Confirm, it is a bug. We will fix it soon.

dbolotin commented 6 years ago

Please try this one: http://files.milaboratory.com/mixcr/mixcr-2.1.6-SNAPSHOT.zip

mcieslik-mctp commented 6 years ago

Thanks for the super quick fix! Tests are running, I tried fixing the source code myself by introducing a simple break in the writer loop, and although it worked for small samples it hung with what looks like a race condition on my problematic ones.

mcieslik-mctp commented 6 years ago

Tested the 2.1.6-SNAPSHOT version and it works great, after limiting to 250k reads assemblePartial completes within 20min.

dbolotin commented 6 years ago

Good! I will leave this issue opened until we review the base assemblePartial procedure, there must be a better algorithm for this, one that can handle the whole dataset.