Closed ot0tot closed 6 years ago
Hi!
MiXCR assembles only sequences fully covered by your sequencing data, clones are assembled based on the specified assemblingFeatures
(see here). Only features covered by assemblingFeatures
, are available for export from clonotypes.
So, you basically should do the following (this is an example pipeline, please use parameters appropriate for your input data):
mixcr align R1.fastq.gz R2.fastq.gz alignments.vdjca
mixcr assemble -OassemblingFeatures='FR3+CDR3+FR4' alignments.vdjca clones.clns
mixcr exportClones -count -aaFeature 'FR3+CDR3+FR4' clones.clns clones.txt
Please note that all the sequences (PE reads) that don't fully cover 'FR3+CDR3+FR4'
will be discarded on the assemble step.
MiXCR 2.2 (in development now) allows to assemble contigs from several reads (e.g. like Spades or similar software, in some sense). If what I described above does not solve your problem, I can help you with pre-release MiXCR 2.2. Though, this functionality is still in development, so I can't give any guarantees that it will work correctly.
Thank you for the quick reply!
I ran mixcr assemble
using -OassemblingFeatures='FR3+CDR3+FR4'
and am getting the desired output for reads that fully cover 'FR3+CDR3+FR4'
. However, as you mentioned, >90% of reads are dropped at the assembly step due to not including a full FR3 or FR4 sequence.
Is there any way to recover these reads (or export CDR3 + as much flanking sequence as possible)? In cases where there are multiple reads of varying length, exporting only the longest assembled contig would be fine.
I would be happy to give MiXCR 2.2 a shot, understanding that it is still in development.
Thanks again!
Hi!
Yes, new mixcr can do exactly what you described.
Please download latest binary from here: http://files.milaboratory.com/mixcr/mixcr-220h.zip
The pipeline is the following:
Align according to the data type you have (please let me know if you are not sure about the parameters on this stage):
mixcr align -r report.txt ... reads_R1.fastq.gz reads_R2.fastq.gz alignments.vdjca
Assemble clonotypes by CDR3
with -a
option, to output results in a clna
format (containing clones and corresponding alignments; basically this file contains a great deal of the information from the vdjca
, so it will be even bigger):
mixcr assemble -r report.txt -a alignments.vdjca clones.clna
Assemble contigs:
mixcr assembleContigs ... clones.clna clones.clns
Here you can use:
CDR3
.-OalignedRegionsOnly=true
in this case assembled sequence will be limited by only regions covered by alignments.You can also add -OsubCloningRegion=VDJRegion
if you want MiXCR to split existing clonotypes according to the sequence outside the CDR3
(e.g. split clonotypes that differs by hypermutations). Several thresholds and error correction mechanisms should correct for most of the artificial diversity here, but this mechanism was never properly tested, so please beware. Let me know if you see artificial diversity (too many clones).
Use exportClones
to extract clones from clns
file the same way you did it previously. You should also examine the results with exportClonesPretty
, e.g.:
mixcr exportClonesPretty -n 5 clones.clns
-n
parameter tells how many top clones to output. Please let me know if you notice something strange.
Please let me know the results.
Edit: I believe this is resolved. See below.
I'm running into a problem at the assembly stage, where after assembling the initial clonotypes I get a java memory error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.milaboratory.core.tree.SequenceTreeMap$Node.getOrCreate(SequenceTreeMap.java:209)
at com.milaboratory.core.tree.SequenceTreeMap.createIfAbsent(SequenceTreeMap.java:56)
at com.milaboratory.mixcr.assembler.CloneAssembler.beginMapping(CloneAssembler.java:208)
at com.milaboratory.mixcr.assembler.CloneAssemblerRunner.run(CloneAssemblerRunner.java:86)
at com.milaboratory.mixcr.cli.ActionAssemble.go(ActionAssemble.java:116)
at com.milaboratory.cli.JCommanderBasedMain.main(JCommanderBasedMain.java:155)
at com.milaboratory.mixcr.cli.Main.main(Main.java:131)
Running on a server with 16 GB of RAM. The .vdjca alignments I'm working with are 6-7 GB.
Any ideas?
Update: Set default to Java 9 (from Java 8) and it seems to be working.
Having trouble now at the assembleContigs step:
Assembling: 0%
Exception in thread "main" java.lang.IllegalArgumentException at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.createPointSequence(FullSeqAssembler.java:1351)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.createPointSequence(FullSeqAssembler.java:1346)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.toPointSequencesByAlignments(FullSeqAssembler.java:1306)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.toPointSequences(FullSeqAssembler.java:1255)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.lambda$toPointSequences$8(FullSeqAssembler.java:1188)
at java.base/java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180)
at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:699)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:511)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.toPointSequences(FullSeqAssembler.java:1190)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.calculateRawData(FullSeqAssembler.java:986)
at com.milaboratory.mixcr.cli.ActionAssembleContigs.lambda$go$1(ActionAssembleContigs.java:79)
at cc.redberry.pipe.blocks.ParallelProcessor$Worker.run(ParallelProcessor.java:304)
at java.base/java.lang.Thread.run(Thread.java:844)
Is this a bug in 2.2?
Update: Some alignments make it a little further, but give a different error:
Assembling: 0%
Assembling: 10% ETA: 00:18:01
Exception in thread "Thread-1" Exception in thread "main" java.lang.AssertionError: i: 0, j: 0, columnDelta: -58, rowFactor: -232
at com.milaboratory.core.alignment.BandedMatrix.set(BandedMatrix.java:70)
at com.milaboratory.core.alignment.BandedLinearAligner.alignRightAdded0(BandedLinearAligner.java:144)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.alignLinearSeq1FromLeft(FullSeqAssembler.java:752)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.buildClone(FullSeqAssembler.java:680)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.lambda$callVariants$0(FullSeqAssembler.java:221)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1494)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:550)
at java.base/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:450)
at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.callVariants(FullSeqAssembler.java:222)
at com.milaboratory.mixcr.cli.ActionAssembleContigs.lambda$go$1(ActionAssembleContigs.java:104)
at cc.redberry.pipe.blocks.ParallelProcessor$Worker.run(ParallelProcessor.java:304)
at java.base/java.lang.Thread.run(Thread.java:844)
java.lang.AssertionError
at com.milaboratory.mixcr.cli.ActionAssembleContigs.go(ActionAssembleContigs.java:113)
at com.milaboratory.cli.JCommanderBasedMain.main(JCommanderBasedMain.java:155)
at com.milaboratory.mixcr.cli.Main.main(Main.java:131)
Can you share the datasets causing the error, so I can reproduce the error? You can write an email to bolotin.dmitriy at gmail.com. Dropbox / google drive? Or I can spin up sftp server for you (please post or email your id_rsa.pub if this is preferable)?
I emailed you a link to the files the other day. Please let me know if you did not receive it.
Thanks!
Hi, yes, I received the link, looking into it right now.
I fixed both of the problems.
Please try this build: http://files.milaboratory.com/mixcr/mixcr-220j.zip
Thanks for reporting!
Hi, sorry! Seems that I found one more bug. Please wait till I fix it.
Please try this one: http://files.milaboratory.com/mixcr/mixcr-220k.zip
MiXCR 220k is working well with no errors so far. Thanks so much for the quick response and bug fixes!
One thing I have noticed is that the assembleContigs
step is very slow (specifically the Writing clones
part) using the default options outlined above.
For example, this mixcr assembleContigs
run took ~18 hours to complete:
============= Report ==============
Initial clonotype count: 308267
Final clonotype count: 308267 (100%)
Longest contig length: 402
Clustered variants: 0 (0%)
Reads in clustered variants: 0.0 (0%)
Reads in divided (newly created) clones: 0.0 (0%)
Is this expected behavior? Running on an i7-3770 CPU @ 3.40GHz, 16GB RAM.
Yes I also noticed it. I will fix it today or tomorrow.
Hi!
Thanks for the raw data! Now everything seems to be working properly.
http://files.milaboratory.com/mixcr/mixcr-220l.zip
With this version you have to reanalyse all the data from scratch, all file formats changed.
Thanks, the new version is working great and is much faster!
However, I am still running into the same issue. If I export alignments using exportClonesPretty, I get the in-frame AA translation displayed along with the coding nt sequence aligned with the germline gene segments. How can I export just the AA translation for each clone? If I use -aaFeature, much of the sequence is missing due to not having a complete FR region.
Also, is it possible to use these alignments for a phylogenetic analysis or mapping of Ab lineage? It seems like the clns files contain enough information to perform such analysis, but the data format is incompatible with any tools I have found so far. Are you aware of any options?
Thanks again for the quick responses and all of the hard work!
Mixcr v2.20l is working great, but I am still having difficulties outputting the full AA sequences of identified clones.
For example, here is some output from exportClonesPretty
:
>>> Clone id: x
>>> Abundance, reads (fraction): x (0.xxx)
L2><FR1
_ A A A Q P A M A Q S
Quality 233233111141314411 4111111155115
Target0 0 GGCTGCTGCCCAGCCGGC-GATGGCCCAGTCG 30 Score (hit score)
IGHV1S44*01 0 cAgTcG 5 -28 (432)
FR1><CDR1
_ E S G G R L V T P G T P L T L T C T V S G I D L S S
Quality 54554521346523335433415255545400050556553554555565533536344556632552253564243123
Target1 0 GGAGTCCGGGGGTCGCCTGGTCACGCCTGGGACACCCCTGACACTCACCTGCACAGTCTCTGGAATCGACCTCAGTAGCT 79 Score (hit score)
IGHV1S44*01 11 ggagtccgggggtcgcctggtAacgcctggAGGaTccctgacactcacctgcacagtctctggaatcgacctcaCtagct 90 450 (432)
CDR1><FR2
Y W M S W V R Q A P G K G L E _
Quality 2232 24542234554455555552645555644658735499924
Target1 80 ACTG-GATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAAT 124 Score (hit score)
IGHV1S44*01 91 a-tgCAatgGgctgggtccgccaggctccagggaaggggctggaat 135 450 (432)
V>
FR3><CDR3
K T S S T T V D L K M T S L T T E D T A T Y F C A R H
Quality 43422545545545544543255555423345545644555545545654555555566545354454211111111111
Target2 0 AAAACCTCGTCGACCACGGTGGATCTGAAAATGACCAGTCTGACAACCGAGGACACGGCCACCTATTTCTGTGCCAGACA 79 Score (hit score)
IGHV1S44*01 207 aaaacctcgtcgaccacggtggatctgaaaatgaccagtctgacaaccgaggacacggccacctatttctgtgc 280 370 (432)
<D D><J CDR3><FR4
N S A Y D L W G Q G T L V T I S _
Quality 11111111111111111111115666446564566556455560454405
Target2 80 TAATAGTGCTTATGACTTGTGGGGCCAAGGCACCCTGGTCACCATCTCCT 129 Score (hit score)
IGHD6-1*01 51 tgcttatg 58 40 (40)
IGHD1-1*01 45 tagtgGttat 54 34 (34)
IGHJ4*01 29 acttgtggggccCaggcaccctggtcaccGtctcct 64 148 (123)
I would like to export only the AA sequence (the line above "Quality"), resulting in a file like:
>>> Clone id: x Abundance x (0.xxx)
>AAAQPAMAQS
>ESGGRLVTPGTPLTLTCTVSGIDLSSYWMSWVRQAPGKGLE
>KTSSTTVDLKMTSLTTEDTATYFCARHNSAYDLWGQGTLVTIS
Using something like mixcr exportClones -count -aaFeature 'FR1+CDR1+FR2+CDR2FR3+CDR3+FR4' clones.clns clones.txt
with the above would only output AA sequence for CDR1 and CDR3, as they are the only complete features.
I can write a script to parse the output from exportClonesPretty
, but I'm sure there is a more simple and elegant solution.
@ot0tot thanks for the important feedback! you are right, there is still no simple way to export full length for each clone except exportClonesPretty
, and we are now implementing this; will let you know when finish.
I just created a dedicated issue for this feature, please let us know your thoughts in the comments: #389. At least vote for one of the options.
Hi, I was using mixcr-220l, trying to get the full length sequence based on alignments. However, when I'm running mixcr assembleContigs clones.clna clones.clns
it gives me below errors: Assembling: 0% Exception in thread "main" java.lang.RuntimeException: While processing clone #0 at com.milaboratory.mixcr.cli.ActionAssembleContigs.lambda$go$1(ActionAssembleContigs.java:107) at cc.redberry.pipe.blocks.ParallelProcessor$Worker.run(ParallelProcessor.java:304) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at com.milaboratory.primitivio.PrimitivI.readObject(PrimitivI.java:104) at com.milaboratory.mixcr.basictypes.IO$VDJCAlignmentsSerializer.read(IO.java:99) at com.milaboratory.mixcr.basictypes.IO$VDJCAlignmentsSerializer.read(IO.java:81) at com.milaboratory.primitivio.PrimitivI.readObject(PrimitivI.java:87) at com.milaboratory.primitivio.PipeDataInputReader.take(PipeDataInputReader.java:29) at cc.redberry.pipe.CUtils$OPIterator.hasNext(CUtils.java:420) at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.calculateRawData(FullSeqAssembler.java:1002) at com.milaboratory.mixcr.cli.ActionAssembleContigs.lambda$go$1(ActionAssembleContigs.java:80) ... 2 more
The report of previous assemble step Analysis date: Thu Aug 16 17:36:23 CDT 2018 Input file(s): tcrb_alignment.vdjca Output file(s): clones.clna Version: 2.2-SNAPSHOT; built=Wed Apr 18 02:19:31 CDT 2018; rev=e40d713; lib=repseqio.v1.5 Command line arguments: assemble -r report.txt -f -a tcrb_alignment.vdjca clones.clna Analysis time: 42.30m Final clonotype count: 5506 Average number of reads per clonotype: 1819.22 Reads used in clonotypes, percent of total: 10016603 (41.49%) Reads used in clonotypes before clustering, percent of total: 12553677 (52%) Number of reads used as a core, percent of used: 9293879 (74.03%) Mapped low quality reads, percent of used: 3259798 (25.97%) Reads clustered in PCR error correction, percent of used: 2537074 (20.21%) Reads pre-clustered due to the similar VJC-lists, percent of used: 0 (0%) Reads dropped due to the lack of a clone sequence: 195075 (0.81%) Reads dropped due to low quality: 152835 (0.63%) Reads dropped due to failed mapping: 6940202 (28.75%) Reads dropped with low quality clones: 0 (0%) Clonotypes eliminated by PCR error correction: 20822 Clonotypes dropped as low quality: 0 Clonotypes pre-clustered due to the similar VJC-lists: 0
Could you let me know what's the issue? Thank you!
Kevin
Please try latest build, we fixed several IO errors since 220l
:
https://s3.amazonaws.com/files.milaboratory.com/mixcr/mixcr-220m.zip
I would like to generate a list of V Region AA sequences in FASTA format to use for mass spec analysis, but running mixcr exportClones with -aaFeature set to anything besides CDR3 gives me no output. I have also tried concatenating features (eg. FR3+CDR3+FR4) and get no output.
Is it possible to use -aaFeature VRegion or -aaFeature VTranscript? What would be the best way of exporting in-frame AA translations of the assembled clones?