milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com
Other
336 stars 79 forks source link

Question: Is it possible to export full AA sequences for aligned V regions/clones? #372

Closed ot0tot closed 6 years ago

ot0tot commented 6 years ago

I would like to generate a list of V Region AA sequences in FASTA format to use for mass spec analysis, but running mixcr exportClones with -aaFeature set to anything besides CDR3 gives me no output. I have also tried concatenating features (eg. FR3+CDR3+FR4) and get no output.

Is it possible to use -aaFeature VRegion or -aaFeature VTranscript? What would be the best way of exporting in-frame AA translations of the assembled clones?

dbolotin commented 6 years ago

Hi!

MiXCR assembles only sequences fully covered by your sequencing data, clones are assembled based on the specified assemblingFeatures (see here). Only features covered by assemblingFeatures, are available for export from clonotypes.

So, you basically should do the following (this is an example pipeline, please use parameters appropriate for your input data):

mixcr align R1.fastq.gz R2.fastq.gz alignments.vdjca
mixcr assemble -OassemblingFeatures='FR3+CDR3+FR4' alignments.vdjca clones.clns
mixcr exportClones -count -aaFeature 'FR3+CDR3+FR4' clones.clns clones.txt

Please note that all the sequences (PE reads) that don't fully cover 'FR3+CDR3+FR4' will be discarded on the assemble step.

MiXCR 2.2 (in development now) allows to assemble contigs from several reads (e.g. like Spades or similar software, in some sense). If what I described above does not solve your problem, I can help you with pre-release MiXCR 2.2. Though, this functionality is still in development, so I can't give any guarantees that it will work correctly.

ot0tot commented 6 years ago

Thank you for the quick reply!

I ran mixcr assemble using -OassemblingFeatures='FR3+CDR3+FR4' and am getting the desired output for reads that fully cover 'FR3+CDR3+FR4'. However, as you mentioned, >90% of reads are dropped at the assembly step due to not including a full FR3 or FR4 sequence.

Is there any way to recover these reads (or export CDR3 + as much flanking sequence as possible)? In cases where there are multiple reads of varying length, exporting only the longest assembled contig would be fine.

I would be happy to give MiXCR 2.2 a shot, understanding that it is still in development.

Thanks again!

dbolotin commented 6 years ago

Hi!

Yes, new mixcr can do exactly what you described.

Please download latest binary from here: http://files.milaboratory.com/mixcr/mixcr-220h.zip

The pipeline is the following:

  1. Align according to the data type you have (please let me know if you are not sure about the parameters on this stage):

    mixcr align -r report.txt ... reads_R1.fastq.gz reads_R2.fastq.gz alignments.vdjca
  2. Assemble clonotypes by CDR3 with -a option, to output results in a clna format (containing clones and corresponding alignments; basically this file contains a great deal of the information from the vdjca, so it will be even bigger):

    mixcr assemble -r report.txt -a alignments.vdjca clones.clna
  3. Assemble contigs:

    mixcr assembleContigs ... clones.clna clones.clns

    Here you can use:

    • default parameters, in this case MiXCR will try to assemble all the sequence information it can, including the regions that were not aligned with V/J/C genes outside CDR3.
    • add -OalignedRegionsOnly=true in this case assembled sequence will be limited by only regions covered by alignments.

    You can also add -OsubCloningRegion=VDJRegion if you want MiXCR to split existing clonotypes according to the sequence outside the CDR3 (e.g. split clonotypes that differs by hypermutations). Several thresholds and error correction mechanisms should correct for most of the artificial diversity here, but this mechanism was never properly tested, so please beware. Let me know if you see artificial diversity (too many clones).

  4. Use exportClones to extract clones from clns file the same way you did it previously. You should also examine the results with exportClonesPretty, e.g.:

    mixcr exportClonesPretty -n 5 clones.clns

    -n parameter tells how many top clones to output. Please let me know if you notice something strange.

Please let me know the results.

ot0tot commented 6 years ago

Edit: I believe this is resolved. See below.

I'm running into a problem at the assembly stage, where after assembling the initial clonotypes I get a java memory error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.milaboratory.core.tree.SequenceTreeMap$Node.getOrCreate(SequenceTreeMap.java:209)
    at com.milaboratory.core.tree.SequenceTreeMap.createIfAbsent(SequenceTreeMap.java:56)
    at com.milaboratory.mixcr.assembler.CloneAssembler.beginMapping(CloneAssembler.java:208)
    at com.milaboratory.mixcr.assembler.CloneAssemblerRunner.run(CloneAssemblerRunner.java:86)
    at com.milaboratory.mixcr.cli.ActionAssemble.go(ActionAssemble.java:116)
    at com.milaboratory.cli.JCommanderBasedMain.main(JCommanderBasedMain.java:155)
    at com.milaboratory.mixcr.cli.Main.main(Main.java:131)

Running on a server with 16 GB of RAM. The .vdjca alignments I'm working with are 6-7 GB.

Any ideas?

Update: Set default to Java 9 (from Java 8) and it seems to be working.

ot0tot commented 6 years ago

Having trouble now at the assembleContigs step:

Assembling: 0%
Exception in thread "main" java.lang.IllegalArgumentException at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.createPointSequence(FullSeqAssembler.java:1351)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.createPointSequence(FullSeqAssembler.java:1346)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.toPointSequencesByAlignments(FullSeqAssembler.java:1306)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.toPointSequences(FullSeqAssembler.java:1255)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.lambda$toPointSequences$8(FullSeqAssembler.java:1188)
    at java.base/java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180)
    at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
    at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:699)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:511)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.toPointSequences(FullSeqAssembler.java:1190)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.calculateRawData(FullSeqAssembler.java:986)
    at com.milaboratory.mixcr.cli.ActionAssembleContigs.lambda$go$1(ActionAssembleContigs.java:79)
    at cc.redberry.pipe.blocks.ParallelProcessor$Worker.run(ParallelProcessor.java:304)
    at java.base/java.lang.Thread.run(Thread.java:844)

Is this a bug in 2.2?

Update: Some alignments make it a little further, but give a different error:

Assembling: 0%
Assembling: 10%  ETA: 00:18:01
Exception in thread "Thread-1" Exception in thread "main" java.lang.AssertionError: i: 0, j: 0, columnDelta: -58, rowFactor: -232
    at com.milaboratory.core.alignment.BandedMatrix.set(BandedMatrix.java:70)
    at com.milaboratory.core.alignment.BandedLinearAligner.alignRightAdded0(BandedLinearAligner.java:144)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.alignLinearSeq1FromLeft(FullSeqAssembler.java:752)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.buildClone(FullSeqAssembler.java:680)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.lambda$callVariants$0(FullSeqAssembler.java:221)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1494)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:550)
    at java.base/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
    at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:450)
    at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.callVariants(FullSeqAssembler.java:222)
    at com.milaboratory.mixcr.cli.ActionAssembleContigs.lambda$go$1(ActionAssembleContigs.java:104)
    at cc.redberry.pipe.blocks.ParallelProcessor$Worker.run(ParallelProcessor.java:304)
    at java.base/java.lang.Thread.run(Thread.java:844)
java.lang.AssertionError
    at com.milaboratory.mixcr.cli.ActionAssembleContigs.go(ActionAssembleContigs.java:113)
    at com.milaboratory.cli.JCommanderBasedMain.main(JCommanderBasedMain.java:155)
    at com.milaboratory.mixcr.cli.Main.main(Main.java:131)
dbolotin commented 6 years ago

Can you share the datasets causing the error, so I can reproduce the error? You can write an email to bolotin.dmitriy at gmail.com. Dropbox / google drive? Or I can spin up sftp server for you (please post or email your id_rsa.pub if this is preferable)?

ot0tot commented 6 years ago

I emailed you a link to the files the other day. Please let me know if you did not receive it.

Thanks!

dbolotin commented 6 years ago

Hi, yes, I received the link, looking into it right now.

dbolotin commented 6 years ago

I fixed both of the problems.

Please try this build: http://files.milaboratory.com/mixcr/mixcr-220j.zip

Thanks for reporting!

dbolotin commented 6 years ago

Hi, sorry! Seems that I found one more bug. Please wait till I fix it.

dbolotin commented 6 years ago

Please try this one: http://files.milaboratory.com/mixcr/mixcr-220k.zip

ot0tot commented 6 years ago

MiXCR 220k is working well with no errors so far. Thanks so much for the quick response and bug fixes!

One thing I have noticed is that the assembleContigs step is very slow (specifically the Writing clones part) using the default options outlined above.

For example, this mixcr assembleContigs run took ~18 hours to complete:

============= Report ==============
Initial clonotype count: 308267
Final clonotype count: 308267 (100%)
Longest contig length: 402
Clustered variants: 0 (0%)
Reads in clustered variants: 0.0 (0%)
Reads in divided (newly created) clones: 0.0 (0%)

Is this expected behavior? Running on an i7-3770 CPU @ 3.40GHz, 16GB RAM.

dbolotin commented 6 years ago

Yes I also noticed it. I will fix it today or tomorrow.

dbolotin commented 6 years ago

Hi!

Thanks for the raw data! Now everything seems to be working properly.

http://files.milaboratory.com/mixcr/mixcr-220l.zip

With this version you have to reanalyse all the data from scratch, all file formats changed.

ot0tot commented 6 years ago

Thanks, the new version is working great and is much faster!

However, I am still running into the same issue. If I export alignments using exportClonesPretty, I get the in-frame AA translation displayed along with the coding nt sequence aligned with the germline gene segments. How can I export just the AA translation for each clone? If I use -aaFeature, much of the sequence is missing due to not having a complete FR region.

Also, is it possible to use these alignments for a phylogenetic analysis or mapping of Ab lineage? It seems like the clns files contain enough information to perform such analysis, but the data format is incompatible with any tools I have found so far. Are you aware of any options?

Thanks again for the quick responses and all of the hard work!

ot0tot commented 6 years ago

Mixcr v2.20l is working great, but I am still having difficulties outputting the full AA sequences of identified clones.

For example, here is some output from exportClonesPretty:

>>> Clone id: x
>>> Abundance, reads (fraction): x (0.xxx)

                        L2><FR1                  
              _ A  A  A  Q  P  A   M  A  Q  S    
    Quality   233233111141314411 4111111155115   
    Target0 0 GGCTGCTGCCCAGCCGGC-GATGGCCCAGTCG 30  Score (hit score)
IGHV1S44*01 0              cAgTcG              5   -28 (432)

                                                                        FR1><CDR1                 
               _ E  S  G  G  R  L  V  T  P  G  T  P  L  T  L  T  C  T  V  S  G  I  D  L  S  S     
    Quality    54554521346523335433415255545400050556553554555565533536344556632552253564243123   
    Target1  0 GGAGTCCGGGGGTCGCCTGGTCACGCCTGGGACACCCCTGACACTCACCTGCACAGTCTCTGGAATCGACCTCAGTAGCT 79  Score (hit score)
IGHV1S44*01 11 ggagtccgggggtcgcctggtAacgcctggAGGaTccctgacactcacctgcacagtctctggaatcgacctcaCtagct 90  450 (432)

                CDR1><FR2                                        
               Y  W   M  S  W  V  R  Q  A  P  G  K  G  L  E _    
    Quality    2232 24542234554455555552645555644658735499924    
    Target1 80 ACTG-GATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAAT 124  Score (hit score)
IGHV1S44*01 91 a-tgCAatgGgctgggtccgccaggctccagggaaggggctggaat 135  450 (432)

                                                                                        V>          
                                                                                 FR3><CDR3          
                 K  T  S  S  T  T  V  D  L  K  M  T  S  L  T  T  E  D  T  A  T  Y  F  C  A  R  H    
    Quality     43422545545545544543255555423345545644555545545654555555566545354454211111111111    
    Target2   0 AAAACCTCGTCGACCACGGTGGATCTGAAAATGACCAGTCTGACAACCGAGGACACGGCCACCTATTTCTGTGCCAGACA 79   Score (hit score)
IGHV1S44*01 207 aaaacctcgtcgaccacggtggatctgaaaatgaccagtctgacaaccgaggacacggccacctatttctgtgc       280  370 (432)

                    <D    D><J CDR3><FR4                            
                N  S  A  Y  D  L  W  G  Q  G  T  L  V  T  I  S _    
   Quality    11111111111111111111115666446564566556455560454405    
   Target2 80 TAATAGTGCTTATGACTTGTGGGGCCAAGGCACCCTGGTCACCATCTCCT 129  Score (hit score)
IGHD6-1*01 51       tgcttatg                                     58   40 (40)
IGHD1-1*01 45    tagtgGttat                                      54   34 (34)
  IGHJ4*01 29               acttgtggggccCaggcaccctggtcaccGtctcct 64   148 (123)

I would like to export only the AA sequence (the line above "Quality"), resulting in a file like:

>>> Clone id: x         Abundance x (0.xxx)
>AAAQPAMAQS
>ESGGRLVTPGTPLTLTCTVSGIDLSSYWMSWVRQAPGKGLE
>KTSSTTVDLKMTSLTTEDTATYFCARHNSAYDLWGQGTLVTIS

Using something like mixcr exportClones -count -aaFeature 'FR1+CDR1+FR2+CDR2FR3+CDR3+FR4' clones.clns clones.txt with the above would only output AA sequence for CDR1 and CDR3, as they are the only complete features.

I can write a script to parse the output from exportClonesPretty, but I'm sure there is a more simple and elegant solution.

PoslavskySV commented 6 years ago

@ot0tot thanks for the important feedback! you are right, there is still no simple way to export full length for each clone except exportClonesPretty, and we are now implementing this; will let you know when finish.

dbolotin commented 6 years ago

I just created a dedicated issue for this feature, please let us know your thoughts in the comments: #389. At least vote for one of the options.

kevin199011 commented 6 years ago

Hi, I was using mixcr-220l, trying to get the full length sequence based on alignments. However, when I'm running mixcr assembleContigs clones.clna clones.clns

it gives me below errors: Assembling: 0% Exception in thread "main" java.lang.RuntimeException: While processing clone #0 at com.milaboratory.mixcr.cli.ActionAssembleContigs.lambda$go$1(ActionAssembleContigs.java:107) at cc.redberry.pipe.blocks.ParallelProcessor$Worker.run(ParallelProcessor.java:304) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at com.milaboratory.primitivio.PrimitivI.readObject(PrimitivI.java:104) at com.milaboratory.mixcr.basictypes.IO$VDJCAlignmentsSerializer.read(IO.java:99) at com.milaboratory.mixcr.basictypes.IO$VDJCAlignmentsSerializer.read(IO.java:81) at com.milaboratory.primitivio.PrimitivI.readObject(PrimitivI.java:87) at com.milaboratory.primitivio.PipeDataInputReader.take(PipeDataInputReader.java:29) at cc.redberry.pipe.CUtils$OPIterator.hasNext(CUtils.java:420) at com.milaboratory.mixcr.assembler.fullseq.FullSeqAssembler.calculateRawData(FullSeqAssembler.java:1002) at com.milaboratory.mixcr.cli.ActionAssembleContigs.lambda$go$1(ActionAssembleContigs.java:80) ... 2 more

The report of previous assemble step Analysis date: Thu Aug 16 17:36:23 CDT 2018 Input file(s): tcrb_alignment.vdjca Output file(s): clones.clna Version: 2.2-SNAPSHOT; built=Wed Apr 18 02:19:31 CDT 2018; rev=e40d713; lib=repseqio.v1.5 Command line arguments: assemble -r report.txt -f -a tcrb_alignment.vdjca clones.clna Analysis time: 42.30m Final clonotype count: 5506 Average number of reads per clonotype: 1819.22 Reads used in clonotypes, percent of total: 10016603 (41.49%) Reads used in clonotypes before clustering, percent of total: 12553677 (52%) Number of reads used as a core, percent of used: 9293879 (74.03%) Mapped low quality reads, percent of used: 3259798 (25.97%) Reads clustered in PCR error correction, percent of used: 2537074 (20.21%) Reads pre-clustered due to the similar VJC-lists, percent of used: 0 (0%) Reads dropped due to the lack of a clone sequence: 195075 (0.81%) Reads dropped due to low quality: 152835 (0.63%) Reads dropped due to failed mapping: 6940202 (28.75%) Reads dropped with low quality clones: 0 (0%) Clonotypes eliminated by PCR error correction: 20822 Clonotypes dropped as low quality: 0 Clonotypes pre-clustered due to the similar VJC-lists: 0

Could you let me know what's the issue? Thank you!

Kevin

dbolotin commented 6 years ago

Please try latest build, we fixed several IO errors since 220l: https://s3.amazonaws.com/files.milaboratory.com/mixcr/mixcr-220m.zip