milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com
Other
323 stars 78 forks source link

Export all aligned reads #303

Closed DanteBortone closed 6 years ago

DanteBortone commented 6 years ago

Hi,

Is there a way to export all aligned reads from the vdjca file without needing to specify all of the clone ids or index file?

Thanks, Dante

suntaosimon commented 6 years ago

mixcr exportAlignments

dbolotin commented 6 years ago

@DanteBortone, for now it is not possible. I consider it a feature request, we'll include this functionality in 2.2. Could you please describe in more details your use-case.

DanteBortone commented 6 years ago

Hi Dmitry,

If someone wants to get all of the reads that were aligned, they must currently have the index_file and the clone ids, grab all of the clone ids and input all of them in the cloneid argument of exportReadsForClones. Then they would get a fastq file for each clone and need to concatentate all of the fastqs. They could also output the unaligned reads and subtract them from the original fastq. If all someone wants is every fastq that was aligned this is a bit cumbersome. If someone doesn't have the index_file, clone ids or unaligned reads then some MiXCR steps would need to be repeated. All of the fastqs are in the vdjca (assuming --save_reads was used), so it shouldn't be impossible to grab them directly from the vdjca without using index_files or cloneids.

It could be as easy as: MiXCR exportReadsForClones alignments.vdjca.gz all reads.fastq.gz

We are finding that our diversity metrics are strongly affected by the number of total reads. Algorithms that are meant to adjust diversity metrics for the number of observed individuals haven't worked well in our hands. I think this in large part because more reads provides MiXCR with more information to define new clone species. In this case fewer reads doesn't just mean lower clone counts (i.e. fewer individuals), it also means fewer species. Diversity corection algorithms that I've seen don't take this into account. A correction that has worked well for us is subsetting the aligned reads, rerunning MiXCR and then recalculating the diversity metrics. The change in the diversity metrics over different read counts is surprisingly linear and can be used to predict the diversity over a wide range of dilutions.

Reapplying this to all of our previously run data could be a bit of a mess so I was hoping for an easy way to grab the aligned reads from the vdjca.

Thank you for considering this as a new feature.

Best, Dante

Edited to correct name spelling. My apologies. I previously went by how it was spelled on your nature methods paper.

mikessh commented 6 years ago

We are finding that our diversity metrics are strongly affected by the number of total reads. Algorithms that are meant to adjust diversity metrics for the number of observed individuals haven't worked well in our hands. I think this in large part because more reads provides MiXCR with more information to define new clone species. In this case fewer reads doesn't just mean lower clone counts (i.e. fewer individuals), it also means fewer species. Diversity corection algorithms that I've seen don't take this into account. A correction that has worked well for us is subsetting the aligned reads, rerunning MiXCR and then recalculating the diversity metrics.

Ideally diversity metrics should be computed when all samples are normalized to the same number of reads, these metrics should be bootstrapped to get mean/sd of the diversity. This is a commonly used strategy for computing the repertoire diversity, as it is very hard to model the actual frequency distribution (see this work by Thierry Mora and Aleksandra Walczak).

The change in the diversity metrics over different read counts is surprisingly linear and can be used to predict the diversity over a wide range of dilutions.

This means that your samples contain a huge portion of naive T-cells, the number of observed naive T-cell species should increase almost linearly for possible sampling depth range as the real diversity of these species is close to the total number of T-cells in an organism (again see Mora and Walczak 2016).

PS. Its still hard to understand what is the problem with current setup - you can simply take the output of exportClones and sum the count field to get the total number of reads. The number of unaligned reads also accounts PhiX reads, artefacts, etc, so why would it matter?

DanteBortone commented 6 years ago

Thank you for your input and the Mora and Walczak reference. Seems like a great resource for what is giving us troubles at the moment.

PS. Its still hard to understand what is the problem with current setup - you can simply take the output of exportClones and sum the count field to get the total number of reads. The number of unaligned reads also accounts PhiX reads, etc, so why would it matter?

I was looking for an easy way to get a fastq with the aligned reads not just the counts. This can be done with my current setup. I was just hoping I had overlooked an easier way to get the fastqs with all of the aligned reads without having to either specify all of the clones ids or subtract out the unaligned reads.

dbolotin commented 6 years ago

As Mike said, you definitely need to normalise (downsample) the data to calculate any diversity measures. If it is your final goal, doing it through going back to aligned raw reads seems to be an overkill, downsampling can be easily performed on the final list of clonotypes, by resampling their read counts and removing zeros using e.g. multinomial distribution (see scipy or R docs).

Anyway, we will include exportReads action in the next release, that will extract reads directly from .vdjca file.

DanteBortone commented 6 years ago

Thank you for adding this as a new feature. It will streamline things quite a bit for me.

Jackknifing the clone counts hasn't adequately adjusted for lower reads counts in our hands. The attached graph shows the entropy of the same sample which was subset by either initial reads (red points) or post-MiXCR clone count (black with boxwhisker and individual subsampling results, n=100). 'Observations' on the x axis refers to read/clone counts.

Compared to read subsetting, subsampling clone counts always overestimates the diversity. Zero-count clones would not be included in these measurements.

My naive explanation for this is that MiXCR might be more likely to categorize a read as a new species (rather than a pcr or sequencing error of an existing species) if it has more reads to support the creation of that new species.

Shannon_Entropy_by_Jackknife.pdf

mikessh commented 6 years ago

Hi, not sure what you mean by jackknifing. If I'm not mistaking, it means computing the statistic by leaving out a single observation.

Can you try the exact downsampling from VDJtools? The one performed in CalcDiversityStats module described here: http://vdjtools-doc.readthedocs.io/en/latest/diversity.html#calcdiversitystats. You can set the number of reads with --downsample-to option and the number of trials with --resample-trials option. This will run exact resampling, i.e. random selection of N reads from a flat table of all reads in your sample.


The other thing here is that MIXCR performs error correction.

For example say we are at 10^6 reads and have a clonotype with 100 reads and 100 erroneous subvariants with a size of 1 read, all differ by single substitution and are unique. In this setup frequency-based correction with a 1:20 parent:to:child ration will remove/aggregate them.

Next, imagine we've downsampled 10x: we now have 10^5 reads, our clonotype has 10 reads and we have 10 errors each represented by a single read. In this case none of the variants will be removed under 1:20 threshold.

Of course the situation gets more complex as the sequencing quality is considered, but the idea is the same.

DanteBortone commented 6 years ago

Sorry, I was using the term incorrectly. 'Downsampling' properly describes what was done on the both the reads and the clone counts.

I will the VDJ tools a try.

DanteBortone commented 6 years ago

VDJtools shows the same thing. Exact downsampling by clone counts (I don't think VDJtools is capable of looking at the reads directly) always provides substantially higher diversity estimates than rerunning MiXCR on downsampled fastqs. It may seem like overkill, but if it's what we have to do to normalize for read counts then it's what we have to do.

We are currently simulating TCR repertoires and reads in which the ground truth is known. Analyzing repertoires in which we know the true result seems like the only real way to compare normalization approaches.

Thank you, Mike and Dmitry, for your time on this issue. The discussion has been very helpful for me.

InesdeSantiago commented 6 years ago

Hello. I am using the latest version of MixCR (2.10) to align the reads and then downsample them I align to IGH chain only for now and get the following report:

Command line arguments: align --save-description --save-reads -r align_log.txt -OallowPartialAlignments=true -t 10 -p rna-seq -s hsa -c IGH L001_1.fastq L001_2.fastq IGH.alignments.vdjca Analysis time: 11.70m Total sequencing reads: 56570555 Successfully aligned reads: 8547 (0.02%) Alignment failed, no hits (not TCR/IG?): 56517931 (99.91%) Alignment failed because of absence of CDR3 parts: 18023 (0.03%) Alignment failed because of low total score: 26054 (0.05%) Overlapped: 284241 (0.5%) Overlapped and aligned: 59 (0%) Alignment-aided overlaps: 39 (66.1%) Overlapped and not aligned: 284182 (0.5%) IGH chains: 8547 (100%)

Then I down-sample the reads by parsing the alignments.txt file (after exportAlignments) and I take only 3500 reads, I re-run MixCR aligner hoping to have 100% alignment rate but I only get this:

Command line arguments: align -OallowPartialAlignments=true -p rna-seq -r align_log.txt -t 10 -s hsa -c IGH IGH.3500_pe1.downsampled.fq.gz IGH.3500_pe2.downsampled.fq.gz success_IGH.alignments.vdjca Analysis time: 1.56s Total sequencing reads: 3500 Successfully aligned reads: 459 (13.11%) Alignment failed, no hits (not TCR/IG?): 8 (0.23%) Alignment failed because of absence of CDR3 parts: 972 (27.77%) Alignment failed because of low total score: 2061 (58.89%) Overlapped: 0 (0%) Overlapped and aligned: 0 (0%) Alignment-aided overlaps: 0 (?%) Overlapped and not aligned: 0 (0%) IGH chains: 459 (100%)

I am not sure why I have so many failed alignment? What are the parameter I can tune to prevent failed alignments. I noticed that by setting -OminSumScore=0 I can recover all reads that 'failed' due to total low score. But not sure if it is ok to do this? will I meed up the assembly.. and what about the CDR3 failed reads?

PoslavskySV commented 6 years ago

@InesdeSantiago could you please paste exact commands you used? For align, exportAlignments and for the command you use to pick 3500 reads. That will help to understand what is going on.

InesdeSantiago commented 6 years ago

@PoslavskySV sure, here it is:

1st alignment: mixcr align --save-description --save-reads -r first.align_log.txt -OallowPartialAlignments=true -t 10 -p rna-seq -s hsa -c IGH f1.fastq f2.fastq first.alignments.vdjca

export alignments: mixcr exportAlignments -readId -sequence -quality -descrR1 -descrR2 first.alignments.vdjca first.alignments.txt

**pick 3500 reads*** python align2fastq.py --input first.alignments.txt --nreads 3500 --skip 1000 --seed 1 With this command, I generate two fastq files (pair 1 and pair2) and then I choose randomly 3500 lines (--nreads) from. I choose to ignore the first 1000 lines (--skip). The --seed 1 part is to make sure I use the same random seed to keep pairing of fastq files. My Python script is here: https://github.com/InesdeSantiago/align_parser/blob/master/mixcr_align_txt_2_fastq_pairedEnd.py

2nd alignment mixcr align -OallowPartialAlignments=true -p rna-seq -r second.align_log.txt -t 10 -s hsa -c IGH first.3500_pe1.downsampled.fq.gz first.3500_pe2.downsampled.fq.gz second.alignments.vdjca

PoslavskySV commented 6 years ago

@InesdeSantiago I think I know where the problem is; in order to confirm, could you please re-run your 2nd alignment with "-OreadsLayout=Unknown" option and paste the summary here?

InesdeSantiago commented 6 years ago

@PoslavskySV Wow , that seemed to work quite well:

mixcr align -OallowPartialAlignments=true -p rna-seq -r second.align_log.txt -t 10 -s hsa -c IGH -OreadsLayout=Unknown first.3500_pe1.downsampled.fq.gz first.3500_pe2.downsampled.fq.gz second.alignments.vdjca

============= Report ============== Analysis time: 9.83s Total sequencing reads: 3500 Successfully aligned reads: 3478 (99.37%) Alignment failed, no hits (not TCR/IG?): 1 (0.03%) Alignment failed because of absence of CDR3 parts: 7 (0.2%) Alignment failed because of low total score: 14 (0.4%) Overlapped: 1 (0.03%) Overlapped and aligned: 1 (0.03%) Alignment-aided overlaps: 1 (100%) Overlapped and not aligned: 0 (0%) IGH chains: 3478 (100%)

I Will use that option.. Does it mean that my Python parser is not orienting the reads correctly? If i fix python, then it should work? Thanks!! ines

PoslavskySV commented 6 years ago

@InesdeSantiago that is our omission: -sequence in exportAlignments actually exports not initial reads, but what is called targets in MiXCR (generally, collinear paired reads, but may be single read which is a merged R1+R2 etc.) --- this is not documented and the name -sequence is bad for this thing (will fix this in 2.2). Anyway, to export exactly the original reads (as in initial .fastq files) from .vdjca you need to use

mixcr exportReads input.vdjca R1.fastq R2.fastq

(don't use exportAlignments -sequence)

SalehTavil commented 6 years ago

Hello everybody, I am using MIXCR to analyse my TCR-sequencing data. I have a question: Is it possibly to randomly normalize my data (e.g. 10.000 reads) after I assembled my clones? Best wishes

Saleh Tavil

dbolotin commented 6 years ago

MiXCR has no option to downsample datasets. What can be done (and required for the task) is to export all reads that were assembled into clonotypes. This feature is not released yet, and requires pre-release MiXCR 2.2, for which I cant guarantee stability for now, however you can set up a pipeline with pre-release version and then substitute mixcr with new release version.

The pipeline is the following:

  1. Align sequences according to the data type you have with --save-reads option (as with old mixcr):

    mixcr align -r report.txt ... reads_R1.fastq.gz reads_R2.fastq.gz alignments.vdjca
  2. Assemble clonotypes with -a option, to output result in a clna format (containing clones and corresponding alignments; basically this file contains all information from the vdjca, so it will be even bigger):

    mixcr assemble -r report.txt -a alignments.vdjca clones.clna
  3. Export reads with exportReadsForClones action. Please see help for the command with mixcr exportReadsForClones -h.

    mixcr exportReadsForClones clones.clna reads_from_clones.fastq

    this will export all the reads assigned to one of the clonotypes.

You can then downsample the fastq file and you will get the result you want.

Here is the link to the lates pre-release mixcr 2.2: https://s3.amazonaws.com/files.milaboratory.com/mixcr/mixcr-220n.zip

P.S. Please don't comment on the closed issues, better open new one.

Best.

sheenaseven commented 3 years ago

Hello. I am using the version of MixCR (3.0.13) to align the reads. I got the report using (mixcr align -p kAligner2 -s HomoSapiens 1.fastq.gz 2.fastq.gz align.vdjca), Analysis time: 32.88m Total sequencing reads: 24909711 Successfully aligned reads: 24437628 (98.1%) Paired-end alignment conflicts eliminated: 305432 (1.23%) Alignment failed, no hits (not TCR/IG?): 133112 (0.53%) Alignment failed because of absence of V hits: 6876 (0.03%) Alignment failed because of absence of J hits: 185038 (0.74%) No target with both V and J alignments: 147057 (0.59%) Overlapped: 24331081 (97.68%) Overlapped and aligned: 24047416 (96.54%) Alignment-aided overlaps: 716006 (2.98%) Overlapped and not aligned: 283665 (1.14%) V gene chimeras: 16463 (0.07%) J gene chimeras: 788 (0%) TRA chains: 1 (0%) TRB chains: 24437627 (100%) Realigned with forced non-floating bound: 2589272 (10.39%) Realigned with forced non-floating right bound in left read: 191447 (0.77%) Realigned with forced non-floating left bound in right read: 191447 (0.77%)

  1. But I checked the 1.fastq with (echo $(cat 1.fastq.gz|wc -l)/4|bc =702330; echo $(cat 2.fastq.gz|wc -l)/4|bc =666518;) totally we have 1368848 reads. But the report showed 24909711 reads far larger than the actual 1368848. So what's the meaning of Total sequencing reads on the alignment module???'

  2. Then I used (mixcr exportAlignments align.vdjca align.txt) to better understand the align details. For my output, I saw

ATGCAAGCCTGACCTTGTCCACTCTGACAGTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTACTGTAGCGGGAGGGCAGTCTGAGCAGTTCTTCGGGCCAGGGACACGGCTCACCGTGCTAGGTA (141 bases)

but my fastq showed each read only has 100 bases. Why I got this results? Hope to receive your constructive comments and instructions.

Thank! image

PoslavskySV commented 3 years ago

@sheenaseven you should use zcat instead of cat for gzipped fastq data:

zcat 1.fastq.gz | wc -l

On OS X use

cat 1.fastq.gz | zcat
sheenaseven commented 3 years ago

@PoslavskySV Really appreciate your reply. I changed the "cat" to "zcat". Finally, I received 26702072 reads (1.fastq.gz and 2.fastq.gz), but totally I got Total sequencing reads: 24909711 according to mixcr. Why the numbers are different?

PoslavskySV commented 3 years ago

Regarding

but my fastq showed each read only has 100 bases.

this is because R1 and R2 were overlapped (according to report almost all of your reads overlap).

Regarding total number of reads, this is strange. Could you paste exact command you use to check the number of reads?

sheenaseven commented 3 years ago

@PoslavskySV I think the question has already been solved since the total sequences equal individual R1 or R2 sequences. Also, I have a question about the vdjca file, do you have ideas to visualize the vdjca file to see the alignment result? Or do you have suggestions on how to change the vajca file to bam file? I want to directly check the bam file to see the TCR alignment result. image

PoslavskySV commented 3 years ago

@sheenaseven Regarding visualization of alignments, you can use exportAlignmentsPretty, see https://mixcr.readthedocs.io/en/master/export.html#exporting-well-formatted-alignments-for-manual-inspection

sheenaseven commented 3 years ago

Thank you @PoslavskySV. For the exportAlignmentsPretty (--limit 10), how are these 10 read ids selected? Did mixcr choose the top ten reads with the largest alignment score in the first 1000? Is there an order for the 1000? image

PoslavskySV commented 3 years ago

Did mixcr choose the top ten reads with the largest alignment score in the first 1000?

no, just sequential 10 alignments

sheenaseven commented 3 years ago

Yes, how could I show all the vdjca rows (ie. how could I know how many read ids in vdjca file)? I run the code (mixcr exportAlignmentsPretty --verbose --skip 1000 --limit 10 input.vdjca test.txt), but my result only included sequence 0, not sequence 1. My reads are paired-end and each read includes 100 bases. I found the length of sequence 0 is 115, so the sequence 0 integrated the read 1 and 2 ? image

image