milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com
Other
325 stars 79 forks source link

Will sequence platform affect success alignment rate? #290

Closed B-Li closed 6 years ago

B-Li commented 6 years ago

Hi, I am working on a RNAseq data generated from Ion Torrent Proton platform. The length of reads varies from 20ish to 200ish. For this data I got 0% TCR/IG alignment. What is the reasonable alignment quality parameters should I use? Thank you for your help!

dbolotin commented 6 years ago

Sequencing platform may, of course, affect analysis performance. At least Ion Torrent data is single-end, and has different length distribution compared to Illumina HiSeq/MiSeq, not to mention different error profile (skewed toward indels), which should not be a problem, still RNASeq analysis is very sensitive to any data parameters, as it balances on the edge of specificity / selectivity tradeoffs.

Could you please post your alignment report (from mixcr align -p rna-seq ...) and read length and quality score distributions to get some feeling on how the data looks like. Are there any adapter sequences in the data?

B-Li commented 6 years ago

Thank you for your reply! My code of mixcr is mixcr align -p rna-seq -s hsa -OallowPartialAlignments=true. The output was Alignment: 0% Alignment: 10.1% ETA: 00:18:01 Alignment: 20.1% ETA: 00:15:00 Alignment: 30.2% ETA: 00:12:29 Alignment: 40.2% ETA: 00:10:43 Alignment: 50.3% ETA: 00:09:38 Alignment: 60% ETA: 00:08:12 Alignment: 69.7% ETA: 00:06:16 Alignment: 79.7% ETA: 00:03:57 Alignment: 89.7% ETA: 00:01:53 Alignment: 99.8% ETA: 00:00:02 ============= Report ============== Analysis time: 19.11m Total sequencing reads: 14340091 Successfully aligned reads: 39 (0%) Alignment failed, no hits (not TCR/IG?): 14242355 (99.32%) Alignment failed because of absence of CDR3 parts: 5254 (0.04%) Alignment failed because of low total score: 92443 (0.64%) Overlapped: 0 (0%) Overlapped and aligned: 0 (0%) Alignment-aided overlaps: 0 (�%) Overlapped and not aligned: 0 (0%) sequence quality looks like: quality read length looks like: 2 I downloaded this toy data from SRA, and fastq-dump it. I don't know how to check whether adapter sequence was included in the fastaq files. Thank you very much for your help! :-)

mikessh commented 6 years ago

Well, there can simply be too few TCR/IG mRNA (cells) in the sample. Where does the sample come from? How much T/B-cell-related gene expression do you see (CD3/19, TR/IG loci) when running, say, Kallisto software? You can actually check our latest Nat Biotech paper to see the recall when running on RNA-seq vs RepSeq data for B-cells.

B-Li commented 6 years ago

Well it can be the reason. However, the gene-based counts in log2 scale for CD3G and CD3D are 8.73 and 6.49, respectively. Similarly, expression levels for CD8A and B are 8.06 and 6.31. It seems to me that there should be certain amount of TCR present in the sequencing data. And the sample is derived from early stage lung adenocarcinoma which is thought as immunogenic.

dbolotin commented 6 years ago

Can you please also post assemblePartial and assemble reports. align report shows 39 successful alignments, so either neither of them contain CDR3 or can be assembled into CDR3-containing contigs or other filters (like quality) dropped some results.

B-Li commented 6 years ago

The assemblePartial report looks like this: Building index: 0% Searching for overlaps: 0% ============= Report ============== Analysis time: 407.00ms Total alignments analysed: 39 Number of output alignments: 39 (100%) Alignments already with CDR3 (no overlapping is performed): 21 (53.85%) Successfully overlapped alignments: 0 (0%) Left parts with too small N-region (failed to extract k-mer): 0 (0%) Extracted k-mer diversity: 0 Dropped due to wildcard in k-mer: 0 (0%) Dropped due to too short NRegion parts in overlap: 0 (0%) Dropped overlaps with empty N region due to no complete NDN coverage: 0 (0%) Number of left-side alignments: 0 (0%) Number of right-side alignments: 18 (46.15%) Complex overlaps: 0 (0%) Over-overlaps: 0 (0%) Partial alignments written to output: 18 (46.15%)

Another round of partial assemble gives the same report. The assemble report is listed below:

Initialization: progress unknown Writing clones: 0% ============= Report ============== Analysis time: 529.00ms Final clonotype count: 3 Average number of reads per clonotype: 3.33 Reads used in clonotypes, percent of total: 10 (0%) Reads used in clonotypes before clustering, percent of total: 10 (0%) Number of reads used as a core, percent of used: 10 (100%) Mapped low quality reads, percent of used: 0 (0%) Reads clustered in PCR error correction, percent of used: 0 (0%) Reads pre-clustered due to the similar VJC-lists, percent of used: 0 (0%) Reads dropped due to the lack of a clone sequence: 18 (0%) Reads dropped due to low quality: 0 (0%) Reads dropped due to failed mapping: 11 (0%) Reads dropped with low quality clones: 0 (0%) Clonotypes eliminated by PCR error correction: 0 Clonotypes dropped as low quality: 0 Clonotypes pre-clustered due to the similar VJC-lists: 0 Thank you again for your help! :-)

dbolotin commented 6 years ago

So, there seems to be 3 clones (formed from 10 reads). And 11 more reads dropped due to low quality (you can lower the quality threshold by e.g. -ObadQualityThreshold=10; see here).

Could you please provide full report with chains statistics and software version.