Closed rspreafico closed 8 years ago
Hello and thanks!
First, I've added an ImmunoseqV3 parser in the develop format. I'm unsure whether the latest Immunoseq data is parseable with standard Immunoseq parser. The develop version has also a completely re-written CalcCdrAaProfile and Annotate routines, but the docs are only those provided in the console with -h
argument.
VJ-junc option (as well as "NDN size" in basic statistics) are always computed for NTs/AAs between V segment end and J segment start, no matter whether D segment was identified or not. The V-D and D-J (and "insert size" in basic statistics) are only computed for clonotypes with identified D segment, and those without D segment are skipped.
Thank you! This is very helpful!
I downloaded 1.1.2-SNAPSHOT and noticed that there is an ImmunoSeqV3 format now. I tried the previous v1.1.1. with ImmunoSeq format, and the new one with both ImmunoSeq and ImmunoSeqV3 format. The ImmunoSeqV3 does not parse my V3 files. The ImmunoSeq format (for both 1.1.1 and 1.1.2-SNAPSHOT) does parse my V3 files, however many clonotypes get discarded.
[Mon Sep 19 17:20:06 PDT 2016 Convert] Reading sample(s)
[Mon Sep 19 17:20:06 PDT 2016 Convert] 12 sample(s) loaded
[Mon Sep 19 17:20:06 PDT 2016 SampleStreamConnection] Loading sample 4651_Day_0
[WARNING] Some of the essential fields are bad/missing for the following clonotype string (displaying first 5 warnings)
NO_CDR3AA,NO_V:
TTCACATCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATTTCTGTGCCAGCAGCCCAGGGGAACACTGAAGCTTTCTTTGGACAA 138 9.228732119331519E-4 37 TCRBV03 TCRBV03 TCRBV03-01,TCRBV03-02 TCRBD01-01*01 TCRBD01 TCRBD01-01 01 TCRBJ01-01*01 TCRBJ01 TCRBJ01-01 01 4 0 4 3 0 1 44 -1 57 -1 62 138 Out VDJ null null null null null null null null null null null null
NO_V:
TTGGAGTCGGCTGCTCCCTCCCAAACATCTGTGTACTTCTGTGCCAGCAGAAAAGCTGGCGCCTACAATGAGCAGTTCTTCGGGCCA CASRKAGAYNEQFF 45 3.0093691693472343E-4 42 TCRBV06 TCRBV06 TCRBV06-02,TCRBV06-03 TCRBD01,TCRBD02 TCRBD01-01,TCRBD02-01 TCRBJ02-01*01 TCRBJ02 TCRBJ02-01 01 6 7 9 0 1 2 39 50 57 60 61 45 In VDJ null null null null null null null null null null null null
NO_CDR3AA:
TCATCCTGAGTTCTAAGAAGCTCCTTCTCAGTGACTCTGGCTTCTATCTCTGTGCCTGGCCAGAATGAAAAACTGTTTTTTGGCAGT 7 4.6812409300956976E-5 31 TCRBV30-01*01 TCRBV30 TCRBV30-01 01 TCRBD01-01*01 TCRBD01 TCRBD01-01 01 TCRBJ01-04*01 TCRBJ01 TCRBJ01-04 01 5 1 4 5 0 5 50 59 60 -1 63 7 Out VDJ null null null null null null null null null null null null
NO_V:
GTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTCCGGCAGGTCGCATGAACACTGAAGCTTTCTTTGGACAA CSAPAGRMNTEAFF 54 3.6112430032166814E-4 42 TCRBV20 TCRBV20 TCRBV20-01,TCRBV20-or09_02 TCRBD01-01*01 TCRBD01 TCRBD01-01 01 TCRBJ01-01*01 TCRBJ01 TCRBJ01-01 01 5 4 4 4 5 0 39 48 52 56 61 54 In VDJ null null null null null null null null null null null null
NO_CDR3AA:
TCCTGAGTTCTAAGAAGCTCCTTCTCAGTGACTCTGGCTTCTATCTCTGTGCCTGGAGGCCCCCGCCTCGAGCAGTACTTCGGGCCG 5 3.343743521496927E-5 34 TCRBV30-01*01 TCRBV30 TCRBV30-01 01 TCRBD01,TCRBD02 TCRBD01-01,TCRBD02-01 TCRBJ02-07*01 TCRBJ02 TCRBJ02-07 01 3 0 10 0 8 6 47 -1 58 60 68 5 Out VDJ null null null null null null null null null null null null
[Mon Sep 19 17:20:09 PDT 2016 ClonotypeStreamParser] Finished parsing. 1 header and 47733 bad line(s) were skipped.
[Mon Sep 19 17:20:09 PDT 2016 SampleStreamConnection] Loaded sample 4651_Day_0 with 85785 clonotypes and 96656 cells. Memory usage: 0 of 1 GB
[Mon Sep 19 17:20:09 PDT 2016 Convert] Processed 1 sample(s).. Writing output
Funny enough, if I read the original TSV files and rewrite them out through R, and then feed those to vdjtools, no warnings anymore:
[Mon Sep 19 17:19:34 PDT 2016 Convert] Reading sample(s)
[Mon Sep 19 17:19:34 PDT 2016 Convert] 12 sample(s) loaded
[Mon Sep 19 17:19:34 PDT 2016 SampleStreamConnection] Loading sample 4651_Day_0
[Mon Sep 19 17:19:37 PDT 2016 ClonotypeStreamParser] Finished parsing. 1 header and 0 bad line(s) were skipped.
[Mon Sep 19 17:19:37 PDT 2016 SampleStreamConnection] Loaded sample 4651_Day_0 with 111757 clonotypes and 149533 cells. Memory usage: 0 of 1 GB
[Mon Sep 19 17:19:37 PDT 2016 Convert] Processed 1 sample(s).. Writing output
However, some clonotypes get still silently discarded by the parser, even in the absence of errors. For example, for this file, there are originally 134344 clonotypes. WIth the first method (ImmunoSeq format, gives some errors) I get 85786, with the second method (R, then ImmunoSeq format, no explicit errors) I get 111758. With ImmunoSeqV3, it won't parse at all.
Also, I noticed that the new AA annotation function does not allow to select the bin size of regions anymore.
Thanks,
Roberto
PS: the code to import/export from R that I used is the following:
library(tidyverse)
input <- list()
for(f in list.files(path=".", pattern="*tsv")) input[[f]] <- read_tsv(f)
for(i in names(input)) input[[i]] %>% write_tsv(path = i)
I have to take back the statement that after reading/writing the data with R, there are missing clonotypes in the absence of errors. I hadn't realized that vdjtools internally consolidates some clonotypes - the count column total is identical for input and output of vdjtools. So it appears that having the data go through R fixes the (v2) parser errors. All R does is inserting NAs where empty cells originally were.
Hi,
congrats on a great piece of software!
I have TCRB data (so VDJ recombination) from ImmunoSEQ. I get a lot of skipped lines when I try to use CalcCdrAaProfile with VD-junc, D-germ and DJ-junc. If I use VJ-junc instead, I don't get any skipped lines. There is a warning in the documentation that ImmunoSEQ data sometimes lacks the full J segment, but this seems to be different, more related to D segments.
From the documentation, it is not clear to me what the behavior of VJ-junc is in the presence of VDJ recombination. Does it still bin the whole V-D-J junction region, or does it take into account only the portion of DNA coming from the V and J segments (i.e. disregarding DNA coming from the D segment)? Would it be possible to edit the documentation to make this point clear?
Thank you, Roberto