fkvh commented 9 years ago

Hi, Is it possible to make changes in pca.segments? I would like to change the colours in ggplot and also drop the sample names in the figure. Would you please, let me know how to fix it?

vadimnazarov commented 9 years ago

Hello! I just updated the package, use the devtools::install_github function to update it. I've added the .text argument to the pca.segments function. If you specify .text = F then no sample names will appear in the resulting plot.

To change colours of the plot you can just add them with ggplot2 functions, like that: pca.segments(your_data, .genes = HUMAN_TRBV, .text = F) + scale_colour_manual(values = c("green", "blue", "red", "yellow")) (I highly recommend you to use "ggplot2 cookbook" from here: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/ )

fkvh commented 9 years ago

Dear Vadim

Thanks for your kind attention. I am trying to update tar package but I receive the below error:

Warning message:

package 'tcR' is not available (as a binary package for R version 3.1.3)

install.packages("devtools")

Installing package into '/Users/fatemek_adm/Library/R/3.1/library'

(as 'lib' is unspecified)

trying URL 'http://ftp.acc.umu.se/mirror/CRAN/bin/macosx/contrib/3.1/devtools_1.8.0.tgz'

Content type 'application/x-tar' length 325742 bytes (318 KB)

opened URL

downloaded 318 KB

The downloaded binary packages are in

/var/folders/gg/xxsb43x50_d4bfwg5m4jc4c40000gp/T//RtmpPCbf7L/downloaded_packages

devtools::install_github("imminfo/tcR", build_vignettes = FALSE)

Downloading github repo imminfo/tcR@master

Installing tcR

'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ \

--no-save --no-restore CMD INSTALL \

'/private/var/folders/gg/xxsb43x50_d4bfwg5m4jc4c40000gp/T/RtmpPCbf7L/devtools13952306c9d2d/imminfo-tcr-cf4cad9' \

--library='/Users/fatemek_adm/Library/R/3.1/library' --install-tests

installing source package 'tcR' ...

\ libs

I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Users/fatemek_adm/Library/R/3.1/library/Rcpp/include" -c RcppExports.cpp -o RcppExports.o

/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory

make: [RcppExports.o] Error 127 (ignored)

I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Users/fatemek_adm/Library/R/3.1/library/Rcpp/include" -c neighbour.search.cpp -o neighbour.search.o

/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory

make: [neighbour.search.o] Error 127 (ignored)

-dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -L/usr/local/lib -o tcR.so RcppExports.o neighbour.search.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

/bin/sh: -dynamiclib: command not found

make: *\ [tcR.so] Error 127

ERROR: compilation failed for package 'tcR'

removing '/Users/fatemek_adm/Library/R/3.1/library/tcR'

Error: Command failed (1)?

Thanks again for all your help.

Best regards,

Fatemeh

Fatemeh Kaveh, PhD Medical Genetics Department Oslo University Hospital (Ullevål) Kirkeveien 166 0407 Oslo, Norway Tel: +47 23016421 E-Mail: fatemeh.kaveh@medisin.uio.no

From: Vadim Nazarov notifications@github.com Sent: 19 May 2015 16:47 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Hello! I just updated the package, use the devtools::install_github function to update it. I've added the .text argument to the pca.segments function. If you specify .text = F then no sample names will appear in the resulting plot.

To change colours of the plot you can just add them with ggplot2 functions, like that: pca.segments(your_data, .genes = HUMAN_TRBV, .text = F) + scale_colourmanual(values = c("green", "blue", "red", "yellow")) (I highly recommend you to use "ggplot2 cookbook" from here: http://www.cookbook-r.com/Graphs/Colors(ggplot2)/ )

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-103532322.

vadimnazarov commented 9 years ago

Hello Fatemeh!

It seems that gcc compiler for C++ language that is needed for building tcR is not installed on your computer. In order to install development versions of tcR you need to install it because some functions in tcR is written in C++. If you have a Ubuntu you could just type in the console something like this:

sudo apt-get install g++

Probably, more simple way in a short run is that if you just use alternative functions for PCA on genes that I just wrote that available from here: https://gist.github.com/vadimnazarov/3fc87945271a7eb486cf

I removed colours and set the .text parameter value to FALSE by default. I hope this will help. Feel free to ask any questions if you have any problems and bugs.

fkvh commented 9 years ago

Hello again

I am sorry for disturbing but I have a deadline.

First I installed gcc compiler for my Mac OXS 10.9 & looks that I have it but still I get the same error.

As I have to generate pca plot for my file I tried to run the command that you wrote in github:

data = immdata1

pcaGenes <- function(.data, .cast.freq.seg = T, ..., .text = F, .do.plot = T){

if (.cast.freq.seg) { .data <- geneUsage(.data, ...)[,-1] }

pca.res <- prcomp(t(as.matrix(.data)), ...)

if (.do.plot) {

pca.res <- data.frame(PC1 = pca.res$x[,1], PC2 = pca.res$x[,2], Subject = names(.data))

p <- ggplot() + geom_point(aes(x = PC1, y = PC2, colour = Subject), size = 3, data = pca.res)

if (.text) {

p <- p + geom_text(aes(x = PC1, y = PC2, label = Subject), data = pca.res, hjust=.5, vjust=-.3)

}

p + theme_linedraw() + guides(size=F) + ggtitle("Gene usage: Principal Components Analysis")

} else {

pca.res

}

} pcaGenes(immdata1) Error in pcaGenes(immdata1) : could not find function "geneUsage"

?Thanks again, Fatemeh

Fatemeh Kaveh, PhD Medical Genetics Department Oslo University Hospital (Ullevål) Kirkeveien 166 0407 Oslo, Norway Tel: +47 23016421 E-Mail: fatemeh.kaveh@medisin.uio.no

From: Vadim Nazarov notifications@github.com Sent: 20 May 2015 09:44 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Hello Fatemeh!

It seems that gcc compiler for C++ language that is needed for building tcR is not installed on your computer. In order to install development versions of tcR you need to install it because some functions in tcR is written in C++. If you have a Ubuntu you could just type in the console something like this:

sudo apt-get install g++

Probably, more simple way in a short run is that if you just use alternative functions for PCA on genes that I just wrote available from here: https://gist.github.com/vadimnazarov/3fc87945271a7eb486cf

I removed colours and set the .text parameter value to FALSE by default. I hope this will help. Feel free to ask any questions if you have any problems and bugs.

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-103797902.

vadimnazarov commented 9 years ago

Hi, no problem. First, probably you should check if tcR is loaded to your workspace (type library(tcR) and run pcaGenes again). If so and code is not working, then it seems that you are using the old version of the package. Try to install tcR from CRAN via install.packages("tcR") and run pcaGenes again on your data. If you couldn't update to the latest version then try to use this code:

https://gist.github.com/vadimnazarov/73060555c157ecdae9fe

I hope this will help, feel free to comment if something isn't working.

fkvh commented 9 years ago

Thanks for your help.

I finally installed the new version but I am wondering if you have made any changes into data frame? My data is not an output of MiTCR & I made changes to be loaded in tcR , it worked for older version but in new version I get error for parse.file. E.g:

immdata_ALD1 = parse.file("~/ALD_1_In_MiTCR.csv") Error in [.data.frame(df, , make.names(.reads)) : undefined columns selected

Best regards, Fatemeh

Fatemeh Kaveh, PhD Medical Genetics Department Oslo University Hospital (Ullevål) Kirkeveien 166 0407 Oslo, Norway Tel: +47 23016421 E-Mail: fatemeh.kaveh@medisin.uio.no

From: Vadim Nazarov notifications@github.com Sent: 20 May 2015 13:10 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Hi, no problem. First, probably you should check if tcR is loaded to your workspace (type library(tcR) and run pcaGenes again). If so and code is not working, then it seems that you are using the old version of the package. Try to install tcR from CRAN via install.packages("tcR") and run pcaGenes again on your data. If you couldn't update to the latest version then try to use this code:

https://gist.github.com/vadimnazarov/73060555c157ecdae9fe

I hope this will help.

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-103845951.

vadimnazarov commented 9 years ago

Yes, latest version is 2.0 and in that version I standardised the cloneset data frame structure, that (I hope) will stay persistent in the future because it contains full information about a cloneset in contrast to the previous cloneset data frame. I'm really very sorry for this inconvenience. Columns and their semantics are described in the manual here: https://imminfo.github.io/tcr/tcrvignette.html#cloneset-representation

If I understood you correctly (that your .csv file is equivalent to MiTCR output) you just need to add something to the first line in your .csv file (because MiTCR output's first line is not column names) and apply parse.file(yourfile, "mitcr") to your data.

fkvh commented 9 years ago

Dear Vadim,

I have tried to changed the format of my files into version 2 but still I get the error message as below.

I attach one of my samples that I have made it as the same format of your test files like twb / twa.

I really appreciate if you can help me to solve it. It looks like that its data frame has problem :

write.table(mydata, file = "ALD_1_In_MiTCR.csv", quote=F, sep="\t", row.names = FALSE, col.names=TRUE, qmethod = c("escape", "double"), fileEncoding = "")

immdata_ALD1 = parse.file("~/ALD_1_In_MiTCR.csv")

Error in [.data.frame(df, , make.names(.reads)) :

undefined columns selected?

Thanks again for your kind attention,

Fatemeh

Fatemeh Kaveh, PhD Medical Genetics Department Oslo University Hospital (Ullevål) Kirkeveien 166 0407 Oslo, Norway Tel: +47 23016421 E-Mail: fatemeh.kaveh@medisin.uio.no

From: Vadim Nazarov notifications@github.com Sent: 20 May 2015 14:33 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Yes, latest version is 2.0 and in that version I standardised the cloneset data frame structure, that (I hope) will stay persistent in the future because it contains full information about a cloneset in contrast to the previous cloneset data frame. I'm very sorry for this inconvenience. Columns and their semantics are described in the manual here: https://imminfo.github.io/tcr/tcrvignette.html#cloneset-representation

If I understood you correctly (that you .csv file is equivalent to MiTCR output) you just need to apply parse.file(yourfile, "mitcr") to your data.

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-103865614.

vadimnazarov commented 9 years ago

Hi Fatemeh,

could you apply a function str to your data and show me the output please? If you have a twb-like data frames, than you can just use it without reloading with the parse.file function. I think parse.file isn't working here because it tries to skip the first row, so it excepts input like that:

something-something MiTCR specific line Read.count Percentage CDR3.nucleotide.sequence ... 100 .05 TACGATCGATCGATCGATCT ...

, but if you save your data with write.table, it saves the data like that:

Read.count Percentage CDR3.nucleotide.sequence ... 100 .05 TACGATCGATCGATCGATCT ...

But, again, if you have your data in equal to twa/twb format, than you don't need to reload it, if I understood you correctly. You can just arrange your column so the first one will be for UMI counts, second one for UMI proportions and so one like in the manual, and then change column names, and that's all. Just make sure that your data don't contains factors.

Also I don't see any attachments, probably due to GitHub limitations. You can use https://gist.github.com to make text files. After making new text file you could just post a link to it here. Or you could e-mail me directly.

fkvh commented 9 years ago

Dear Vadim,

Thanks for your help. You are right, I had some factors in my structure. Now, I have fixed the format of my sample as below:

str(mydata) 'data.frame': 1860 obs. of 16 variables: $ Umi.count : logi NA NA NA NA NA NA ... $ Umi.proportion : logi NA NA NA NA NA NA ... $ read.count : int 55909 15603 7155 6599 3315 2689 2536 2527 2332 2307 ... $ Read.proportion : num 12.083 3.372 1.546 1.426 0.716 ... $ CDR3.nucleotide.sequence: chr "AGGCTGGAGTCGGCTGCTCCCTCCCAGACATCTGTGTACTTCTGTGCCAGCAGTGAAATGAGTCTGGAGCACTACACCTTCGGTTCG" "GTGAACGCCTTGTTGCTGGGGGACTCGGCCCTCTATCTCTGTGCCAGCAGCTCGCCAGTGGAGGGGGAAAAACTGTTTTTTGGCAGT" ... $ CDR3.amino.acid.sequence: chr "CASSEMSLEHYTF" "CASSSPVEGEKLFF" "CASSLVLRAEQYF" "CASSLGRQGRMNEQFF" ... $ V.gene : chr "TCRBV06-01" "TCRBV05-06" "TCRBV07-08" "TCRBV07-03" ... $ J.gene : chr "TCRBJ01-02" "TCRBJ01-04" "TCRBJ02-07" "TCRBJ02-01" ... $ D.gene : chr "TCRBD02-01" "TCRBD02-01" "unresolved" "TCRBD01-01" ... $ V.end : int 56 51 57 47 51 54 46 -2 55 36 ... $ J.start : int 71 66 68 66 60 62 62 69 66 65 ... $ D5.end : int 65 59 64 52 55 57 51 55 59 46 ... $ D3.end : int 68 64 -2 59 -2 60 -2 60 63 55 ... $ VD.insertions : int 8 7 6 4 3 2 4 0 3 9 ... $ DJ.insertions : int 2 1 0 6 0 1 0 8 2 9 ... $ Total.insertions : int 10 8 6 10 3 3 4 8 5 18 ...

Is there any way to add one extra sentence to my file which consider as "MiTCR" format & do parse.file? Any how as you recommended I will skip parse.file but I have 40 samples as above. How to parse them which can do pac.segments? Thanks again for your kind attention, Fatemeh

?

From: Vadim Nazarov notifications@github.com Sent: 21 May 2015 22:20 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Hi Fatemeh,

could you apply a function str to your data and show me the output please? If you have a twb-like data frames, than you can just use it without reloading with the parse.file function. I think parse.file isn't working here because it tries to skip first row, so it excepts input like that:

something-something MiTCR specific line Read.count Percentage CDR3.nucleotide.sequence ... 100 .05 TACGATCGATCGATCGATCT ...

, but if you save your data with write.table, it saves the data like that:

Read.count Percentage CDR3.nucleotide.sequence ... 100 .05 TACGATCGATCGATCGATCT ...

But, again, if you have your data in equal to twa/twb format, than you don't need to reload it, if I understood you correctly. You can just arrange your column so the first one will be for UMI counts, second one for UMI proportions and so one like in the manual, and then change column names, and that's all. Just make sure that your data don't contains factors.

Also I don't see any attachments, probably due to GitHub. You can use https://gist.github.com to make text files. After making new text file you could just post a link to it here.

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-104410235.

vadimnazarov commented 9 years ago

Hi Fatemeh,

You have third column in your data called "read.count", but it should be "Read.count" for the correct work of the package, e.g., like that: names(your_data)[3] <- "Read.count".

I see numbers in Read.proportion more than one, is it ok? your_data$Read.proportion <- your_data$Read.count / sum(your_data$Read.count)

Also you have non-IMGT nomenclature for genes (link to IMGT genes: http://www.imgt.org/IMGTrepertoire/LocusGenes/repertoires/human/TRB/TRBV/Hu_TRBVrep.html ), but tcR is based on IMGT nomenclature, so for the correct work of the package you could change gene names like that: your_data$V.gene <- paste0("TR", substring(your_data$V.gene, 4)) your_data$J.gene <- paste0("TR", substring(your_data$J.gene, 4)) your_data$D.gene <- paste0("TR", substring(your_data$D.gene, 4))

Do you need alleles, i.e., last two integers after the -? Sadly, tcR is not support them (probably, yet). if you don't, then you could remove them with

your_data$V.gene <- sapply(strsplit(your_data$V.gene, "-", T, F, T), "[[", 1) your_data$J.gene <- sapply(strsplit(your_data$J.gene, "-", T, F, T), "[[", 1) your_data$D.gene <- sapply(strsplit(your_data$D.gene, "-", T, F, T), "[[", 1)

I have code a function to do all of this with genes at once, you could see it here: https://gist.github.com/vadimnazarov/2ca4094ef02162efba2b

If you do need alleles, then just tell me this and I will add them as a separate gene list.

Regarding MiTCR output. There is a function "parse.cloneset" (see a help for it via ?parse.cloneset ), that as input receives parameters for parsing (like name of columns, separator and number of lines to skip) and outputs a tcR cloneset. If you have many files with equal structure (i.e., equal column names), you could use this function. For Umi count column you can supply .barcodes = NA if UMIs are not presented in your data. To supply your function to a folder you can do this (without renaming columns in each of your samples, you only need to save all of them to a folder in an unified way):

parseFun <- function (x) {
parse.cloneset(x, .barcodes = NA, .nuc.seq = "CDR3.nucleotide.sequence", ...) # specify here column names
}

parseFold <- function (folder) {
  lapply(list.files(folder), parseFun)
}

yourData <- parseFold("/path to your folder with data/")

If you have any trouble, you could send me column names of your samples and I will write for you a function for parsing them.

fkvh commented 9 years ago

Dear Vadim,

Thanks for all your help & sorry for my delay (I was on a course).

I have tried to follow what you suggested for my data (Have you ever used ImmunoSEQ output?)

I changed the structure as you told me. Now data looks like this:

str(mydata)

'data.frame': 1860 obs. of 16 variables:

$ Umi.count : logi NA NA NA NA NA NA ...

$ Umi.proportion : logi NA NA NA NA NA NA ...

$ Read.count : int 55909 15603 7155 6599 3315 2689 2536 2527 2332 2307 ...

$ Read.proportion : num 12.083 3.372 1.546 1.426 0.716 ...

$ CDR3.nucleotide.sequence: chr "AGGCTGGAGTCGGCTGCTCCCTCCCAGACATCTGTGTACTTCTGTGCCAGCAGTGAAATGAGTCTGGAGCACTACACCTTCGGTTCG" "GTGAACGCCTTGTTGCTGGGGGACTCGGCCCTCTATCTCTGTGCCAGCAGCTCGCCAGTGGAGGGGGAAAAACTGTTTTTTGGCAGT" "AAGATCCAGCGCACACAGCAGGAGGACTCCGCCGTGTATCTCTGTGCCAGCAGCTTAGTCCTACGGGCCGAGCAGTACTTCGGGCCG" "CGCACAGAGCGGGGGGACTCAGCCGTGTATCTCTGTGCCAGCAGCTTAGGTCGACAGGGGAGGATGAATGAGCAGTTCTTCGGGCCA" ...

$ CDR3.amino.acid.sequence: chr "CASSEMSLEHYTF" "CASSSPVEGEKLFF" "CASSLVLRAEQYF" "CASSLGRQGRMNEQFF" ...

$ V.gene : chr "TRBV06" "TRBV05" "TRBV07" "TRBV07" ...

$ J.gene : chr "TRBJ01" "TRBJ01" "TRBJ02" "TRBJ02" ...

$ D.gene : chr "TRBD02" "TRBD02" "TResolved" "TRBD01" ...

$ V.end : int 56 51 57 47 51 54 46 -2 55 36 ...

$ J.start : int 71 66 68 66 60 62 62 69 66 65 ...

$ D5.end : int 65 59 64 52 55 57 51 55 59 46 ...

$ D3.end : int 68 64 -2 59 -2 60 -2 60 63 55 ...

$ VD.insertions : int 8 7 6 4 3 2 4 0 3 9 ...

$ DJ.insertions : int 2 1 0 6 0 1 0 8 2 9 ...

$ Total.insertions : int 10 8 6 10 3 3 4 8 5 18 ...

In this function:

parseFun <- function (x) { parse.cloneset(x, .barcodes = NA, .nuc.seq = "CDR3.nucleotide.sequence", ...) # specify here column names }

I have made as:

parseFun <- function (x) {

parse.cloneset(x, .barcodes = NA, .nuc.seq = "CDR3.nucleotide.sequence", .aa.seq = "CDR3.amino.acid.sequence", .reads = "Read.count", .vgenes = "V.gene", .jgenes = "J.gene", .dgenes = "D.gene", .vend = "V.end", .jstart = "J.start",  .dalignments = "D3.end" .dalignments = "D5.end", .dalignments = "D3.end", .vd.insertions = "VD.insertions", .dj.insertions = "DJ.insertions", .total.insertions = "Total.insertions", .skip = 0, .sep = "\t")

}

But there is an error in .dalignments.

I have D3.end & D5.end (as in twb sample data).Shall I make dalignments = mydata$D3.end - mydata$D5.end

I hope I can parse my files. I have 40 samples in the same format which I need to parse them & go through PCA.

Thanks again for you kindly help.

Best regards,

Fatima

Fatemeh Kaveh, PhD Medical Genetics Department Oslo University Hospital (Ullevål) Kirkeveien 166 0407 Oslo, Norway Tel: +47 23016421 E-Mail: fatemeh.kaveh@medisin.uio.no

From: Vadim Nazarov notifications@github.com Sent: 22 May 2015 15:10 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Hi Fatemeh,

You have third column in your data called "read.count", but it should be "Read.count" for the correct work of the package, e.g., like that: names(your_data)[3] <- "Read.count".

I see numbers in Read.proportion more than one, is it ok? your_data$Read.proportion <- your_data$Read.count / sum(your_data$Read.count)

Also you have non-IMGT nomenclature for genes (link to IMGT genes: http://www.imgt.org/IMGTrepertoire/LocusGenes/repertoires/human/TRB/TRBV/Hu_TRBVrep.html ), but tcR is based on IMGT nomenclature, so for the correct work of the package you could change gene names like that: your_data$V.gene <- paste0("TR", substring(your_data$V.gene, 4)) your_data$J.gene <- paste0("TR", substring(your_data$J.gene, 4)) your_data$D.gene <- paste0("TR", substring(your_data$D.gene, 4))

Do you need alleles, i.e., last two integers after the -? Sadly, tcR is not support them (probably, yet). if you don't, then you could remove them with

your_data$V.gene <- sapply(strsplit(your_data$V.gene, "-", T, F, T), "[[", 1) your_data$J.gene <- sapply(strsplit(your_data$J.gene, "-", T, F, T), "[[", 1) your_data$D.gene <- sapply(strsplit(your_data$D.gene, "-", T, F, T), "[[", 1)

I have code a function to do all of this with genes at once, you could see it here: https://gist.github.com/vadimnazarov/2ca4094ef02162efba2b

If you do need alleles, then just tell me this and I will add them as a separate gene list.

Regarding MiTCR output. There is a function "parse.cloneset" (see a help for it via ?parse.cloneset ), that as input receives parameters for parsing (like name of columns, separator and number of lines to skip) and outputs a tcR cloneset. If you have many files with equal structure (i.e., equal column names), you could use this function. For Umi count column you can supply .barcodes = NA if UMIs are not presented in your data. To supply your function to a folder you can do this (without renaming columns in each of your samples, you only need to save all of them to a folder in an unified way):

parseFun <- function (x) { parse.cloneset(x, .barcodes = NA, .nuc.seq = "CDR3.nucleotide.sequence", ...) # specify here column names }

parseFold <- function (folder) { lapply(list.files(folder), parseFun) }

yourData <- parseFold("/path to your folder with data/")

If you have any trouble, you could send me column names of your samples and I will write for you a function for parsing them.

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-104655739.

vadimnazarov commented 9 years ago

Hello!

Ah, I see the problem more clearly now. So you are using ImmunoSEQ output, right? Could you please give me column names of the raw output ImmunoSEQ file? I will try to write a parser for it.

Another and probably faster way is to use VDJtools. It's also a software for TCR analysis, but it doesn't allows you to manipulate data like in R. So it depends on what you need for the analysis. You could install it here: http://vdjtools-doc.readthedocs.org/en/latest/install.html You could run VDJtools Convert utility and convert your ImmunoSEQ data to the VDJtools format and then load it to tcR if you need to. Converting is here: http://vdjtools-doc.readthedocs.org/en/latest/input.html#immunoseq] Parser for the VDJtools format will be available at the today's evening or tomorrow.

fkvh commented 9 years ago

Dear Vadim,

Thank you so much, that would be great if you help to parse ImmunoSEQ files. I really need pca in your package.

Here is the column names of original file from ImmunoSEQ

DataALD_1 = read.table("ALD_1_3986_In.tsv", header=TRUE, sep="\t", row.names=NULL, stringsAsFactors=F)

names(DataALD_1)

[1] "nucleotide" "aminoAcid" "count"

[4] "frequencyCount...." "cdr3Length" "vMaxResolved"

[7] "vFamilyName" "vGeneName" "vGeneAllele"

[10] "vFamilyTies" "vGeneNameTies" "vGeneAlleleTies"

[13] "dMaxResolved" "dFamilyName" "dGeneName"

[16] "dGeneAllele" "dFamilyTies" "dGeneNameTies"

[19] "dGeneAlleleTies" "jMaxResolved" "jFamilyName"

[22] "jGeneName" "jGeneAllele" "jFamilyTies"

[25] "jGeneNameTies" "jGeneAlleleTies" "vDeletion"

[28] "n1Insertion" "d5Deletion" "d3Deletion"

[31] "n2Insertion" "jDeletion" "vIndex"

[34] "n1Index" "dIndex" "n2Index"

[37] "jIndex" "estimatedNumberGenomes" "sequenceStatus"

[40] "cloneResolved" "vOrphon" "dOrphon"

[43] "jOrphon" "vFunction" "dFunction"

[46] "jFunction" "fractionNucleated"

here is the structure:

str(DataALD_1)

'data.frame': 1860 obs. of 47 variables:

$ nucleotide : chr "AGGCTGGAGTCGGCTGCTCCCTCCCAGACATCTGTGTACTTCTGTGCCAGCAGTGAAATGAGTCTGGAGCACTACACCTTCGGTTCG" "GTGAACGCCTTGTTGCTGGGGGACTCGGCCCTCTATCTCTGTGCCAGCAGCTCGCCAGTGGAGGGGGAAAAACTGTTTTTTGGCAGT" "AAGATCCAGCGCACACAGCAGGAGGACTCCGCCGTGTATCTCTGTGCCAGCAGCTTAGTCCTACGGGCCGAGCAGTACTTCGGGCCG" "CGCACAGAGCGGGGGGACTCAGCCGTGTATCTCTGTGCCAGCAGCTTAGGTCGACAGGGGAGGATGAATGAGCAGTTCTTCGGGCCA" ...

$ aminoAcid : chr "CASSEMSLEHYTF" "CASSSPVEGEKLFF" "CASSLVLRAEQYF" "CASSLGRQGRMNEQFF" ...

$ count : int 55909 15603 7155 6599 3315 2689 2536 2527 2332 2307 ...

$ frequencyCount.... : num 12.083 3.372 1.546 1.426 0.716 ...

$ cdr3Length : int 39 42 39 48 36 42 48 42 39 54 ...

$ vMaxResolved : chr "TCRBV06-01_01" "TCRBV05-06_01" "TCRBV07-08_01" "TCRBV07-03_01" ...

$ vFamilyName : chr "TCRBV06" "TCRBV05" "TCRBV07" "TCRBV07" ...

$ vGeneName : chr "TCRBV06-01" "TCRBV05-06" "TCRBV07-08" "TCRBV07-03" ...

$ vGeneAllele : int 1 1 1 1 NA 1 NA 1 NA NA ...

$ vFamilyTies : chr "" "" "" "" ...

$ vGeneNameTies : chr "" "" "" "" ...

$ vGeneAlleleTies : chr "" "" "" "" ...

$ dMaxResolved : chr "TCRBD02-01_02" "TCRBD02-01_02" "unresolved" "TCRBD01-01*01" ...

$ dFamilyName : chr "TCRBD02" "TCRBD02" "" "TCRBD01" ...

$ dGeneName : chr "TCRBD02-01" "TCRBD02-01" "unresolved" "TCRBD01-01" ...

$ dGeneAllele : int 2 2 NA 1 1 NA NA 1 1 1 ...

$ dFamilyTies : chr "" "" "TCRBD01,TCRBD02" "" ...

$ dGeneNameTies : chr "" "" "TCRBD01-01,TCRBD02-01" "" ...

$ dGeneAlleleTies : chr "" "" "" "" ...

$ jMaxResolved : chr "TCRBJ01-02_01" "TCRBJ01-04_01" "TCRBJ02-07_01" "TCRBJ02-01_01" ...

$ jFamilyName : chr "TCRBJ01" "TCRBJ01" "TCRBJ02" "TCRBJ02" ...

$ jGeneName : chr "TCRBJ01-02" "TCRBJ01-04" "TCRBJ02-07" "TCRBJ02-01" ...

$ jGeneAllele : int 1 1 1 1 1 1 1 1 1 1 ...

$ jFamilyTies : logi NA NA NA NA NA NA ...

$ jGeneNameTies : logi NA NA NA NA NA NA ...

$ jGeneAlleleTies : logi NA NA NA NA NA NA ...

$ vDeletion : int 2 3 1 2 10 0 3 1 3 7 ...

$ n1Insertion : int 8 7 6 4 3 2 4 0 3 9 ...

$ d5Deletion : int 10 10 8 2 2 6 0 2 5 0 ...

$ d3Deletion : int 2 0 0 2 5 2 5 4 2 2 ...

$ n2Insertion : int 2 1 0 6 0 1 0 8 2 9 ...

$ jDeletion : int 10 8 6 7 0 0 1 7 5 4 ...

$ vIndex : int 42 39 42 33 45 39 33 39 42 27 ...

$ n1Index : int 57 52 58 48 52 55 47 -1 56 37 ...

$ dIndex : int 65 59 64 52 55 57 51 55 59 46 ...

$ n2Index : int 69 65 -1 60 -1 61 -1 61 64 56 ...

$ jIndex : int 71 66 68 66 60 62 62 69 66 65 ...

$ estimatedNumberGenomes: int 363 97 44 44 17 16 16 14 12 15 ...

$ sequenceStatus : chr "In" "In" "In" "In" ...

$ cloneResolved : chr "VDJ" "VDJ" "VDJ" "VDJ" ...

$ vOrphon : logi NA NA NA NA NA NA ...

$ dOrphon : logi NA NA NA NA NA NA ...

$ jOrphon : logi NA NA NA NA NA NA ...

$ vFunction : logi NA NA NA NA NA NA ...

$ dFunction : logi NA NA NA NA NA NA ...

$ jFunction : logi NA NA NA NA NA NA ...

$ fractionNucleated : num 0.005104 0.001424 0.000653 0.000602 0.000303 ...

These may help into converting:

read.count => count

Percentage => frequencyCount(%)

CDR3.nucleotide.sequence => nucleotide

CDR3.amino.acid.sequence => aminoAcid

V.alleles => vMaxResolved => WE DO NOT NEED IT

V.segments => vGeneName

J.alleles => jMaxResolved => WE DO NOT NEED IT

J.segments => jGeneName

D.alleles => dGeneName => WE DO NOT NEED IT

D.segmenst => dFamilyName

Last.V.nucleotide.position => n1Index

First.D.nucleotide.position => dIndex

Last.D.nucleotide.position => n2Index

First.J.nucleotide.position => jIndex

VD.insertions => n1Insertion

DJ.insertions => n2Insertion

Total.insertions => SUM(n1Insertion + n2Insertion)

I have selected & changed data as below:

ALD_1 = DataALD_1[,c(3,4,1,2,8,22,15,34,37,35,36,28,31)]

Total.insertions = ALD_1$n1Insertion + ALD_1$n2Insertion

mydata = cbind(ALD_1,Total.insertions)

dim(mydata)

1860 14

For new version 2.0 we need to add two NA columns in the begining

data1 = as.data.frame(append(mydata, list(C = NA), after = 0))

data2 = as.data.frame(append(data1, list(D = NA), after = 0))

dim(data2)

1860 16

mydata = data2

Now we change the column names based on MiTCR column names:

colnames(mydata) = c("Umi.count", "Umi.proportion", "Read.count", "Read.proportion", "CDR3.nucleotide.sequence", "CDR3.amino.acid.sequence", "V.gene", "J.gene", "D.gene", "V.end", "J.start", "D5.end", "D3.end", "VD.insertions", "DJ.insertions", "Total.insertions")

Note: V.end column should be "V.end-1" or "n1Index -1", e.g. If n1Index is 47, the last V nucleotide position is 47-1=46

mydata$V.end = mydata$V.end -1

Note: D3.end should be "D3.end-1" or "n2Index-1", e.g. If n2Index is 63, the last D nucleotide position is 63-1=62

mydata$D3.end = mydata$D3.end -1

To sort based on Percentage(%) ("-" does the sorting in reverse order)

sort_mydata = mydata[order(-mydata$Read.proportion),]

dim(sort_mydata)

1860 16

mydata = sort_mydata

i <- sapply(mydata, is.factor)

mydata[i] <- lapply(mydata[i], as.character)

mydata$D3.end = as.integer(mydata$D3.end)

mydata$V.end = as.integer(mydata$V.end)

AS you recommended:

mydata$V.gene <- paste0("TR", substring(mydata$V.gene, 4))

mydata$J.gene <- paste0("TR", substring(mydata$J.gene, 4))

mydata$D.gene <- paste0("TR", substring(mydata$D.gene, 4))

mydata$V.gene <- sapply(strsplit(mydata$V.gene, "-", T, F, T), "[[", 1)

mydata$J.gene <- sapply(strsplit(mydata$J.gene, "-", T, F, T), "[[", 1)

mydata$D.gene <- sapply(strsplit(mydata$D.gene, "-", T, F, T), "[[", 1)

Thanks again, Fatima

?

From: Vadim Nazarov notifications@github.com Sent: 09 June 2015 13:57 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Hello!

Ah, I see the problem more clearly now. So you are using ImmunoSEQ output, right? Could you please give me column names of the raw output ImmunoSEQ file? I will try to write a parser for it.

Another and probably faster way is to use VDJtools. It's also a software for TCR analysis, but it doesn't allows you to manipulate data in R. So it depends on what you need for the analysis. You could install it here: http://vdjtools-doc.readthedocs.org/en/latest/install.html You could run VDJtools Convert utility and convert your ImmunoSEQ data to VDJtools format and then load it to tcR if you need to. Converting is here: http://vdjtools-doc.readthedocs.org/en/latest/input.html#immunoseq

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-110330860.

vadimnazarov commented 9 years ago

Dear Fatemeh,

I've just added the new ImmunoSEQ parser (parse.immunoseq function). It trims nucleotide sequences to CDR3 only, translates noncoding sequences and fixes V-D-J gene names and positions. I need to check genes boundaries and maybe correct them by substracting 1 (as you said), but despite this it works. Please feel free to contact me if you have any problems with this function. I'm not sure that I fully understand this format because on the data that I've got there are many unresolved genes (~50%, the "Ambiguous" row) despite that the vFamilyName column is filled, and some other issues.

> geneUsage(parse.immunoseq("~/Downloads/immunoseq.tsv"), HUMAN_TRBV, .ambig = T)
        Gene Sample
1  Ambiguous  18570
2   TRBV10-1    280
3   TRBV10-2    140
4   TRBV10-3    852
5   TRBV11-1    112
6   TRBV11-2    708
7   TRBV11-3    341
8   TRBV12-3      0
9   TRBV12-4      0
10  TRBV12-5    311
11    TRBV13      0
12    TRBV14      0
13    TRBV15      0
14    TRBV16      0
15    TRBV18      0
16    TRBV19      0
17     TRBV2      0
18  TRBV20-1    182
19  TRBV21-1    664
20  TRBV23-1    251
21  TRBV24-1      0
22  TRBV25-1    331
23    TRBV27      0
24    TRBV28      0
25  TRBV29-1    687
26   TRBV3-1      2
27    TRBV30      0
28   TRBV4-1    492
29   TRBV4-2    331
30   TRBV4-3    776
31   TRBV5-1   1845
32   TRBV5-4    802
33   TRBV5-5    153
34   TRBV5-6    584
35   TRBV5-8    188
36   TRBV6-1    776
37   TRBV6-2      0
38   TRBV6-3      0
39   TRBV6-4    292
40   TRBV6-5   1019
41   TRBV6-6    453
42   TRBV6-7     55
43   TRBV7-1      1
44   TRBV7-2   1717
45   TRBV7-3    582
46   TRBV7-4     50
47   TRBV7-6    336
48   TRBV7-7    165
49   TRBV7-8    493
50   TRBV7-9   1292
51     TRBV9      0

fkvh commented 9 years ago

Dear Vadim,

Thank you, parse.immunoseq function works for each sample. As you know from the beginning the aim is to look at PCA plot for samples (e.g 30 samples: divided into 3 groups), with out text & possibility to give different colours to each group

I am wondering how to merge those parse files in one file as you have "parse.folder" function which merges all samples into one variable. Then pca.segments can be applied.

Thanks again for your kindly help.

Fatima

From: Vadim Nazarov notifications@github.com Sent: 12 June 2015 21:21 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Dear Fatemeh,

I've just added the new ImmunoSEQ parser (parse.immunoseq function). It trims nucleotide sequences to CDR3 only, translates noncoding sequences and fixes V-D-J gene names and positions. I need to check genes boundaries and maybe correct them by substracting 1 (as you said), but despite this it works. Please feel free to contact me if you have any problems with this function. I'm not sure that I fully understand this format because on the data that I've got there are many unresolved genes (~50%, the "Ambiguous" row) despite that the vFamilyName column is filled, and some other issues.

geneUsage(parse.immunoseq("~/Downloads/immunoseq.tsv"), HUMAN_TRBV, .ambig = T) Gene Sample 1 Ambiguous 18570 2 TRBV10-1 280 3 TRBV10-2 140 4 TRBV10-3 852 5 TRBV11-1 112 6 TRBV11-2 708 7 TRBV11-3 341 8 TRBV12-3 0 9 TRBV12-4 0 10 TRBV12-5 311 11 TRBV13 0 12 TRBV14 0 13 TRBV15 0 14 TRBV16 0 15 TRBV18 0 16 TRBV19 0 17 TRBV2 0 18 TRBV20-1 182 19 TRBV21-1 664 20 TRBV23-1 251 21 TRBV24-1 0 22 TRBV25-1 331 23 TRBV27 0 24 TRBV28 0 25 TRBV29-1 687 26 TRBV3-1 2 27 TRBV30 0 28 TRBV4-1 492 29 TRBV4-2 331 30 TRBV4-3 776 31 TRBV5-1 1845 32 TRBV5-4 802 33 TRBV5-5 153 34 TRBV5-6 584 35 TRBV5-8 188 36 TRBV6-1 776 37 TRBV6-2 0 38 TRBV6-3 0 39 TRBV6-4 292 40 TRBV6-5 1019 41 TRBV6-6 453 42 TRBV6-7 55 43 TRBV7-1 1 44 TRBV7-2 1717 45 TRBV7-3 582 46 TRBV7-4 50 47 TRBV7-6 336 48 TRBV7-7 165 49 TRBV7-8 493 50 TRBV7-9 1292 51 TRBV9 0

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-111593762.

vadimnazarov commented 9 years ago

Dear Fatemeh,

I'm very glad that all is working ok. For parsing the entire folder you just call the function like this:

your_Data <- parse.folder(path, "immunoseq").

I forgot to add "immunoseq" to the parse.folder function, but I fixed it just now.

fkvh commented 9 years ago

Dear Vadim,

Thanks for your help. Here I get some error. Please, correct me if some thing is wrong.

I tried different methods to save parse file:

E.g:

Sample1 = parse.immunoseq("Sample1_In.tsv") Sample2 = parse.immunoseq("Sample2_In.tsv")

How should I save them?

I have tried as below but every time there is error in parse.folder:

save(Sample1, file = "/Users/Results/Sample1.rda") or save(Sample1, file = "/Users/Results/Sample1.tsv") or write.table(Sample1, file = "Sample1.csv", quote=F, sep="\t", row.names = FALSE, col.names=FALSE, qmethod = c("escape", "double"), fileEncoding = "")

the same as for Sample2, ....

But when I apply parse.folder functions you said: your_data <- parse.folder("/Users/Results", "immunoseq")

I get again error for

Error in [.data.frame(df, , make.names(.reads)) :

undefined columns selected?

Can you help me? Best regards, Fatima

I also updated the last version so immunoseq is added into parse.folder.

From: Vadim Nazarov notifications@github.com Sent: 14 June 2015 10:48 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Dear Fatemeh,

I'm very glad that all is working ok. For parsing the entire folder you just call the function like this:

your_Data <- parse.folder(path, "immunoseq").

I forgot to add "immunoseq" to the parse.folder function, but I just fixed it.

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-111800695.

fkvh commented 9 years ago

Hi again,

I tested for parse.file.list & it worked.

Thank you so much

Fatima

Fatemeh Kaveh, PhD Medical Genetics Department Oslo University Hospital (Ullevål) Kirkeveien 166 0407 Oslo, Norway Tel: +47 23016421 E-Mail: fatemeh.kaveh@medisin.uio.no

From: Vadim Nazarov notifications@github.com Sent: 14 June 2015 10:48 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Dear Fatemeh,

I'm very glad that all is working ok. For parsing the entire folder you just call the function like this:

your_Data <- parse.folder(path, "immunoseq").

I forgot to add "immunoseq" to the parse.folder function, but I just fixed it.

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-111800695.

vadimnazarov commented 9 years ago

Hi Fatemeh,

I'm glad that all working now. I don't know why your_data <- parse.folder("/Users/Results", "immunoseq") didn't work in this particular case, maybe there was hidden non-repertoire files in this folder. Please let me know if you encounter any problems with parse.folder or other functions in the future. Also could you tell me please if PCA is working on your data? So I can close this issue.

fkvh commented 9 years ago

Dear Vadim,

Yes, PCA works for my data.

I just have one more question:

For my immunoseq samples I would like to extract top 100 ones sorted by frequencyCount column.

In R when I read the data as:

sample1 = read.table("sample1.tsv", header=TRUE, sep="\t", row.names=NULL, stringsAsFactors=F)

Then sort

sort_sample1 = sample1[order(-sample1$frequencyCount),]

Then select

sample1_100 = sort_sample1[1:100, ]

write.table(sample1_100, file = "sample1_100.tsv", quote=F, sep="\t", row.names = FALSE, col.names=FALSE, qmethod = c("escape", "double"), fileEncoding = "")

Then again there are errors. Still looks like that it reads the first line of data as header. Am I right? Do you have any suggestion?

I am also interested to look at alleles level but we can open a new case. Thanks again for all your patient & kind attention. Best regards, fatemeh

From: Vadim Nazarov notifications@github.com Sent: 15 June 2015 15:46 To: imminfo/tcr Cc: Fatemeh Kaveh Subject: Re: [tcr] pca.segments (ggplot color & names) (#6)

Hi Fatemeh,

I'm glad that all working now. I don't know why your_data <- parse.folder("/Users/Results", "immunoseq") didn't work in this particular case, maybe there was hidden non-repertoire files in this folder. Please let me know if you encounter any problems with parse.folder or other functions in the future. Also could you tell me please if PCA is working on your data? So I can close this issue.

Reply to this email directly or view it on GitHubhttps://github.com/imminfo/tcr/issues/6#issuecomment-112076717.

vadimnazarov commented 9 years ago

Hi Fatemeh,

It seems that error is here: write.table(sample1_100, file = "sample1_100.tsv", quote=F, sep="\t", row.names = FALSE, col.names=FALSE, qmethod = c("escape", "double"), fileEncoding = "") because the function tries to read the first line as a header, as you said, but encounters no good column names. I'm pretty sure that changing col.names=FALSE to col.names=T would fix the problem.

But are there some specific needs to do this? You could directly load ImmunoSEQ data and sort it with R:

# load the data
sample1 = parse.immunoseq("sample1.tsv")
# sort it, but I'm sure that data is already sorted
sample1 <- sample1[order(sample1$Read.proportion, decreasing = T), ]
# get the first 100 rows
sample1.100 <- head(sample1, 100)

imminfo / tcr

pca.segments (ggplot color & names) #6

read.count => count

Percentage => frequencyCount(%)

CDR3.nucleotide.sequence => nucleotide

CDR3.amino.acid.sequence => aminoAcid

V.alleles => vMaxResolved => WE DO NOT NEED IT

V.segments => vGeneName

J.alleles => jMaxResolved => WE DO NOT NEED IT

J.segments => jGeneName

D.alleles => dGeneName => WE DO NOT NEED IT

D.segmenst => dFamilyName

Last.V.nucleotide.position => n1Index

First.D.nucleotide.position => dIndex

Last.D.nucleotide.position => n2Index

First.J.nucleotide.position => jIndex

VD.insertions => n1Insertion

DJ.insertions => n2Insertion

Total.insertions => SUM(n1Insertion + n2Insertion)

1860 14

For new version 2.0 we need to add two NA columns in the begining

1860 16

Now we change the column names based on MiTCR column names:

Note: V.end column should be "V.end-1" or "n1Index -1", e.g. If n1Index is 47, the last V nucleotide position is 47-1=46

Note: D3.end should be "D3.end-1" or "n2Index-1", e.g. If n2Index is 63, the last D nucleotide position is 63-1=62

To sort based on Percentage(%) ("-" does the sorting in reverse order)

1860 16

AS you recommended:

How should I save them?

the same as for Sample2, ....