labeling of samples in PCA function from xcms tutorial

gmhhope commented 3 years ago

Hi Johanna,

I observed something that I am really concerned about in the PCA function (https://www.bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html). I sincerely hope you can help me address it.

In the xdata, the data are randomized as you can see here:

> xdata$sample_name
 [1] "HEU_HUU_082_001" "HEU_HUU_166_001" "HEU_HEU_234_001" "HEU_HEU_210_001" "HEU_HUU_169_001" "HEU_HEU_223_001" "HEU_HEU_188_001"
 [8] "HEU_HEU_229_001" "HEU_HEU_190_001" "HEU_HUU_164_001" "HEU_HUU_152_001" "HEU_HUU_161_001" "HEU_HUU_167_001" "HEU_HEU_078_001"
[15] "HEU_HUU_149_001" "HEU_HUU_081_001" "HEU_HUU_162_001" "HEU_HEU_075_001" "HEU_HEU_203_001" "HEU_HEU_232_001" "HEU_HUU_080_001"
[22] "HEU_HEU_231_001" "HEU_HUU_154_001" "HEU_HUU_153_001" "HEU_HEU_195_001" "HEU_HEU_222_001" "HEU_HUU_168_001" "HEU_HEU_194_001"
[29] "HEU_HUU_155_001" "HEU_HEU_189_001" "HEU_HEU_191_001" "HEU_HUU_170_001" "HEU_HUU_160_001" "HEU_HEU_235_001" "HEU_HUU_158_001"
[36] "HEU_HUU_165_001" "HEU_HUU_171_001" "HEU_HEU_079_001" "HEU_HEU_067_001" "HEU_HEU_226_001"

> xdata$sample_group
[1] "HUU" "HUU" "HEU" "HEU" "HUU" "HEU" "HEU" "HEU" "HEU" "HUU" "HUU" "HUU" "HUU" "HEU" "HUU" "HUU" "HUU" "HEU" "HEU" "HEU" "HUU" "HEU"
[23] "HUU" "HUU" "HEU" "HEU" "HUU" "HEU" "HUU" "HEU" "HEU" "HUU" "HUU" "HEU" "HUU" "HUU" "HUU" "HEU" "HEU" "HEU"

However, when I do either quantify(xdata, value = "into") or featureValues(xdata, value = "into") the colnames of the dataframe it turns out is:

> colnames(assay(res,'raw'))
[1] "HEU_HEU_067_001.mzML" "HEU_HEU_075_001.mzML" "HEU_HEU_078_001.mzML" "HEU_HEU_079_001.mzML" "HEU_HEU_188_001.mzML"
 [6] "HEU_HEU_189_001.mzML" "HEU_HEU_190_001.mzML" "HEU_HEU_191_001.mzML" "HEU_HEU_194_001.mzML" "HEU_HEU_195_001.mzML"
[11] "HEU_HEU_203_001.mzML" "HEU_HEU_210_001.mzML" "HEU_HEU_222_001.mzML" "HEU_HEU_223_001.mzML" "HEU_HEU_226_001.mzML"
[16] "HEU_HEU_229_001.mzML" "HEU_HEU_231_001.mzML" "HEU_HEU_232_001.mzML" "HEU_HEU_234_001.mzML" "HEU_HEU_235_001.mzML"
[21] "HEU_HUU_080_001.mzML" "HEU_HUU_081_001.mzML" "HEU_HUU_082_001.mzML" "HEU_HUU_149_001.mzML" "HEU_HUU_152_001.mzML"
[26] "HEU_HUU_153_001.mzML" "HEU_HUU_154_001.mzML" "HEU_HUU_155_001.mzML" "HEU_HUU_158_001.mzML" "HEU_HUU_160_001.mzML"
[31] "HEU_HUU_161_001.mzML" "HEU_HUU_162_001.mzML" "HEU_HUU_164_001.mzML" "HEU_HUU_165_001.mzML" "HEU_HUU_166_001.mzML"
[36] "HEU_HUU_167_001.mzML" "HEU_HUU_168_001.mzML" "HEU_HUU_169_001.mzML" "HEU_HUU_170_001.mzML" "HEU_HUU_171_001.mzML"

So as you can see, the ordering is different. the PCA script using the

> res <- quantify(xdata, value = "into")
> ft_ints <- log2(assay(res, "raw"))
> pc <- prcomp(t(na.omit(ft_ints)), center = TRUE)
> cols <- group_colors[xdata$sample_group]

The result use the dataframe for PCA, but the labeling inherit the sample_group order in XCMS object xdata. However, the labelings of dataframe have been reordered based on alphabet?!

I don't know if I have done something wrong, or maybe the colnames of the dataframe are wrong? Please help me with this.

Thanks very much! I do appreciate all your assistance!

Best regards, Minghao Gong

jorainer commented 3 years ago

Hm, indeed that does not look good. Can you check also if basename(fileNames(xdata)) matches with what you get from pData(xdata)? Each row in pData(xdata) is supposed to describe the sample in file fileNames(xdata).

gmhhope commented 3 years ago

> pData(xdata)
       sample_name sample_group
1  HEU_HUU_082_001          HUU
2  HEU_HUU_166_001          HUU
3  HEU_HEU_234_001          HEU
4  HEU_HEU_210_001          HEU
5  HEU_HUU_169_001          HUU
6  HEU_HEU_223_001          HEU
7  HEU_HEU_188_001          HEU
8  HEU_HEU_229_001          HEU
9  HEU_HEU_190_001          HEU
10 HEU_HUU_164_001          HUU
11 HEU_HUU_152_001          HUU
12 HEU_HUU_161_001          HUU
13 HEU_HUU_167_001          HUU
14 HEU_HEU_078_001          HEU
15 HEU_HUU_149_001          HUU
16 HEU_HUU_081_001          HUU
17 HEU_HUU_162_001          HUU
18 HEU_HEU_075_001          HEU
19 HEU_HEU_203_001          HEU
20 HEU_HEU_232_001          HEU
21 HEU_HUU_080_001          HUU
22 HEU_HEU_231_001          HEU
23 HEU_HUU_154_001          HUU
24 HEU_HUU_153_001          HUU
25 HEU_HEU_195_001          HEU
26 HEU_HEU_222_001          HEU
27 HEU_HUU_168_001          HUU
28 HEU_HEU_194_001          HEU
29 HEU_HUU_155_001          HUU
30 HEU_HEU_189_001          HEU
31 HEU_HEU_191_001          HEU
32 HEU_HUU_170_001          HUU
33 HEU_HUU_160_001          HUU
34 HEU_HEU_235_001          HEU
35 HEU_HUU_158_001          HUU
36 HEU_HUU_165_001          HUU
37 HEU_HUU_171_001          HUU
38 HEU_HEU_079_001          HEU
39 HEU_HEU_067_001          HEU
40 HEU_HEU_226_001          HEU

> basename(fileNames(xdata))
 [1] "HEU_HEU_067_001.mzML" "HEU_HEU_075_001.mzML" "HEU_HEU_078_001.mzML" "HEU_HEU_079_001.mzML" "HEU_HEU_188_001.mzML" "HEU_HEU_189_001.mzML"
 [7] "HEU_HEU_190_001.mzML" "HEU_HEU_191_001.mzML" "HEU_HEU_194_001.mzML" "HEU_HEU_195_001.mzML" "HEU_HEU_203_001.mzML" "HEU_HEU_210_001.mzML"
[13] "HEU_HEU_222_001.mzML" "HEU_HEU_223_001.mzML" "HEU_HEU_226_001.mzML" "HEU_HEU_229_001.mzML" "HEU_HEU_231_001.mzML" "HEU_HEU_232_001.mzML"
[19] "HEU_HEU_234_001.mzML" "HEU_HEU_235_001.mzML" "HEU_HUU_080_001.mzML" "HEU_HUU_081_001.mzML" "HEU_HUU_082_001.mzML" "HEU_HUU_149_001.mzML"
[25] "HEU_HUU_152_001.mzML" "HEU_HUU_153_001.mzML" "HEU_HUU_154_001.mzML" "HEU_HUU_155_001.mzML" "HEU_HUU_158_001.mzML" "HEU_HUU_160_001.mzML"
[31] "HEU_HUU_161_001.mzML" "HEU_HUU_162_001.mzML" "HEU_HUU_164_001.mzML" "HEU_HUU_165_001.mzML" "HEU_HUU_166_001.mzML" "HEU_HUU_167_001.mzML"
[37] "HEU_HUU_168_001.mzML" "HEU_HUU_169_001.mzML" "HEU_HUU_170_001.mzML" "HEU_HUU_171_001.mzML"

@jorainer It looks like they don't match. Please help further to avoid any mistakes I made. I recently tried to wrap up some of the data for collaborators.

Thanks, Minghao

gmhhope commented 3 years ago

I tried to avoid the mislabeling in PCA by rearranging the columns of tables to match the order of xdata$sample_name. However, I am concerned if the colnames of ft_ints were wrong in itself, which at this stage I gave full trust on the quantify function. Thanks very much for your help!

> ft_ints <- log2(assay(res, "raw")) 
> colnames(ft_ints)
[1] "HEU_HEU_067_001.mzML" "HEU_HEU_075_001.mzML" "HEU_HEU_078_001.mzML" "HEU_HEU_079_001.mzML" "HEU_HEU_188_001.mzML" "HEU_HEU_189_001.mzML"
 [7] "HEU_HEU_190_001.mzML" "HEU_HEU_191_001.mzML" "HEU_HEU_194_001.mzML" "HEU_HEU_195_001.mzML" "HEU_HEU_203_001.mzML" "HEU_HEU_210_001.mzML"
[13] "HEU_HEU_222_001.mzML" "HEU_HEU_223_001.mzML" "HEU_HEU_226_001.mzML" "HEU_HEU_229_001.mzML" "HEU_HEU_231_001.mzML" "HEU_HEU_232_001.mzML"
[19] "HEU_HEU_234_001.mzML" "HEU_HEU_235_001.mzML" "HEU_HUU_080_001.mzML" "HEU_HUU_081_001.mzML" "HEU_HUU_082_001.mzML" "HEU_HUU_149_001.mzML"
[25] "HEU_HUU_152_001.mzML" "HEU_HUU_153_001.mzML" "HEU_HUU_154_001.mzML" "HEU_HUU_155_001.mzML" "HEU_HUU_158_001.mzML" "HEU_HUU_160_001.mzML"
[31] "HEU_HUU_161_001.mzML" "HEU_HUU_162_001.mzML" "HEU_HUU_164_001.mzML" "HEU_HUU_165_001.mzML" "HEU_HUU_166_001.mzML" "HEU_HUU_167_001.mzML"
[37] "HEU_HUU_168_001.mzML" "HEU_HUU_169_001.mzML" "HEU_HUU_170_001.mzML" "HEU_HUU_171_001.mzML"

> ft_ints <- ft_ints[,paste(xdata$sample_name,".mzML", sep = "")]
> colnames(ft_ints)
 [1] "HEU_HUU_082_001.mzML" "HEU_HUU_166_001.mzML" "HEU_HEU_234_001.mzML" "HEU_HEU_210_001.mzML" "HEU_HUU_169_001.mzML" "HEU_HEU_223_001.mzML"
 [7] "HEU_HEU_188_001.mzML" "HEU_HEU_229_001.mzML" "HEU_HEU_190_001.mzML" "HEU_HUU_164_001.mzML" "HEU_HUU_152_001.mzML" "HEU_HUU_161_001.mzML"
[13] "HEU_HUU_167_001.mzML" "HEU_HEU_078_001.mzML" "HEU_HUU_149_001.mzML" "HEU_HUU_081_001.mzML" "HEU_HUU_162_001.mzML" "HEU_HEU_075_001.mzML"
[19] "HEU_HEU_203_001.mzML" "HEU_HEU_232_001.mzML" "HEU_HUU_080_001.mzML" "HEU_HEU_231_001.mzML" "HEU_HUU_154_001.mzML" "HEU_HUU_153_001.mzML"
[25] "HEU_HEU_195_001.mzML" "HEU_HEU_222_001.mzML" "HEU_HUU_168_001.mzML" "HEU_HEU_194_001.mzML" "HEU_HUU_155_001.mzML" "HEU_HEU_189_001.mzML"
[31] "HEU_HEU_191_001.mzML" "HEU_HUU_170_001.mzML" "HEU_HUU_160_001.mzML" "HEU_HEU_235_001.mzML" "HEU_HUU_158_001.mzML" "HEU_HUU_165_001.mzML"
[37] "HEU_HUU_171_001.mzML" "HEU_HEU_079_001.mzML" "HEU_HEU_067_001.mzML" "HEU_HEU_226_001.mzML"

jorainer commented 3 years ago

What I would suggest is that you define a data table (in excel or as a csv file or whatever) that contains the names of the files of your experiment as well as additional information (like group etc) on each of them. In your case it could have the 3 columns "file_name", "sample_name" and "sample_group". you can then read this table in (either with read.table if its a csv file or with read_xlsx from the readxl package if it's an xlsx file - just be sure to use as.data.frame if you use read_xlsx, e.g. pd <- as.data.frame(read_xlsx(... your xlsx file...))

the code could be look similar to below:

path <- "...path to the mzML files..."
pd <- read.table("... your file table...", header = TRUE)
data <- readMSData(paste0(path, "/", pd$file_name), pdata = new("NAnnotatedDataFrame", pd))

that way you can be sure that the sample description fit the sample files. In your case it seems that there is a misalignment

gmhhope commented 3 years ago

I think I follow what you did in the tutorial like this:

folder_path = "Path/to/my/mzML/folder/"
mzMLs <- list.files(folder_path, full.names=TRUE)

print(mzMLs[1:6]) 
[1] "/../HILICpos_exp//HEU_HEU_067_001.mzML"
[2] "/../HILICpos_exp//HEU_HEU_075_001.mzML"
[3] "/../HILICpos_exp//HEU_HEU_078_001.mzML"
[4] "/../HILICpos_exp//HEU_HEU_079_001.mzML"
[5] "/../HILICpos_exp//HEU_HEU_188_001.mzML"
[6] "/../HILICpos_exp//HEU_HEU_189_001.mzML"

Here by using list.files, the file order is following alphabet.

metadata <- read.csv("Path/to/my/sequence/csv/file", stringsAsFactors = FALSE)
colnames(metadata)

pd = metadata[,c("Sample.ID","group")]
colnames(pd) = c("sample_name","sample_group")

head(pd)
      sample_name sample_group
1 HEU_HUU_082_001          HUU
2 HEU_HUU_166_001          HUU
3 HEU_HEU_234_001          HEU
4 HEU_HEU_210_001          HEU
5 HEU_HUU_169_001          HUU
6 HEU_HEU_223_001          HEU

raw_data <- readMSData(files = mzMLs, pdata = new("NAnnotatedDataFrame", pd),
                       mode = "onDisk")

However, the pdata is definitely following the metadata sequence order. Will these two matched together inside the program by its file names. As you can see the sample_name in pd is matched with the file names (excluding the string .mzML) in the mzML folder.

Please do try let me know what you think as soon as possible. It is crucial to make sure i don't get the wrong labelled feature tables. That is very important for me. Thanks a lot for your continuous assistance! Very appreciate it!

gmhhope commented 3 years ago

The only difference I observe here is you have the the file type(e.g., "HEU_HEU_067_001.mzML") included in pd. And I didn't include the file extension in the pd ("HEU_HEU_067_001"). However, I didn't receive any warnings or errors saying failing to align pd with the files when I perform the following line.

raw_data <- readMSData(files = mzMLs, pdata = new("NAnnotatedDataFrame", pd),
                       mode = "onDisk")

That can be very misleading if that is where the problem. Let me know. Thanks very much!

Best, Minghao

jorainer commented 3 years ago

with list.files(folder_path, full.names=TRUE) you get all mzML files in alphabetic order. In your pd you have the samples not in alphabetic order. It is common practive to have a (tab delimited text, csv or xlsx) file with all experimental data files and the sample description of all these files. The readMSData will not check for any potential mis-alignments between sample names and mzML files - it's the responsibility of the user to ensure that this is correct. By having the mzML files also defined in the file table and then using these with readMSData(paste0(path, "/", pd$file_name), pdata = new("NAnnotatedDataFrame", pd)) as I suggested above you can avoid this problem.

I will update the tutorial and the vignette to clarify this also there.

gmhhope commented 3 years ago

Thanks Johannes,

I have been busy last week so I didn't get chances to reply. I think it is pretty clear now and you can close this issue if you want.

Thanks again for the help!

Best, Minghao Gong

sneumann / xcms

labeling of samples in PCA function from xcms tutorial #553