gmhhope closed 3 years ago
Hm, indeed that does not look good. Can you also check whether basename(fileNames(xdata)) matches what you get from pData(xdata)? Each row in pData(xdata) is supposed to describe the sample in the corresponding file in fileNames(xdata).
> pData(xdata)
sample_name sample_group
1 HEU_HUU_082_001 HUU
2 HEU_HUU_166_001 HUU
3 HEU_HEU_234_001 HEU
4 HEU_HEU_210_001 HEU
5 HEU_HUU_169_001 HUU
6 HEU_HEU_223_001 HEU
7 HEU_HEU_188_001 HEU
8 HEU_HEU_229_001 HEU
9 HEU_HEU_190_001 HEU
10 HEU_HUU_164_001 HUU
11 HEU_HUU_152_001 HUU
12 HEU_HUU_161_001 HUU
13 HEU_HUU_167_001 HUU
14 HEU_HEU_078_001 HEU
15 HEU_HUU_149_001 HUU
16 HEU_HUU_081_001 HUU
17 HEU_HUU_162_001 HUU
18 HEU_HEU_075_001 HEU
19 HEU_HEU_203_001 HEU
20 HEU_HEU_232_001 HEU
21 HEU_HUU_080_001 HUU
22 HEU_HEU_231_001 HEU
23 HEU_HUU_154_001 HUU
24 HEU_HUU_153_001 HUU
25 HEU_HEU_195_001 HEU
26 HEU_HEU_222_001 HEU
27 HEU_HUU_168_001 HUU
28 HEU_HEU_194_001 HEU
29 HEU_HUU_155_001 HUU
30 HEU_HEU_189_001 HEU
31 HEU_HEU_191_001 HEU
32 HEU_HUU_170_001 HUU
33 HEU_HUU_160_001 HUU
34 HEU_HEU_235_001 HEU
35 HEU_HUU_158_001 HUU
36 HEU_HUU_165_001 HUU
37 HEU_HUU_171_001 HUU
38 HEU_HEU_079_001 HEU
39 HEU_HEU_067_001 HEU
40 HEU_HEU_226_001 HEU
> basename(fileNames(xdata))
[1] "HEU_HEU_067_001.mzML" "HEU_HEU_075_001.mzML" "HEU_HEU_078_001.mzML" "HEU_HEU_079_001.mzML" "HEU_HEU_188_001.mzML" "HEU_HEU_189_001.mzML"
[7] "HEU_HEU_190_001.mzML" "HEU_HEU_191_001.mzML" "HEU_HEU_194_001.mzML" "HEU_HEU_195_001.mzML" "HEU_HEU_203_001.mzML" "HEU_HEU_210_001.mzML"
[13] "HEU_HEU_222_001.mzML" "HEU_HEU_223_001.mzML" "HEU_HEU_226_001.mzML" "HEU_HEU_229_001.mzML" "HEU_HEU_231_001.mzML" "HEU_HEU_232_001.mzML"
[19] "HEU_HEU_234_001.mzML" "HEU_HEU_235_001.mzML" "HEU_HUU_080_001.mzML" "HEU_HUU_081_001.mzML" "HEU_HUU_082_001.mzML" "HEU_HUU_149_001.mzML"
[25] "HEU_HUU_152_001.mzML" "HEU_HUU_153_001.mzML" "HEU_HUU_154_001.mzML" "HEU_HUU_155_001.mzML" "HEU_HUU_158_001.mzML" "HEU_HUU_160_001.mzML"
[31] "HEU_HUU_161_001.mzML" "HEU_HUU_162_001.mzML" "HEU_HUU_164_001.mzML" "HEU_HUU_165_001.mzML" "HEU_HUU_166_001.mzML" "HEU_HUU_167_001.mzML"
[37] "HEU_HUU_168_001.mzML" "HEU_HUU_169_001.mzML" "HEU_HUU_170_001.mzML" "HEU_HUU_171_001.mzML"
@jorainer It looks like they don't match. Please help me further so that I can avoid any mistakes on my side. I recently tried to wrap up some of the data for collaborators.
Thanks, Minghao
I tried to avoid the mislabeling in the PCA by rearranging the columns of the table to match the order of xdata$sample_name. However, I am concerned that the colnames of ft_ints might themselves be wrong; at this stage I have fully trusted the quantify function. Thanks very much for your help!
> ft_ints <- log2(assay(res, "raw"))
> colnames(ft_ints)
[1] "HEU_HEU_067_001.mzML" "HEU_HEU_075_001.mzML" "HEU_HEU_078_001.mzML" "HEU_HEU_079_001.mzML" "HEU_HEU_188_001.mzML" "HEU_HEU_189_001.mzML"
[7] "HEU_HEU_190_001.mzML" "HEU_HEU_191_001.mzML" "HEU_HEU_194_001.mzML" "HEU_HEU_195_001.mzML" "HEU_HEU_203_001.mzML" "HEU_HEU_210_001.mzML"
[13] "HEU_HEU_222_001.mzML" "HEU_HEU_223_001.mzML" "HEU_HEU_226_001.mzML" "HEU_HEU_229_001.mzML" "HEU_HEU_231_001.mzML" "HEU_HEU_232_001.mzML"
[19] "HEU_HEU_234_001.mzML" "HEU_HEU_235_001.mzML" "HEU_HUU_080_001.mzML" "HEU_HUU_081_001.mzML" "HEU_HUU_082_001.mzML" "HEU_HUU_149_001.mzML"
[25] "HEU_HUU_152_001.mzML" "HEU_HUU_153_001.mzML" "HEU_HUU_154_001.mzML" "HEU_HUU_155_001.mzML" "HEU_HUU_158_001.mzML" "HEU_HUU_160_001.mzML"
[31] "HEU_HUU_161_001.mzML" "HEU_HUU_162_001.mzML" "HEU_HUU_164_001.mzML" "HEU_HUU_165_001.mzML" "HEU_HUU_166_001.mzML" "HEU_HUU_167_001.mzML"
[37] "HEU_HUU_168_001.mzML" "HEU_HUU_169_001.mzML" "HEU_HUU_170_001.mzML" "HEU_HUU_171_001.mzML"
> ft_ints <- ft_ints[,paste(xdata$sample_name,".mzML", sep = "")]
> colnames(ft_ints)
[1] "HEU_HUU_082_001.mzML" "HEU_HUU_166_001.mzML" "HEU_HEU_234_001.mzML" "HEU_HEU_210_001.mzML" "HEU_HUU_169_001.mzML" "HEU_HEU_223_001.mzML"
[7] "HEU_HEU_188_001.mzML" "HEU_HEU_229_001.mzML" "HEU_HEU_190_001.mzML" "HEU_HUU_164_001.mzML" "HEU_HUU_152_001.mzML" "HEU_HUU_161_001.mzML"
[13] "HEU_HUU_167_001.mzML" "HEU_HEU_078_001.mzML" "HEU_HUU_149_001.mzML" "HEU_HUU_081_001.mzML" "HEU_HUU_162_001.mzML" "HEU_HEU_075_001.mzML"
[19] "HEU_HEU_203_001.mzML" "HEU_HEU_232_001.mzML" "HEU_HUU_080_001.mzML" "HEU_HEU_231_001.mzML" "HEU_HUU_154_001.mzML" "HEU_HUU_153_001.mzML"
[25] "HEU_HEU_195_001.mzML" "HEU_HEU_222_001.mzML" "HEU_HUU_168_001.mzML" "HEU_HEU_194_001.mzML" "HEU_HUU_155_001.mzML" "HEU_HEU_189_001.mzML"
[31] "HEU_HEU_191_001.mzML" "HEU_HUU_170_001.mzML" "HEU_HUU_160_001.mzML" "HEU_HEU_235_001.mzML" "HEU_HUU_158_001.mzML" "HEU_HUU_165_001.mzML"
[37] "HEU_HUU_171_001.mzML" "HEU_HEU_079_001.mzML" "HEU_HEU_067_001.mzML" "HEU_HEU_226_001.mzML"
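The reordering by column name shown in the session above can be made a little more defensive. The following is only a sketch on a toy matrix: ft_ints and sample_name here are small stand-ins for the real assay matrix and xdata$sample_name, not the actual data. It fails early if a sample name is missing or duplicated instead of silently producing a mislabeled table.

```r
# Toy stand-in for ft_ints: 2 features x 3 samples, columns in the
# alphabetical file order.
ft_ints <- matrix(1:6, nrow = 2,
                  dimnames = list(c("FT1", "FT2"),
                                  c("HEU_HEU_067_001.mzML",
                                    "HEU_HEU_234_001.mzML",
                                    "HEU_HUU_082_001.mzML")))

# Toy stand-in for xdata$sample_name, in the randomized run order.
sample_name <- c("HEU_HUU_082_001", "HEU_HEU_234_001", "HEU_HEU_067_001")

wanted <- paste0(sample_name, ".mzML")

# Stop if a name is duplicated or has no matching column.
stopifnot(!anyDuplicated(wanted), setequal(wanted, colnames(ft_ints)))

# Subsetting by name moves each column together with its label.
ft_ints <- ft_ints[, wanted, drop = FALSE]
```

Because the columns are selected by name, the intensities stay attached to the correct file name; only the column order changes.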
What I would suggest is that you define a data table (in Excel, as a csv file or whatever) that contains the names of the files of your experiment as well as additional information (like group etc.) on each of them. In your case it could have the 3 columns "file_name", "sample_name" and "sample_group".
You can then read this table in, either with read.table (or read.csv) if it's a text or csv file, or with read_xlsx from the readxl package if it's an xlsx file. Just be sure to use as.data.frame if you use read_xlsx, e.g. pd <- as.data.frame(read_xlsx(... your xlsx file...)).
The code could look similar to the one below:
path <- "...path to the mzML files..."
pd <- read.table("... your file table...", header = TRUE)
data <- readMSData(paste0(path, "/", pd$file_name), pdata = new("NAnnotatedDataFrame", pd))
That way you can be sure that the sample descriptions fit the sample files. In your case it seems that there is a misalignment.
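As a rough illustration of the file-table idea, the sketch below builds a small, hypothetical table directly in R (in practice it would be read from a csv or xlsx file) and derives the file paths from it. The path value and the two example rows are placeholders, and the final check assumes that sample names equal the file names without the extension, as in this thread.

```r
# Hypothetical file table; normally read from a csv/xlsx file.
pd <- data.frame(
    file_name = c("HEU_HUU_082_001.mzML", "HEU_HEU_234_001.mzML"),
    sample_name = c("HEU_HUU_082_001", "HEU_HEU_234_001"),
    sample_group = c("HUU", "HEU"),
    stringsAsFactors = FALSE
)

path <- "path/to/mzML"  # placeholder directory

# File paths are derived from the table, so files[i] belongs to row i of pd.
files <- file.path(path, pd$file_name)

# Optional sanity check (assumes sample_name == file name without extension).
stopifnot(identical(tools::file_path_sans_ext(basename(files)),
                    pd$sample_name))

# With MSnbase loaded, the data could then be read with the call shown above:
# raw_data <- readMSData(files, pdata = new("NAnnotatedDataFrame", pd),
#                        mode = "onDisk")
```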
I think I follow what you did in the tutorial like this:
folder_path = "Path/to/my/mzML/folder/"
mzMLs <- list.files(folder_path, full.names=TRUE)
print(mzMLs[1:6])
[1] "/../HILICpos_exp//HEU_HEU_067_001.mzML"
[2] "/../HILICpos_exp//HEU_HEU_075_001.mzML"
[3] "/../HILICpos_exp//HEU_HEU_078_001.mzML"
[4] "/../HILICpos_exp//HEU_HEU_079_001.mzML"
[5] "/../HILICpos_exp//HEU_HEU_188_001.mzML"
[6] "/../HILICpos_exp//HEU_HEU_189_001.mzML"
Here, by using list.files, the files are returned in alphabetical order.
metadata <- read.csv("Path/to/my/sequence/csv/file", stringsAsFactors = FALSE)
colnames(metadata)
pd = metadata[,c("Sample.ID","group")]
colnames(pd) = c("sample_name","sample_group")
head(pd)
sample_name sample_group
1 HEU_HUU_082_001 HUU
2 HEU_HUU_166_001 HUU
3 HEU_HEU_234_001 HEU
4 HEU_HEU_210_001 HEU
5 HEU_HUU_169_001 HUU
6 HEU_HEU_223_001 HEU
raw_data <- readMSData(files = mzMLs, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
However, the pdata definitely follows the order of the metadata sequence file. Will the two be matched inside the program by file name? As you can see, the sample_name in pd matches the file names (excluding the string .mzML) in the mzML folder.
Please let me know what you think as soon as possible. It is crucial to make sure I don't end up with wrongly labelled feature tables. That is very important for me. Thanks a lot for your continuous assistance! I very much appreciate it!
The only difference I observe here is that you have the file extension (e.g., "HEU_HEU_067_001.mzML") included in pd, whereas I didn't include the file extension in my pd ("HEU_HEU_067_001"). However, I didn't receive any warnings or errors about failing to align pd with the files when I ran the following line.
raw_data <- readMSData(files = mzMLs, pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk")
That can be very misleading if that is where the problem lies. Let me know. Thanks very much!
Best, Minghao
With list.files(folder_path, full.names=TRUE) you get all mzML files in alphabetical order. In your pd the samples are not in alphabetical order. It is common practice to have a (tab delimited text, csv or xlsx) file with all experimental data files and the sample description of all these files. The readMSData function will not check for any potential misalignments between sample names and mzML files - it's the responsibility of the user to ensure that this is correct. By having the mzML files also defined in the file table and then using these with readMSData(paste0(path, "/", pd$file_name), pdata = new("NAnnotatedDataFrame", pd)) as I suggested above you can avoid this problem.
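Since no check is done on import, a manual check before calling readMSData can catch this kind of misalignment. The sketch below is an alternative to the file-table approach, not a replacement for it: it reorders an existing pd to the order of the file list using base R match(). The mzMLs and pd objects are toy stand-ins, and it assumes that sample_name equals the file name without the .mzML extension.

```r
# Toy stand-in for the output of list.files() (alphabetical order).
mzMLs <- c("/data/HEU_HEU_067_001.mzML",
           "/data/HEU_HEU_234_001.mzML",
           "/data/HEU_HUU_082_001.mzML")

# Toy stand-in for pd, in the order of the sequence file.
pd <- data.frame(
    sample_name = c("HEU_HUU_082_001", "HEU_HEU_234_001", "HEU_HEU_067_001"),
    sample_group = c("HUU", "HEU", "HEU"),
    stringsAsFactors = FALSE
)

# Sample identifiers derived from the file names.
file_ids <- tools::file_path_sans_ext(basename(mzMLs))

# For each file, find the row of pd that describes it.
idx <- match(file_ids, pd$sample_name)

# Every file needs exactly one row, and every row exactly one file.
stopifnot(!anyNA(idx), !anyDuplicated(idx), length(idx) == nrow(pd))

# Reorder pd so that row i describes mzMLs[i].
pd <- pd[idx, , drop = FALSE]
rownames(pd) <- NULL

stopifnot(identical(pd$sample_name, file_ids))
```

After this step, pd could be passed to readMSData together with mzMLs, since both are in the same order.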
I will update the tutorial and the vignette to clarify this also there.
Thanks Johannes,
I was busy last week, so I didn't get a chance to reply. I think it is pretty clear now and you can close this issue if you want.
Thanks again for the help!
Best, Minghao Gong
Hi Johannes,
I observed something that I am really concerned about in the PCA function (https://www.bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html). I sincerely hope you can help me address it.
In xdata, the samples are in randomized order, as you can see here:
However, when I run either quantify(xdata, value = "into") or featureValues(xdata, value = "into"), the colnames of the resulting data frame are:
So, as you can see, the ordering is different. The PCA script uses this data frame for the PCA, but the labeling inherits the sample_group order from the XCMS object xdata, whereas the columns of the data frame have been reordered alphabetically?!
I don't know if I have done something wrong, or maybe the colnames of the dataframe are wrong? Please help me with this.
Thanks very much! I do appreciate all your assistance!
Best regards, Minghao Gong