kbroman / qtl

R/qtl: A QTL mapping environment
https://rqtl.org
GNU General Public License v3.0

Improper header in pheno dataframe with single column pheno file #103

Closed fdchevalier closed 11 months ago

fdchevalier commented 11 months ago

Dear Karl,

First, thank you very much for having made and maintaining this very useful package.

I came across unexpected behavior when using the read.cross() function with a phenotype file that has a single column. Although the column has an "id" header, that header is missing from the pheno data frame once the cross object is created. This does not happen when the phenotype file has two columns.

Here is a reproducible example:

# Load R/qtl, which provides read.cross() and getid()
library(qtl)

# Mock data
phe <- structure(list(id = c("F2A_1", "F2A_10", "F2A_100", "F2A_101", "F2A_102", "F2A_103")), row.names = c(NA, 6L), class = "data.frame")
gen <- structure(list(id = c("", "", "F2A_1", "F2A_10", "F2A_100", "F2A_101",
            "F2A_102", "F2A_103"), X1 = c("1", "0.781214",
            "LL", "HH", "LL", "HL", "HH", "HL"), X2 = c("1",
            "0.981928", "LL", "HH", "LL", "HL", "HH", "HL"), X3 = c("1",
            "1.060362", "LL", "HH", "LL", "HL", "HH", "HL"), X4 = c("1",
            "1.201365", "LL", "HH", "LL", "HL", "HH", "HL"), X5 = c("1",
            "1.220872", "LL", "HH", "LL", "HL", "HH", "HL")), row.names = c(NA,
            8L), class = "data.frame")

# Write mock data into files
write.table(phe, "phe.csv", row.names = FALSE, quote = FALSE, sep = ",")
write.table(cbind(phe, phe), "phe2.csv", row.names = FALSE, quote = FALSE, sep = ",")
write.table(gen, "gen.csvs", row.names = FALSE, quote = FALSE, sep = ",")

# Create a cross object with a single-column phenotype file
cross1 <- read.cross("csvs", genfile = "gen.csvs", phefile = "phe.csv", estimate.map = FALSE, genotypes = c("LL", "HL", "HH"), alleles = c("L", "H"))
colnames(cross1$pheno)

# Create a cross object with a two-column phenotype file
cross2 <- read.cross("csvs", genfile = "gen.csvs", phefile = "phe2.csv", estimate.map = FALSE, genotypes = c("LL", "HL", "HH"), alleles = c("L", "H"))
colnames(cross2$pheno)

This prevents getid() from working as expected.

Here are my environment details:

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /xxxx/miniconda3/envs/gen_map/lib/libopenblasp-r0.3.24.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] qtl_1.60       magrittr_2.0.3

loaded via a namespace (and not attached):
[1] compiler_4.2.3 parallel_4.2.3 tools_4.2.3

Please let me know if you need more details.

Fred

kbroman commented 11 months ago

Thanks! I really appreciate the excellent example.

The problem is in lines 199-201 of read.cross.csvs.R:

colnames(pheno) <- unlist(pheno[1,])
pheno <- apply(pheno, 2, function(a) { a[!is.na(a) & a==""] <- NA; a })
pheno <- as.data.frame(pheno[-1,], stringsAsFactors=TRUE)

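With a single-column data frame, this sequence loses the column names: apply() returns a one-column matrix, and pheno[-1,] then drops it to a plain vector before as.data.frame() is called.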
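For readers who want to see where the name goes, here is a minimal base-R sketch of those three lines applied to a one-column table. It does not call R/qtl. The toy data frame and the intermediate names `m` and `bad` are made up for illustration; the original code reuses `pheno` throughout.

```r
# Toy stand-in for the raw phenotype table: header text in row 1, IDs below.
# (Assumed layout for illustration; not read from an actual file.)
pheno <- data.frame(V1 = c("id", "F2A_1", "F2A_10", "F2A_100"),
                    stringsAsFactors = FALSE)

# Step 1: promote the first row to column names.
colnames(pheno) <- unlist(pheno[1, ])

# Step 2: turn empty strings into NA, column by column.
# apply() coerces the data frame to a matrix and returns a 4 x 1 matrix
# that still carries the "id" column name.
m <- apply(pheno, 2, function(a) { a[!is.na(a) & a == ""] <- NA; a })
is.matrix(m)
colnames(m)

# Step 3: drop the header row. On a one-column matrix, `[-1, ]` uses the
# default drop = TRUE and returns a plain vector, so the column name is gone
# before as.data.frame() ever sees it.
is.null(dim(m[-1, ]))

bad <- as.data.frame(m[-1, ], stringsAsFactors = TRUE)
colnames(bad)   # no longer "id"
```

With two or more columns, `m[-1, ]` stays a matrix, which would explain why the two-column phenotype file is unaffected.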

fdchevalier commented 11 months ago

I am glad the example helped.

So, a simple fix could be to store the column names first and set them again after the data frame is created. Something like:

pnames <- unlist(pheno[1,])
pheno <- apply(pheno, 2, function(a) { a[!is.na(a) & a==""] <- NA; a })
pheno <- as.data.frame(pheno[-1,], stringsAsFactors=TRUE)
colnames(pheno) <- pnames
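The reordering above can be checked outside R/qtl with a small base-R sketch. It also shows one possible alternative, keeping the matrix shape with `drop = FALSE`. The function names `fix_reassign`, `fix_nodrop`, and `blank_to_na`, the toy tables, and the "trait" column are hypothetical, and neither variant is claimed to be the change that ended up in the package.

```r
# Hypothetical helper: empty strings become NA (same logic as in the thread).
blank_to_na <- function(a) { a[!is.na(a) & a == ""] <- NA; a }

# Variant 1: save the names first and reassign them at the end.
fix_reassign <- function(pheno) {
  pnames <- unlist(pheno[1, ])
  pheno <- apply(pheno, 2, blank_to_na)
  pheno <- as.data.frame(pheno[-1, ], stringsAsFactors = TRUE)
  colnames(pheno) <- pnames
  pheno
}

# Variant 2 (an assumption, not taken from the thread): keep the matrix
# shape when removing the header row, so the names are never lost.
fix_nodrop <- function(pheno) {
  colnames(pheno) <- unlist(pheno[1, ])
  pheno <- apply(pheno, 2, blank_to_na)
  as.data.frame(pheno[-1, , drop = FALSE], stringsAsFactors = TRUE)
}

# Toy inputs: header text in row 1, data below.
one <- data.frame(V1 = c("id", "F2A_1", "F2A_10"),
                  stringsAsFactors = FALSE)
two <- data.frame(V1 = c("id", "F2A_1", "F2A_10"),
                  V2 = c("trait", "1.5", ""),
                  stringsAsFactors = FALSE)

colnames(fix_reassign(one))
colnames(fix_nodrop(one))
colnames(fix_reassign(two))
colnames(fix_nodrop(two))
```

Both variants should give the same column names for one-column and multi-column input; the second simply avoids the dimension drop instead of repairing the names afterwards.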

Happy to send a PR your way if you would like.

kbroman commented 11 months ago

@fdchevalier I've got it fixed; thanks!