Improper header in pheno dataframe with single column pheno file

fdchevalier commented 11 months ago

Dear Karl,

First, thank you very much for creating and maintaining this very useful package.

I came across an unexpected behavior when using the read.cross() function with a phenotype file that has a single column. Although the column has an "id" header, this header is missing from the pheno data frame after the cross object is created. This does not happen when the phenotype file has 2 columns.

Here is a reproducible example:

library(qtl)

# Mock data
phe <- structure(list(id = c("F2A_1", "F2A_10", "F2A_100", "F2A_101", "F2A_102", "F2A_103")), row.names = c(NA, 6L), class = "data.frame")
gen <- structure(list(id = c("", "", "F2A_1", "F2A_10", "F2A_100", "F2A_101",
            "F2A_102", "F2A_103"), X1 = c("1", "0.781214",
            "LL", "HH", "LL", "HL", "HH", "HL"), X2 = c("1",
            "0.981928", "LL", "HH", "LL", "HL", "HH", "HL"), X3 = c("1",
            "1.060362", "LL", "HH", "LL", "HL", "HH", "HL"), X4 = c("1",
            "1.201365", "LL", "HH", "LL", "HL", "HH", "HL"), X5 = c("1",
            "1.220872", "LL", "HH", "LL", "HL", "HH", "HL")), row.names = c(NA,
            8L), class = "data.frame")

# Write mock data into files
write.table(phe, "phe.csv", row.names = FALSE, quote = FALSE, sep = ",")
write.table(cbind(phe, phe), "phe2.csv", row.names = FALSE, quote = FALSE, sep = ",")
write.table(gen, "gen.csvs", row.names = FALSE, quote = FALSE, sep = ",")

# Create a cross object with a single-column phenotype file
cross1 <- read.cross("csvs", genfile = "gen.csvs", phefile = "phe.csv", estimate.map = FALSE, genotypes = c("LL", "HL", "HH"), alleles = c("L", "H"))
colnames(cross1$pheno)

# Create a cross object with a two-column phenotype file
cross2 <- read.cross("csvs", genfile = "gen.csvs", phefile = "phe2.csv", estimate.map = FALSE, genotypes = c("LL", "HL", "HH"), alleles = c("L", "H"))
colnames(cross2$pheno)

This prevents getid() from working as expected.

Here are my environment details:

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /xxxx/miniconda3/envs/gen_map/lib/libopenblasp-r0.3.24.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] qtl_1.60       magrittr_2.0.3

loaded via a namespace (and not attached):
[1] compiler_4.2.3 parallel_4.2.3 tools_4.2.3

Please let me know if you need more details.

Fred

kbroman commented 11 months ago

Thanks! I really appreciate the excellent example.

The problem is in lines 199-201 of read.cross.csvs.R:

colnames(pheno) <- unlist(pheno[1,])
pheno <- apply(pheno, 2, function(a) { a[!is.na(a) & a==""] <- NA; a })
pheno <- as.data.frame(pheno[-1,], stringsAsFactors=TRUE)

With a single-column data frame, apply() returns a one-column matrix, and the pheno[-1,] that follows drops it to a plain vector, so the column name is lost.
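
A minimal base-R sketch of where the name goes missing. It assumes the phenotype file is read without a header, so the first row holds the column names; the small data frame is a hypothetical stand-in for what read.cross() builds internally. In this sketch the name is still there right after apply() and disappears at the pheno[-1,] step, because removing a row from a one-column matrix collapses it to a vector:

```r
# Hypothetical stand-in for a single-column phenotype file read without a header
pheno <- data.frame(V1 = c("id", "F2A_1", "F2A_10", ""), stringsAsFactors = FALSE)

colnames(pheno) <- unlist(pheno[1, ])

# apply() returns a character matrix; with one input column it is 4 x 1
pheno_mat <- apply(pheno, 2, function(a) { a[!is.na(a) & a == ""] <- NA; a })
colnames(pheno_mat)   # the "id" column name is still attached here

# Removing the first row of a one-column matrix returns a plain vector,
# so the column name is gone before as.data.frame() is called
dropped <- pheno_mat[-1, ]
is.null(dim(dropped))

broken <- as.data.frame(dropped, stringsAsFactors = TRUE)
colnames(broken)      # no longer "id"
```

With two or more columns, pheno[-1,] stays a matrix and keeps its column names, which would explain why only the single-column file is affected.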

fdchevalier commented 11 months ago

I am glad the example helped.

So, a simple fix could be storing the column names and setting them after the data frame is created. Something like:

pnames <- unlist(pheno[1,])
pheno <- apply(pheno, 2, function(a) { a[!is.na(a) & a==""] <- NA; a })
pheno <- as.data.frame(pheno[-1,], stringsAsFactors=TRUE)
colnames(pheno) <- pnames
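
As a self-contained sketch of this idea, the snippet below wraps the steps in a helper and checks it on one- and two-column inputs. The helper name clean_pheno and the test data frames are hypothetical and not part of R/qtl, and the use of drop = FALSE is an extra safeguard added here, not something taken from the package; the fix actually committed may look different.

```r
# Hypothetical helper, not part of R/qtl: cleans a phenotype table whose
# first row holds the column names and restores those names at the end
clean_pheno <- function(pheno) {
  pnames <- as.character(unlist(pheno[1, ]))
  # Replace empty strings with NA, column by column
  pheno <- apply(pheno, 2, function(a) { a[!is.na(a) & a == ""] <- NA; a })
  # drop = FALSE keeps a one-column matrix from collapsing to a vector
  pheno <- as.data.frame(pheno[-1, , drop = FALSE], stringsAsFactors = TRUE)
  colnames(pheno) <- pnames
  pheno
}

one_col <- data.frame(V1 = c("id", "F2A_1", "F2A_10"), stringsAsFactors = FALSE)
two_col <- data.frame(V1 = c("id", "F2A_1", "F2A_10"),
                      V2 = c("id", "F2A_1", "F2A_10"),
                      stringsAsFactors = FALSE)

colnames(clean_pheno(one_col))
colnames(clean_pheno(two_col))
```

Either change alone (restoring the stored names, or subsetting with drop = FALSE) should keep the "id" header in the single-column case; the sketch uses both.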

Happy to send a PR your way if you would like.

kbroman commented 11 months ago

@fdchevalier I've got it fixed; thanks!

kbroman / qtl

Improper header in pheno dataframe with single column pheno file #103