jtlovell / GENESPACE

Reading .bed files fails if the .bed file contains more than four columns. #134

Closed mankiddyman closed 7 months ago

mankiddyman commented 8 months ago

In section 3.1 of the readme.md it is stated that:

**3.1 GENESPACE-readable annotation format**

**For each genome, GENESPACE needs:**

**bed formatted coordinates of each gene (chr, start, end, name), other fields are allowed, but will be ignored by GENESPACE**

This suggests that when a .bed file with more than four columns is supplied, the extra columns will not affect how the file is read, as long as the ID column matches the .fasta headers.

I used such a file, C_australis_wide.bed, and received the following error when trying to read it with init_genespace(): "$ operator is invalid for atomic vectors". Following the call stack, I reached the read_bed() function in Utils.R:

read_bed <- function(filepath){
  chk <- tryCatch(
    {
      suppressWarnings(suppressMessages(fread(
        filepath, verbose = FALSE, showProgress = FALSE, select = 1:4,
        colClasses = c("character", "numeric", "numeric", "character"),
        header = FALSE, col.names = c("chr", "start", "end", "id"))))
    },
    error = function(err) {
      return(NA)
    }
  )
  if(!is.data.table(chk))
    chk <- subset(chk, complete.cases(chk))

  return(chk)
}

I generated a narrow copy of the .bed file in question, containing only the first four columns, using the following command:

cut -f 1-4 C_australis_wide.bed > C_australis_narrow.bed

I then tested the behaviour of read_bed() on both files.

When run on the wide file, read_bed() returns logical(0), which causes the subsequent error in init_genespace(). Using the narrow file fixes the error.
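The logical(0) return value and the resulting error can be reproduced without any .bed file. This is a minimal sketch in base R. It assumes that fread() raised an error on the wide file, so that the tryCatch() handler in read_bed() set chk to NA:

```r
# Assumption: fread() failed, so the error handler returned NA.
chk <- NA

# NA is not a data.table, so read_bed() takes the subset() branch.
# complete.cases(NA) is FALSE, so no element is kept.
res <- subset(chk, complete.cases(chk))
print(res)  # logical(0)

# Downstream code that treats the result as a table then fails on `$`.
msg <- tryCatch(res$chr, error = function(e) conditionMessage(e))
print(msg)  # "$ operator is invalid for atomic vectors"
```

This is why the message that reaches the user says nothing about the .bed file: the original fread() error is discarded inside tryCatch().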

I believe the documentation should be updated to reflect that GENESPACE cannot select the required .bed columns by itself. In addition, the suppressWarnings() and suppressMessages() calls in read_bed() should be removed so that the user is told what the problem is, rather than having to follow the call stack to discover such a trivial cause. I have attached the files for replication purposes.
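For illustration, a reader that tolerates extra columns and reports problems directly might look like the sketch below. The function name read_bed_first4() is hypothetical and not part of GENESPACE. The sketch assumes a tab-separated file without a header line and that the data.table package is installed:

```r
library(data.table)

# Hypothetical alternative to read_bed(); not part of GENESPACE.
read_bed_first4 <- function(filepath) {
  if (!file.exists(filepath)) {
    stop("bed file not found: ", filepath)
  }
  # Read every column; warnings and errors are not suppressed here.
  bed <- fread(filepath, header = FALSE, sep = "\t",
               verbose = FALSE, showProgress = FALSE)
  if (ncol(bed) < 4) {
    stop("expected at least 4 columns (chr, start, end, id) in ",
         filepath, " but found ", ncol(bed))
  }
  # Keep the first four columns and ignore any others.
  bed <- bed[, 1:4, with = FALSE]
  setnames(bed, c("chr", "start", "end", "id"))
  # Coerce to the column types that read_bed() requests.
  bed[, chr := as.character(chr)]
  bed[, start := as.numeric(start)]
  bed[, end := as.numeric(end)]
  bed[, id := as.character(id)]
  # Drop rows with missing values.
  bed[complete.cases(bed)]
}

# Usage with a small six-column file written to a temporary path.
tmp <- tempfile(fileext = ".bed")
writeLines(c("Chr1\t100\t500\tgeneA\t0\t+",
             "Chr1\t900\t1500\tgeneB\t0\t-"), tmp)
bed <- read_bed_first4(tmp)
print(bed)
```

Reading all columns and subsetting afterwards avoids having to match colClasses and col.names to the number of columns in the file, at the cost of parsing fields that are then discarded.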

GENESPACE_bug_report.zip

jtlovell commented 7 months ago

Thanks for this. I will update the documentation. But keep in mind, this is not a generalizable function (yet), and is ad hoc for the format specified in the readme (4-column bed).