jtlovell / GENESPACE

Bed file additional field error #108

Closed SwiftSeal closed 1 year ago

SwiftSeal commented 1 year ago

Heya,

I'm running the latest devtools install of GENESPACE and ran into an unexpected error. I prefer to build the bed and peptide directories manually with other scripts. In this case the bed files were produced by https://agat.readthedocs.io/en/latest/gff_to_bed.html, which writes 12-column BED. A typical entry looks like:

chr01   69219140        69220392        chr01_000440.1  0       +       69219432        69220118        255,0,0 3       310,231,529     0,404,723

Running `gpar <- init_genespace(wd = getwd(), path2mcscanx = "/blah/blah/MCScanX")` resulted in the following error:

Checking Working Directory ... PASS: `/blah/blah/blah/`
Checking user-defined parameters ...
        Genome IDs & ploidy ... 
                chilense        : 1
                chmielewskii    : 1
                corneliomulleri : 1
                galapagense     : 1
                habrochaites    : 1
                lycopersicoides : 1
                lycopersicum    : 1
                neorickii       : 1
                peruvianum      : 1
                pimpinellifolium: 1
        Outgroup ... NONE
        n. parallel processes ... 16
        collinear block size ... 5
        collinear block search radius ... 25
        n gaps in collinear block ... 5
        synteny buffer size... 100
        only orthogroups hits as anchors ... TRUE
        n secondary hits ... 0
Checking annotation files (.bed and peptide .fa):
Error in bd$id : $ operator is invalid for atomic vectors

I narrowed the issue down to the `read_bed` function: https://github.com/jtlovell/GENESPACE/blob/1d7b7274bdb142f321b3c46f3af130084b65fa7f/R/utils.R#L297-L313. It seems to be linked to this known issue with combining `colClasses` and `select` in `fread`: https://stackoverflow.com/questions/25691637/using-colclasses-and-select-arguments-of-fread-simultaneously
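For anyone puzzled by why the message is about `$` rather than about `fread`, here is a minimal base-R sketch of one way this can happen. It assumes the reader swallows its own error and returns a placeholder value; `read_quietly` is a hypothetical stand-in, not the actual GENESPACE code.

```r
# Hypothetical reader wrapper (NOT GENESPACE's read_bed): any error raised by
# the reader is discarded and a bare NA is returned in its place.
read_quietly <- function(reader) {
  tryCatch(reader(), error = function(e) NA)
}

# Simulate a reader that fails, e.g. because of a colClasses length mismatch.
bd <- read_quietly(function() stop("colClasses= is an unnamed vector of types"))

# bd is now an atomic NA rather than a table, and the original message is gone.
print(is.atomic(bd))

# Asking for a column with `$` then raises the error seen in the log above.
msg <- tryCatch(bd$id, error = function(e) conditionMessage(e))
print(msg)  # "$ operator is invalid for atomic vectors"
```

If this is roughly what happens, the informative `fread` message never reaches the user, which would match the behaviour described in this report.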

Testing `bd <- fread("morethan4columns.bed", select = 1:4, colClasses = c("character", "numeric", "numeric", "character"), header = FALSE, col.names = c("chr", "start", "end", "id"))` results in:

Error in fread("morethan4columns.bed",  : 
  colClasses= is an unnamed vector of types, length 4, but there are 12 columns in the input. To specify types for a subset of columns, you can use a named vector, list format, or specify types using select= instead of colClasses=. Please see examples in ?fread.

Removing `colClasses` or switching to 4-column bed files fixes this. I've opted to just fix my external bed processing, but I'm flagging it in case anyone else runs into this, since `read_bed` seems to fail silently: the `fread` error never surfaces and only the `bd$id` error is shown. The README also says that additional fields should be ignored.
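As a sketch of the first workaround (dropping `colClasses`), the snippet below reads only the first four columns of a 12-column BED file and coerces the types afterwards. It assumes a tab-separated file with no header; the temporary file is only there to make the example self-contained, and this is not a patch to GENESPACE itself.

```r
library(data.table)

# Write the 12-column entry quoted earlier in this thread to a temporary file.
bed12 <- tempfile(fileext = ".bed")
writeLines(paste(
  "chr01", "69219140", "69220392", "chr01_000440.1", "0", "+",
  "69219432", "69220118", "255,0,0", "3", "310,231,529", "0,404,723",
  sep = "\t"), bed12)

# Select and name the first four columns, without an unnamed colClasses vector.
# sep is given explicitly because some BED12 fields contain commas.
bd <- fread(bed12, sep = "\t", select = 1:4, header = FALSE,
            col.names = c("chr", "start", "end", "id"))

# Coerce column types after reading rather than through colClasses.
bd[, chr := as.character(chr)]
bd[, start := as.numeric(start)]
bd[, end := as.numeric(end)]
bd[, id := as.character(id)]

print(bd)
```

Coercing after the read sidesteps the length check on `colClasses` entirely, at the cost of letting `fread` guess the types first.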

Cheers! Fantastic software otherwise :-)

jtlovell commented 1 year ago

Ok. Good sleuthing. So, does it run through if you strip out the other columns so that only the first 4 remain? The GENESPACE input specification is just a 4-column bed file.
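Stripping the extra columns can be done in R before handing the files to GENESPACE. The sketch below assumes tab-separated, header-less BED input; `trim_bed` and the file paths are hypothetical names used only for illustration.

```r
library(data.table)

# Hypothetical helper: keep only the first four columns (chr, start, end, id)
# of a tab-separated BED file and write them to a new file.
trim_bed <- function(infile, outfile) {
  bd <- fread(infile, sep = "\t", select = 1:4, header = FALSE)
  fwrite(bd, outfile, sep = "\t", col.names = FALSE, quote = FALSE)
  invisible(outfile)
}

# Demonstration on a temporary 12-column file holding the entry from this thread.
bed12 <- tempfile(fileext = ".bed")
bed4 <- tempfile(fileext = ".bed")
writeLines(paste(
  "chr01", "69219140", "69220392", "chr01_000440.1", "0", "+",
  "69219432", "69220118", "255,0,0", "3", "310,231,529", "0,404,723",
  sep = "\t"), bed12)

trim_bed(bed12, bed4)
trimmed <- readLines(bed4)
print(trimmed)
```

Writing to a separate output file keeps the original 12-column annotation intact for other tools.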

SwiftSeal commented 1 year ago

Aye, it runs fine with 4 columns! I raised it just in case anyone else encounters a similar issue, as the README suggests additional columns are ignored: https://github.com/jtlovell/GENESPACE/blob/893f9022a76b6d8ef99e0b5e43f4d6b0fdf0e200/readme.md?plain=1#L146