slimsuite / chromsyn

Chromosome-level synteny plotting using orthologous regions
GNU General Public License v3.0

Busco v5 format is recognized as v3 #1

Open SaelinB opened 1 year ago

SaelinB commented 1 year ago

Hello, thanks for making this tool! I'm trying to run chromsyn but I get this error:

Rscript chromsyn.R busco=busco.fofn seqeuneces=sequences.fofn focus=scaffold16_size3654323

[Sun Jul 16 11:20:19 2023] #FOFN 12 filenames loaded from sequences.fofn
[Sun Jul 16 11:20:19 2023] #FOFN 12 filenames loaded from busco.fofn
[Sun Jul 16 11:20:19 2023] #FOFN 12 filenames after filtering to recognised genomes.
[Sun Jul 16 11:20:19 2023] Genomes (order=LIST): scaffold13_size4067469, scaffold14_size3830318, scaffold15_size3669401, scaffold16_size3654323, scaffold23_size2881292, scaffold27_size2720374, scaffold28_size2676831, scaffold2_size7801212, scaffold3_size7411349, scaffold5_size6470894, scaffold6_size5997242, scaffold7_size5813896
Joining with `by = join_by(Genome)`
[Sun Jul 16 11:20:19 2023] #GENOME  12 genomes: scaffold13_size4067469, scaffold14_size3830318, scaffold15_size3669401, scaffold16_size3654323, scaffold23_size2881292, scaffold27_size2720374, scaffold28_size2676831, scaffold2_size7801212, scaffold3_size7411349, scaffold5_size6470894, scaffold6_size5997242, scaffold7_size5813896
[Sun Jul 16 11:20:19 2023] scaffold13_size4067469...
[Sun Jul 16 11:20:19 2023] #SEQS 1 scaffold13_size4067469 sequences loaded from gendata/scaffold13_size4067469.telomeres.tdt
[Sun Jul 16 11:20:19 2023] #SEQS 1 scaffold13_size4067469 sequences meet minlen cutoff of 0 bp
[Sun Jul 16 11:20:19 2023] #BUSCOV BUSCO v3 format
Error in names(x) <- value : 
  'names' attribute [7] must be the same length as the vector [2]
Calls: buscoTable -> colnames<-
Execution halted

It seems to think the format is v3, but this is what my scaffold13_size4067469.busco5.tsv looks like:

# BUSCO version is: 5.4.2 
# The lineage dataset is: chlorophyta_odb10 (Creation date: 2020-08-05, number of genomes: 16, number of BUSCOs: 1519)
# Busco id      Status  Sequence        Gene Start      Gene End        Strand  Score   Length  OrthoDB url     Description
15at3041        Missing
42at3041        Missing
45at3041        Missing
52at3041        Missing
etc...

Any ideas on how to fix this?

SaelinB commented 1 year ago

I realized that the problem was the missing columns, and I fixed it by removing the "Missing" rows from the BUSCO file. However, now I get this issue:


[Sun Jul 16 14:57:09 2023] #BLOCK Generated 10946 synteny blocks.
[Sun Jul 16 14:57:10 2023] #BLOCK Reduced to 156 synteny blocks based on minregion=INT filtering.
`summarise()` has grouped output by 'Genome', 'HitGenome', 'SeqName', 'Hit'. You can override using the `.groups` argument.
`summarise()` has grouped output by 'Genome', 'HitGenome', 'SeqName'. You can override using the `.groups` argument.
[Sun Jul 16 14:57:10 2023] #FOCUS Focal genome for orientation: scaffold16_size3654323
[Sun Jul 16 14:57:11 2023] Warning: problem with missing seqname for scaffold28_size2676831
Error in if (is.na(seqname)) { : argument is of length zero
Calls: seqRev
Execution halted

I removed scaffold28_size2676831.fa from my analysis just to check, which worked further but then I get different errors about seq names:

[Sun Jul 16 15:00:54 2023] #SAVE All chromsyn data output to chromsyn.xlsx                                                                                                                                                                                     
1 plot(s)...Generating plot...(1) 10%
[Sun Jul 16 15:00:54 2023] Warning: problem with missing seqname for NA                                                                                                                                                   
[Sun Jul 16 15:00:54 2023] Warning: problem with missing seqname for NA                                                                                                                                                                                        
<simpleError in if (fwd) { pD <- data.frame(x = c(xa1, xa2, xb2, xb1), y = c(ya, ya, yb, yb)) if (settings$ypad) { pD <- data.frame(x = c(xa1, xa1, xa2, xa2, xb2, xb2, xb1, xb1), y = c(ya2, ya, ya, ya2, yb2, yb, yb, yb2)) } plt <- plt + geom_polygon(data = pD, mapping = aes(x = x, y = y), fill = "steelblue", color = NA, alpha = settings$opacity)} else { pD <- data.frame(x = c(xa1, xa2, xb1, xb2), y = c(ya, ya, yb, yb)) if (settings$ypad) { pD <- data.frame(x = c(xa1, xa1, xa2, xa2, xb1, xb1, xb2, xb2), y = c(ya2, ya, ya, ya2, yb2, yb, yb, yb2)) } plt <- plt + geom_polygon(data = pD, mapping = aes(x = x, y = y), fill = "indianred", color = NA, alpha = settings$opacity)}: missing value where TRUE/FALSE needed>
[Sun Jul 16 15:00:54 2023] #ERROR Error in if (fwd) {: missing value where TRUE/FALSE needed                                                                                                                                                                   
 => splitting plot into 2                                                                                                                                                                                                                                      
2 plot(s)...Generating plot...(1) 16.7%
[Sun Jul 16 15:00:54 2023] Warning: problem with missing seqname for NA                                                                                                                                                 
[Sun Jul 16 15:00:54 2023] Warning: problem with missing seqname for NA                                                                                                                                                                                        
<simpleError in if (fwd) { pD <- data.frame(x = c(xa1, xa2, xb2, xb1), y = c(ya, ya, yb, yb)) if (settings$ypad) { pD <- data.frame(x = c(xa1, xa1, xa2, xa2, xb2, xb2, xb1, xb1), y = c(ya2, ya, ya, ya2, yb2, yb, yb, yb2)) } plt <- plt + geom_polygon(data = pD, mapping = aes(x = x, y = y), fill = "steelblue", color = NA, alpha = settings$opacity)} else { pD <- data.frame(x = c(xa1, xa2, xb1, xb2), y = c(ya, ya, yb, yb)) if (settings$ypad) { pD <- data.frame(x = c(xa1, xa1, xa2, xa2, xb1, xb1, xb2, xb2), y = c(ya2, ya, ya, ya2, yb2, yb, yb, yb2)) } plt <- plt + geom_polygon(data = pD, mapping = aes(x = x, y = y), fill = "indianred", color = NA, alpha = settings$opacity)}: missing value where TRUE/FALSE needed>

This error keeps repeating. I am analyzing a single chromosome from different assemblies, so each FASTA file contains a single sequence.
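
For anyone hitting the first problem, the workaround mentioned at the top of this comment (dropping the rows whose Status is Missing) could be sketched roughly as below. This is only an illustration, not part of chromsyn: the function name `drop_missing_rows` is made up, the table is assumed to be tab-separated with the status in the second column as in the file shown earlier, and the `Complete` row in the sample is invented purely for the demo.

```python
import io


def drop_missing_rows(lines):
    """Yield comment/header lines and every row whose Status is not 'Missing'.

    Assumes a tab-separated BUSCO full table with the status in column 2.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep the header comments untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1 and fields[1] == "Missing":
            continue  # skip rows that carry only an ID and 'Missing'
        yield line


# Small in-memory demo. The 'Complete' row is invented sample data,
# not taken from the real file in this issue.
sample = (
    "# BUSCO version is: 5.4.2 \n"
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength\n"
    "15at3041\tMissing\n"
    "99at3041\tComplete\tscaffold13\t100\t900\t+\t512.3\t267\n"
)
filtered = list(drop_missing_rows(io.StringIO(sample)))
```

To apply it to a real file, the same generator can be fed an open file handle and its output written to a new file, leaving the original table untouched.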

SaelinB commented 1 year ago

Another update: by testing the sequences one by one, I found that 4 of the 12 trigger this error when they are included, although I can see no differences between the files. When the error occurs, chromsyn-[1-3].pdf and .png files are also created multiple times.

cabbagesofdoom commented 1 year ago

Sorry, I am at a conference and did not see the issue when you raised it. Yes, if the first row of the BUSCO file is Missing then it fails to recognise the right BUSCO version. I will get this bug fixed.
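
To illustrate the kind of check that would sidestep this, here is a minimal Python sketch that reads the version from the `# BUSCO version is:` comment line shown in the file above, rather than from the shape of the first data row. It is not how chromsyn (an R script) actually detects the version, and the function name `busco_major_version` is hypothetical. It assumes the header line is present and returns `None` otherwise.

```python
import re


def busco_major_version(lines):
    """Return the major version from a '# BUSCO version is: X.Y.Z' header.

    Only the leading comment lines are scanned; returns None if no such
    header is found before the first data row.
    """
    for line in lines:
        if not line.startswith("#"):
            break  # reached the data rows, so the header is over
        match = re.match(r"#\s*BUSCO version is:\s*(\d+)", line)
        if match:
            return int(match.group(1))
    return None


# Header and first row copied from the file shown in this issue.
table = [
    "# BUSCO version is: 5.4.2 \n",
    "# The lineage dataset is: chlorophyta_odb10 (Creation date: 2020-08-05, "
    "number of genomes: 16, number of BUSCOs: 1519)\n",
    "15at3041\tMissing\n",
]
version = busco_major_version(table)
```

Because the result does not depend on how many fields the first row has, a leading Missing row would not affect it.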

The other error is more intriguing. Would you be OK sharing your input data with me? You can email me at rich.edwards@uwa.edu.au.