Closed MaximilianStammnitz closed 5 years ago
The following code helps bypassing the issue by excluding genes affected by this, you may replace the current line 25 in buildref() with:
id.rem.starts <- as.character(cdsfile[cdsfile[,"chr.coding.start"] == 1,'gene.id'])
if(length(id.rem.starts) > 0){
warning("Transcript range(s) coinciding with the first base of a contig. The following genes will therefore be removed: ", paste(id.rem.starts, collapse = ', '))
for (i in 1:length(id.rem.starts)){
cdsfile <- cdsfile[-grep(id.rem.starts[i], cdsfile[,'gene.id']),]
}
}
require(Biostrings)
validchrs <- readDNAStringSet(genomefile, format="fasta", use.names=TRUE)
validchrs.ind <- paste(names(validchrs), width(validchrs), sep = ':')
cdsfile.ends.ind <- paste(cdsfile[,'chr'],cdsfile[,'chr.coding.end'],sep=":")
id.rem.ends <- as.character(cdsfile[cdsfile.ends.ind %in% validchrs.ind,'gene.id'])
if(length(id.rem.ends) > 0){
warning("Transcript range(s) coinciding with the last base of a contig. The following genes will therefore be removed: ", paste(id.rem.ends, collapse = ', '))
for (i in 1:length(id.rem.ends)){
cdsfile <- cdsfile[-grep(id.rem.ends[i], cdsfile[,'gene.id']),]
}
}
validchrs <- names(validchrs)
Alternatively adding a random base at affected contig positions "0" or "N+1" actually feels quite tricky, for the first case in particular. I don't see an easy way to do this without either manipulating the original fasta genomefile input or doing so in every single instance when the object is loaded. One needs to make sure that this wouldn't permanently offset/shift the positional grid for any other downstream operations in the base change predictions, etc.
Well, this one took me a while. The option that I have chosen is to identify the transcripts that start or end at the ends of contigs and then trimming the first or last codon, respectively. I issue a warning when doing so and output a list of the genes affected.
Is there any chance that you could try the new buildref code at some point and report back if you find problems?
Many thanks for raising this issue, Inigo
Hi,
I have noticed that buildref() crashes* when BioMart annotations overlap with either scaffold position no. 1 or the terminal base. It might pose a small problem for non-model organisms with many scaffolds. Removing the (few) genes solves the issue, but probably there's a more elegant solution that doesn't need that.
*An example of a corresponding error message would be:
Error in as.vector(Rsamtools::scanFa(genomefile, gr)) :
error in evaluating the argument 'x' in selecting a method for function 'as.vector': Error in value[3L] :
record 1 (Chr1_supercontig_000003348:0-112) was truncated
file: ../../data/Devil7.1+MT.fa
Best, Max