im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0
212 stars 48 forks source link

RDA generation for new species crashes when exon boundary coincides with scaffold boundary #21

Closed MaximilianStammnitz closed 5 years ago

MaximilianStammnitz commented 5 years ago

Hi,

I have noticed that buildref() crashes* when BioMart annotations overlap with either scaffold position no. 1 or the terminal base. It might pose a small problem for non-model organisms with many scaffolds. Removing the (few) genes solves the issue, but probably there's a more elegant solution that doesn't need that.

*An example of a corresponding error message would be:

Error in as.vector(Rsamtools::scanFa(genomefile, gr)) :

error in evaluating the argument 'x' in selecting a method for function 'as.vector': Error in value[3L] :

record 1 (Chr1_supercontig_000003348:0-112) was truncated

file: ../../data/Devil7.1+MT.fa

Best, Max

MaximilianStammnitz commented 5 years ago

The following code helps bypassing the issue by excluding genes affected by this, you may replace the current line 25 in buildref() with:

First contig hits

id.rem.starts <- as.character(cdsfile[cdsfile[,"chr.coding.start"] == 1,'gene.id']) if(length(id.rem.starts) > 0){ warning("Transcript range(s) coinciding with the first base of a contig. The following genes will therefore be removed: ", paste(id.rem.starts, collapse = ', '))
for (i in 1:length(id.rem.starts)){ cdsfile <- cdsfile[-grep(id.rem.starts[i], cdsfile[,'gene.id']),] } }

Last contig base hits

require(Biostrings) validchrs <- readDNAStringSet(genomefile, format="fasta", use.names=TRUE) validchrs.ind <- paste(names(validchrs), width(validchrs), sep = ':') cdsfile.ends.ind <- paste(cdsfile[,'chr'],cdsfile[,'chr.coding.end'],sep=":") id.rem.ends <- as.character(cdsfile[cdsfile.ends.ind %in% validchrs.ind,'gene.id']) if(length(id.rem.ends) > 0){ warning("Transcript range(s) coinciding with the last base of a contig. The following genes will therefore be removed: ", paste(id.rem.ends, collapse = ', '))
for (i in 1:length(id.rem.ends)){ cdsfile <- cdsfile[-grep(id.rem.ends[i], cdsfile[,'gene.id']),] } } validchrs <- names(validchrs)

MaximilianStammnitz commented 5 years ago

Alternatively adding a random base at affected contig positions "0" or "N+1" actually feels quite tricky, for the first case in particular. I don't see an easy way to do this without either manipulating the original fasta genomefile input or doing so in every single instance when the object is loaded. One needs to make sure that this wouldn't permanently offset/shift the positional grid for any other downstream operations in the base change predictions, etc.

im3sanger commented 5 years ago

Well, this one took me a while. The option that I have chosen is to identify the transcripts that start or end at the ends of contigs and then trimming the first or last codon, respectively. I issue a warning when doing so and output a list of the genes affected.

Is there any chance that you could try the new buildref code at some point and report back if you find problems?

Many thanks for raising this issue, Inigo