Closed lshep closed 6 years ago
I think we should make it explicit, i.e., Biostrings could have a function that converts all ambiguity codes to just N. Maybe there is already an easy way to do that. Otherwise, there may be confusion.
@mtmorgan and @hpages thoughts on having the conversion function in Biostrings so it can be implemented elsewhere? This particular case specifically involves DNAStringSet -
You could just use chartr()
:
replace_ambigs <- function(x)
{
old <- paste(setdiff(names(IUPAC_CODE_MAP), DNA_BASES), collapse="")
new <- strrep("N", nchar(old))
chartr(old, new, x)
}
Keep in mind that even though the chartr()
method for XStringSet objects is highly optimized, it will still generate a copy of the DNAStringSet object. I don't know if this matters in the context of preparing 2bit files for AnnotationHub but that's something that could be avoided if C function TwoBits_write
was doing the replacement on the fly.
Note that:
if (any(uniqueLetters(object) %in% unsupported.chars))
would be a slightly more efficient (and easier to read) way of checking for the presence of unsupported chars. That's because it uses alphabetFrequency(object, collapse=TRUE)
internally which can make
a significant difference when object
is a DNAStringSet object with millions of strings.
Thanks. Might this belong in Biostrings? I could imagine other software wanting a simpler ambiguity encoding.
Done in Biostrings 2.49.1: https://github.com/Bioconductor/Biostrings/commit/b7e3c95e0cc5a89b8d60e19409e0dfbdfdf127ed
Here is the catch though: chartr()
is a base R verb and the natural thing to use to replace a set of letters with another. If people can think of using that, then they're almost done and something like replaceAmbiguities()
-- which is a trivial 3-line wrapper around chartr()
-- doesn't provide much added value for them. And if people don't think or don't know about chartr()
, then they probably won't find replaceAmbiguities()
either.
One of the motivations is increased clarity. replaceAmbiguities()
should be clearer than the call to chartr()
. To enhance discoverability at least for this use case, I will add a reference to the rtracklayer docs.
Thanks a lot for your help @hpages !
I'm not sure why the export()
method couldn't do this replacement automatically (with a warning). If that's not an option, then at least the error message could suggest the use of replaceAmbiguities()
. People don't read man pages, especially in the case of the man page for the export,DNAStringSet,character,ANY
method which is almost impossible to find for the average user.
Talking about readability, how about using uniqueLetters()
for the check instead of
freq <- alphabetFrequency(object)
if (any(rowSums(freq[,unsupported.chars,drop=FALSE]) > 0L)) ...
Sure, thanks for the pointer to uniqueLetters()
, I've clarified the condition. I also added a reference to replaceAmbiguities()
to the error message. Good point.
Excellent. Thanks!
Hi @lawremi When generating the ensmebl 93 twobit files for annotationhub I get the following ERROR for some files:
After speaking with the ensembl helpdesk they informed me of the following:
"We've had a change in policy recently, and have started allowing other ambiguity codes in our reference files, as they add extra information and are also vital for working with CRAM files. If you are using software that cannot work with these codes, you may need to convert them all to Ns yourself."
This seems like it should be therefore checked but corrected for in the rtracklayer::export function where the ERROR above stems from.
https://github.com/lawremi/rtracklayer/blob/master/R/twobit.R#L65