lawremi / rtracklayer

R interface to genome annotation files and the UCSC genome browser
Other
29 stars 17 forks source link

exporting twoBit #9

Closed lshep closed 6 years ago

lshep commented 6 years ago

Hi @lawremi When generating the ensmebl 93 twobit files for annotationhub I get the following ERROR for some files:

ERROR [2018-08-01 11:37:15] error processing Cebus_capucinus.Cebus_imitator-1.0.dna_rm.toplevel.fa.gz: One or more strings contain unsupported ambiguity characters.
Strings can contain only A, C, G, T or N.

After speaking with the ensembl helpdesk they informed me of the following:

"We've had a change in policy recently, and have started allowing other ambiguity codes in our reference files, as they add extra information and are also vital for working with CRAM files. If you are using software that cannot work with these codes, you may need to convert them all to Ns yourself."

This seems like it should be therefore checked but corrected for in the rtracklayer::export function where the ERROR above stems from.
https://github.com/lawremi/rtracklayer/blob/master/R/twobit.R#L65

lawremi commented 6 years ago

I think we should make it explicit, i.e., Biostrings could have a function that converts all ambiguity codes to just N. Maybe there is already an easy way to do that. Otherwise, there may be confusion.

lshep commented 6 years ago

@mtmorgan and @hpages thoughts on having the conversion function in Biostrings so it can be implemented elsewhere? This particular case specifically involves DNAStringSet -

hpages commented 6 years ago

You could just use chartr():

replace_ambigs <- function(x)
{   
    old <- paste(setdiff(names(IUPAC_CODE_MAP), DNA_BASES), collapse="")
    new <- strrep("N", nchar(old))
    chartr(old, new, x)
}

Keep in mind that even though the chartr() method for XStringSet objects is highly optimized, it will still generate a copy of the DNAStringSet object. I don't know if this matters in the context of preparing 2bit files for AnnotationHub but that's something that could be avoided if C function TwoBits_write was doing the replacement on the fly.

Note that:

if (any(uniqueLetters(object) %in% unsupported.chars))

would be a slightly more efficient (and easier to read) way of checking for the presence of unsupported chars. That's because it uses alphabetFrequency(object, collapse=TRUE) internally which can make a significant difference when object is a DNAStringSet object with millions of strings.

lawremi commented 6 years ago

Thanks. Might this belong in Biostrings? I could imagine other software wanting a simpler ambiguity encoding.

hpages commented 6 years ago

Done in Biostrings 2.49.1: https://github.com/Bioconductor/Biostrings/commit/b7e3c95e0cc5a89b8d60e19409e0dfbdfdf127ed

Here is the catch though: chartr() is a base R verb and the natural thing to use to replace a set of letters with another. If people can think of using that, then they're almost done and something like replaceAmbiguities() -- which is a trivial 3-line wrapper around chartr() -- doesn't provide much added value for them. And if people don't think or don't know about chartr(), then they probably won't find replaceAmbiguities() either.

lawremi commented 6 years ago

One of the motivations is increased clarity. replaceAmbiguities() should be clearer than the call to chartr(). To enhance discoverability at least for this use case, I will add a reference to the rtracklayer docs.

lawremi commented 6 years ago

Thanks a lot for your help @hpages !

hpages commented 6 years ago

I'm not sure why the export() method couldn't do this replacement automatically (with a warning). If that's not an option, then at least the error message could suggest the use of replaceAmbiguities(). People don't read man pages, especially in the case of the man page for the export,DNAStringSet,character,ANY method which is almost impossible to find for the average user.

Talking about readability, how about using uniqueLetters() for the check instead of

freq <- alphabetFrequency(object)
if (any(rowSums(freq[,unsupported.chars,drop=FALSE]) > 0L)) ...
lawremi commented 6 years ago

Sure, thanks for the pointer to uniqueLetters(), I've clarified the condition. I also added a reference to replaceAmbiguities() to the error message. Good point.

hpages commented 6 years ago

Excellent. Thanks!