isovic / graphmap

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/graphmap2
MIT License

case sensitivity for reads #13

Closed andreas-wilm closed 8 years ago

andreas-wilm commented 8 years ago

Hi Ivan,

I tried to map a PacBio FASTQ file with lower-case reads (produced by dextract), but GraphMap mapped none of them (unlike Blasr). If I uppercase them, they all map. I think I saw an uppercase function for indexing the reference, but what about reads?

Andreas

isovic commented 8 years ago

Hey Andreas! We require the bases to be uppercase; anything else is treated as an N and skipped. This was a design choice. Why would you have lowercase bases in your reads? Usually these are used for masking regions in reference sequences.

Ivan

andreas-wilm commented 8 years ago

Yes, lowercase is used in the reference for masking. But not in the reads. It just so happens that dextract (https://github.com/thegenemyers/DEXTRACTOR) outputs lower case reads and there's technically nothing wrong with it.

isovic commented 8 years ago

Ok, I see, interesting. But what if you're aligning two references with masked regions (one is the 'reference' and the other a 'read')? A mapper can't differentiate between those cases without additional command line parameters. It sounds like this should be a preprocessing step: converting all bases to uppercase?
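A minimal sketch of such a preprocessing step, assuming plain 4-line FASTQ records (header, sequence, `+`, qualities). The function name `uppercase_fastq_reads` is hypothetical and not part of GraphMap or dextract. Only the sequence line is uppercased, because quality characters are case-sensitive and must be left untouched.

```python
import io


def uppercase_fastq_reads(src, dst):
    """Copy 4-line FASTQ records from src to dst, uppercasing only the bases.

    Assumes each record is exactly four lines; multi-line FASTQ is not handled.
    Returns the number of records written.
    """
    converted = 0
    while True:
        header = src.readline()
        if not header:          # end of input
            break
        if not header.strip():  # tolerate blank lines between records
            continue
        seq, plus, qual = src.readline(), src.readline(), src.readline()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header.strip())
        # Quality characters are case-sensitive, so only the bases change.
        dst.write(header + seq.upper() + plus + qual)
        converted += 1
    return converted


# Small in-memory demonstration with one mixed-case read.
demo_in = io.StringIO("@read1\nacgtnACGT\n+\nabcdeFGHI\n")
demo_out = io.StringIO()
uppercase_fastq_reads(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

With real files, the same function can be called on two handles opened with `open(path)` and `open(path, "w")`, and the converted file passed to the mapper.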

andreas-wilm commented 8 years ago

Good point. I fear, though, that some people might run into the same problem I did without knowing what's causing it. How about making reads.upper() the default and adding a --no-auto-upper option?

andreas-wilm commented 8 years ago

Since lower-case FASTQ files should be the exception and your point about reference alignment is valid, how about printing a warning message when a read is entirely lower-case? An all-lower-case sequence doesn't make sense in any setting where case is used for masking.
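The warning proposed here could look roughly like the sketch below. It is an illustration only, not GraphMap's code; the function name `warn_lowercase_reads` and the message wording are assumptions. A read is flagged only when every cased character in it is lower-case, so partially masked sequences pass silently.

```python
import sys


def warn_lowercase_reads(records, stream=None):
    """Warn about reads whose sequence is entirely lower-case.

    records: iterable of (name, sequence) pairs.
    Returns the names of the flagged reads.
    """
    stream = sys.stderr if stream is None else stream
    flagged = []
    for name, seq in records:
        # str.islower() is True only if the string has at least one cased
        # character and all cased characters are lower-case.
        if seq.islower():
            flagged.append(name)
            print(f"warning: read '{name}' is entirely lower-case", file=stream)
    return flagged
```

Mixed-case input such as a soft-masked reference used as a 'read' would not trigger the warning, which matches the concern raised earlier in the thread.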

isovic commented 8 years ago

The new version (v0.3.0) now converts all sequences to upper-case by default (there are no special command line parameters to turn this on or off).

dmacmillan commented 6 years ago

@isovic Sorry to revive this old ticket but I found that GraphMap (v0.5.2) still treats lowercase nucleotides differently from uppercase. I tested alignments before and after converting the input reads to uppercase and found a substantial difference in the output. I would have expected them to be identical given what you've stated above. Just thought I'd let you know!
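A comparison of this kind can be scripted. The sketch below assumes both runs wrote SAM output and compares, per read, the flag, reference name, position, and CIGAR string (SAM columns 2, 3, 4, and 6). The function names are hypothetical, and this is not necessarily how the comparison above was done.

```python
def sam_alignments(lines):
    """Map read name -> set of (flag, rname, pos, cigar), skipping header lines."""
    result = {}
    for line in lines:
        if not line.strip() or line.startswith("@"):  # SAM header lines start with '@'
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, pos, cigar = fields[0], fields[1], fields[2], fields[3], fields[5]
        result.setdefault(qname, set()).add((flag, rname, pos, cigar))
    return result


def differing_reads(lines_a, lines_b):
    """Return sorted names of reads whose alignments differ between two SAM inputs."""
    a, b = sam_alignments(lines_a), sam_alignments(lines_b)
    return sorted(q for q in a.keys() | b.keys() if a.get(q) != b.get(q))
```

Passing two open SAM file handles, one from the lower-case run and one from the uppercase run, returns the reads whose placement changed; an empty list would mean the two runs agree on these fields.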