isovic / graphmap

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/graphmap2
MIT License
178 stars 44 forks source link

no reads mapping #73

Open markrobinsonuzh opened 6 years ago

markrobinsonuzh commented 6 years ago

I'm trying to map cDNA reads to the human genome, but I get a SAM file with no mappings.

I think I'm following the instructions; here is my call:

mark@imlstaupo:/graphmap$ graphmap align -t 20 -r GRCh38.genome.fa.gz --gtf Homosapiens.GRCh38.78.gtf -d ../FASTQ/wt2.fastq.gz -o wt2.sam [16:17:07 BuildIndexes] Loading GTF annotations. [16:17:58 BuildIndexes] Loading genomic sequences. [16:18:49 BuildIndexes] Generating the transcriptome. [16:18:49 GenerateTranscriptomeSeqs] Constructing the transcriptome sequences. [16:18:49 GenerateTranscriptomeSeqs] In total, there are 180 transcripts. [16:18:49 SetupIndex] Loading index from file: 'GRCh38.genome.fa.gz.gmidx'. [16:18:49 Index] Memory consumption: [currentRSS = 170 MB, peakRSS = 3433 MB]

[16:18:49 Run] Hits will be thresholded at the percentil value (percentil: 99.000000%, frequency: 2). [16:18:49 Run] Minimizers will be used. Minimizer window length: 5 [16:18:49 Run] Automatically setting the maximum allowed number of regions: max. 500, attempt to reduce after 0 [16:18:49 Run] Reference genome is assumed to be linear. [16:18:49 Run] Only one alignment will be reported per mapped read. [16:18:49 ProcessReads] Reads will be loaded in batches of up to 1024 MB in size. [16:19:05 ProcessReads] Batch of 465975 reads (1024 MiB) loaded in 15.79 sec. (22560984 bases) [16:19:05 ProcessReads] Memory consumption: [currentRSS = 1327 MB, peakRSS = 3433 MB] [16:19:05 ProcessReads] Using 20 threads. [16:40:15 ProcessReads] [CPU time: 24844.55 sec, RSS: 1330 MB] Read: 285278/465975 (61.22%) [m: 0, u: 285259], length = 788, qname: 5af3e7ab-d0a6-4333-8be9-fa8f...

mark@imlstaupo:/graphmap$ grep -v LN wt2.sam | awk '{print $3}' | uniq -c [snip] 269005 * (this was a partial file)

One strange thing is that it says "In total, there are 180 transcripts.". There are of course many thousands. The GTF and genome FASTA are direct from Ensembl. I've independently mapped with gmap, so I also know there are lots of valid mappings (~65%).

Am I doing something wrong? I'm happy to make files available if it would help.

Thanks, Mark

isovic commented 6 years ago

Hi Mark, This is strange. Perhaps it's a data related issue, or a bug in parsing. Could you by any chance provide a direct link to the exact reference file you used and the gtf, so I can try running it locally? Also, a small sample of the unmapped reads would be really helpful.

Thank you! Best regards, Ivan.