smithlabcode / methpipe

A pipeline for analyzing DNA methylation data from bisulfite sequencing.
http://smithlabresearch.org/methpipe
67 stars 27 forks source link

Error "chromosome in SAM file not found in reference:chr1" #184

Closed hchetia closed 3 years ago

hchetia commented 3 years ago

I get the error "chromosome in SAM file not found in reference:chr1 "when running methcounts command "methcounts -c /methpipe_reference/mm27 -o 1979.meth 1979.mr.sorted_start" I am using methpipe v4.1.2 alpha. The mr file was generated by converting the deduplicated bam file from bismark to sam using samtools; then converted to mr using format_reads and subsequently sorted to sorted_start.

hchetia commented 3 years ago

Could issue #183 be related and do I have to reinstall my methpipe v4.1.2 alpha (since issue 183 was fixed only 8 days back).?

hchetia commented 3 years ago

@andrewdavidsmith please advice! :)

andrewdavidsmith commented 3 years ago

I think I know the problem, and it should be fixed soon in general. But for now, can you check if the "name" lines of your original fasta format reference genome has text after the first space? I think some of our tools are parsing chromosome names differently. The intended behavior is to ignore text after the first whitespace character, but I think some of the tools are keeping any additional letters that appear on the same line of the fasta file.

hchetia commented 3 years ago

Yes. It looks like this-

chr1 1 NNNN chr2 2 NNNN

I am using /GRCh38.p13.genome.fa. What can do I to solve this problem for now? Would removing or concatenating the headers as chr1_1 help?

hchetia commented 3 years ago

Will try removing the white space using bbtools for now and follow up.

hchetia commented 3 years ago

My bam file comes from aligning to the same genome but with white spaces in headers using bismark. Will that step need to replicated with this modified genome.fasta?

guilhermesena1 commented 3 years ago

Hi @hchetia You do not need to re-map or reanalyze your data, I think the simplest solution that could work is just to remove the spaces from your genome and pass that onto methcounts. A very simple way to do this is using awk. For example, if your genome is called genome.fa, you can run

awk '{print $1}' genome.fa >genome_nospaces.fa

then pass genome_nospaces.fa to methcounts (or bsrate) along with your BAM/SAM/MR file. This should get rid of the problem. Please let us know whether or not it does.

Also just a quick clarification: The MR file format is no longer used in version 4.1.2, so format_reads generates a SAM file output which simply merges paired-end reads and converts SAM files generated from different mappers to a common standard that is then used by the other programs. Please make sure you set the -f flag informat_reads to bismark (e.g. something like format_reads -f bismark -o output_formatted.sam input.bam) so the output is properly formatted accounting for the fact that your SAM file comes from bismark.

hchetia commented 3 years ago

Hereby confirming that this strategy "changing the headers of genome file" worked and methcounts is running successfully. Yes, using .mr was a mistake on my part.

Thanks to both of you. 👍