dmr says hmr file is not sorted

smithlabcode / dnmtools

Tools for analyzing DNA methylation data

https://dnmtools.readthedocs.io

GNU General Public License v3.0

27 stars 9 forks source link

dmr says hmr file is not sorted #34

Closed msierk closed 6 months ago

msierk commented 1 year ago

I produced hmr files using the following command: dnmtools hmr -o ${METHA}_hypo.bed ${METHA}.sym.meth

However when I run dnmtools dmr I get the error: regions not sorted in file: 231_NO_a_BS_hypo.bed

The hmr output had chrM after chrX, so I tried sorting using: sort -k1,1V -k2,2n 231_NO_a_oxBS_hypo.bed > 231_NO_a_oxBS_hypo.sorted.bed but I still get the same error. I've attached the two bed files. There are some random contigs (like chr14_GL000225v1_random), but as far as I can tell those are sorted correctly in the sorted file.

Can you provide some explanation as to how these files need to be sorted to get dmr to run? Or can you have dmr print out the line where it encounters something it thinks is out of order?

231_NO_a_BS_hypo.bed.txt 231_NO_a_BS_hypo.sorted.bed.txt

andrewdavidsmith commented 1 year ago

@msierk I can look into this very soon, but can you confirm your LOCALE? I personally think I've been sorting using:

sort -k 1,1 -k 2,2g -k 3,3g -k 6,6

for a 6-column BED file with strand in the final position. But I'll check later today to confirm. The LOCALE has historically been a problem for this, as that's the part that makes the difference for chrom order.

msierk commented 1 year ago

I'm on biowulf.

(base) [sierkml@cn4286 bismark2]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

msierk commented 1 year ago

BTW, I don't know that this is related to the sorting issue, but I just discovered that the output from diff does not include any of the sites from chr10-chr22 - it skips straight from chr9 to chrX. I include the transition below, but the whole file is 127MB (and the .meth files it is derived from are 1.4GB). If this is an entirely separate issue I can create a new issue for it.

chr9    138220957       -       CpG     0.652632        7       0       10      1
chr9    138279141       -       CpG     0.6     2       0       1       0
chr9    138279153       -       CpG     0.6     2       0       1       0
chr9    138279202       -       CpG     0.6     2       0       1       0
chrX    10509   +       CpGx    0.285714        2       2       1       0
chrX    10510   -       CpG     0.166667        2       0       14      0
chrX    10574   +       CpG     0.714286        4       0       1       0
chrX    15218   -       CpG     0.5     1       0       1       0

andrewdavidsmith commented 1 year ago

@msierk can you confirm where you obtained your reference genome? I think I know what's going on, but it seems like an issue we had previously solved. If you could send me the following by email it would help: the chromosome order in the reference genome file (grep ^> reference.fa be careful with the >) and the order in the counts file (like cut -f 1 counts.meth.sym | uniq) it would be very helpful.

Edit: considering even more, you might be able to tell me the first file in the pipeline that is missing the chromosomes you mention. Based on your comments above I think you'd know the shell commands to get that info.

andrewdavidsmith commented 6 months ago

@msierk I'm closing this due to inactivity. If you have further problems open another issue.