smithlabcode / dnmtools

Tools for analyzing DNA methylation data
https://dnmtools.readthedocs.io
GNU General Public License v3.0
26 stars 9 forks source link

amrfinder output #71

Open maggietsui opened 1 year ago

maggietsui commented 1 year ago

Hello,

I have a question regarding the output from running amrfinder. One of the columns looks something like "+:136". Is this the score for the input region? Also, what does the +/- denote? Thanks.

andrewdavidsmith commented 1 year ago

@maggietsui What version are you using? I'm not able to reproduce any output from amrfinder that seems like what you describe, but you've also only partly described it. Maybe you could paste in a few lines?

maggietsui commented 1 year ago

Apologies, I got the commands mixed up--I actually used amrtester not amrfinder. I'm using version 1.1.0 which was installed using conda. I first ran dnmtools format on WGBS bam files, then dnmtools states on the output to produce a ".epiread" file, then dnmtools amrtester with a bedfile of regions to output a ".amr" file. Here are the first few lines:

NC_000001.11    257864  359681  -:2748  0.0411185   + 
NC_000001.11    359681  361681  -:6 1   +
NC_000001.11    516376  516479  -:10    1   +
NC_000001.11    516479  518479  -:32    1   +
NC_000001.11    1722512 1724512 +:136   0.998533    +
NC_000001.11    1724512 1737251 +:1501  0.914237    +
NC_000001.11    1724838 1745992 -:2038  1   +
NC_000001.11    1745992 1747992 -:92    1   +
NC_000001.11    1921951 2003837 -:8245  0   +

I see that column 4 is the p-value, is column 3 the score?

andrewdavidsmith commented 1 year ago

Your input is wrong. Can you confirm that you provided the mapped reads, converted with "states" and the exact same reference genome as used when mapping reads?

maggietsui commented 1 year ago

Here is the exact command I used:

dnmtools amrtester \
    -c /data1/home/mtsui/allele_specific_methylation/ncbi-genomes-2023-04-26/GCF_000001405.26_GRCh38_genomic.fna \
    -o /data1/home/mtsui/allele_specific_methylation/asm_scores/"$patient"_output.amr \
    /data1/home/mtsui/allele_specific_methylation/regions_to_test/"$patient"_ase_assessable_promoter_genebody_sorted.bed /data1/home/mtsui/allele_specific_methylation/epiread/"$patient".epiread

I wasn't able to obtain the exact same reference that these files were aligned with since they were aligned on a different platform long ago...This one was the closest I could find.

andrewdavidsmith commented 1 year ago

The output formats for amrfinder and amrtester will differ. This issue seems to be about amrtester. In the case of amrtester the output will involve the genomic intervals you give it as input, while amrfinder identifies its own genomic intervals.

But all of these methods need to be able to confirm that the CpG sites within the mapped reads are actually CpG sites. Otherwise they could be just sequencing errors, or somatic variation in the cell sample, or any number of things. So when you run dnmtools states, if you are not using the exact same reference genome, it might give you some pretty arbitrary behaviors when making your epireads.

If you have the original BAM files, there's a good chance you can deduce the reference genome used. If you use samtools view -H it should show you the header with the names of "targets" (chromosomes) and the length of each, in nucleotides. Those lengths should be enough to confirm if the reference was hg38 (likely). But also it would be helpful to use a reference genome with chromosome names rather than the NCBI RefSeq accessions for chromosomes, which I think I see in your example output. I suspect the program would fail if the chromosome names weren't at least the same, so you might have gotten lucky here.

The entries like -:2748 that you see in your output, the - that you see is almost certainly because the input BED format file you are providing has the strand in the wrong column. You seem to have it in the 4th column, but the 4th column of a 6-column BED file is the name of the interval, and the strand should be in the 6th column. Please see this documentation for the definition of BED format https://genome.ucsc.edu/FAQ/FAQformat.html#format1 The "name" must appear if the strand does, and if you decide to use either + or - as the name of your interval, then it will be reported in the output as part of the name (4th column).

andrewdavidsmith commented 1 year ago

@maggietsui sorry -- I was reading your comments backwards. Thanks, I see you mentioned it's amrtester. I think the +/- issue should be explained by the input format of the BED file. Hopefully we can get this resolved and working for you.

maggietsui commented 1 year ago

Thank you so much for the pointers--I accidentally had the strand in the 4th column of my bedfiles instead of the name of the region. Now I am getting an output like this:

NC_000001.11    1722512 1724512 AL031282.1_promoter:136 0.998533    +
NC_000001.11    1724512 1737251 AL031282.1_gene:1501    0.914237    +
NC_000001.11    1724838 1745992 SLC35E2A_gene:2038  1   +
NC_000001.11    1745992 1747992 SLC35E2A_promoter:92    1   +
NC_000001.11    1921951 2003837 CFAP74_gene:8245    0   +

Just to confirm, in the 4th column the number after ":" is the score for that region? Thanks.

andrewdavidsmith commented 1 year ago

I don't think so and can't check right now. I'll try to check out the documentation and update if needed later today. I usually use that for information about the amount of data available for the interval. The "score" you want is in the 5th column and that's the p-value for the null hypothesis of no ASM. The intervals where there is ASM detected will have a low value for this.