streetslab / dimelo

python package for analysis of dimelo-seq & nanopore modified base data
MIT License
3 stars 5 forks source link

Problem getting methylation by base data in non chromosome X and Y from diploid bam #34

Closed yxu405 closed 1 year ago

yxu405 commented 1 year ago

Hi! I have a diploid HG002 dimelo-seq bam file. After running both of these commands: dm.parse_bam("/Data1/secphased_mat_pat/winnowmap_verkko.sorted.secphase_v0.2.0.merge.sorted.bam", "" , "/Data1/secphased_mat_pat/parsed", region = "chrX_MATERNAL:57569220-57575033", basemod="A", threshA=190, cores=8)

dm.parse_bam("/Data1/secphased_mat_pat/winnowmap_verkko.sorted.secphase_v0.2.0.merge.sorted.bam", "chr10" , "/Data1/secphased_mat_pat/parsed", region = "chr10_MATERNAL:710331-710427", basemod="A", threshA=190, cores=8)

I was able to get the by base and aggregate data for chr X, the aggregate data for chr 10. However, the methylation by base data for chr 10 is empty.

The bam file should have all the methylation information, and read id information. Is there a parameter I have to set for this specific issue?

Thank you

thekugelmeister commented 1 year ago

Hi!

That is very strange. Both tables should be populated by parse_bam both times. I did a quick test on my end running the exact commands you provided (including the empty sampleName argument for chrX), albeit with a different input BAM and different regions, and everything worked as expected.

My best guess is that the chr10 part did not actually run to completion, possibly due to a memory issue. Can you verify whether there are .bed output files for both of the regions in the output directory "/Data1/secphased_mat_pat/parsed"? Alternatively, can you check whether the chr10 aggregate table has all of the data you expect in it?

yxu405 commented 1 year ago

Hi Jeremy! Thank you so much for your response! I double-checked and made sure the function ran to completion. Furthermore, there is room on my instance. I have run the function again, though, with the chromosome window expanded, using this command. dm.parse_bam("/Data1/secphased_mat_pat/winnowmap_verkko.sorted.secphase_v0.2.0.merge.sorted.bam", "chr10" , "/Data1/secphased_mat_pat/parsed", region = "chr10_MATERNAL:710331-720000", basemod="A", threshA=190, cores=8)

I found that there is only one base in the methylationByBase_chr10 column output, while methylationAggregate_chr10 outputs 6108 mA bases. Shouldn't all the bases in the methylation aggregate show up in the methylation by base column?

thekugelmeister commented 1 year ago

Ah, okay! Technically, no, not all bases show up in the methylationByBase table.

In other words, if there was only one 6mA in the given region across all reads, the methylationByBase table would have a single entry, while the methylationAggregate table would have multiple entries spanning the given region.

Put yet another way, the following two SQL queries should return the same values:

SELECT COUNT(*) FROM methylationByBase_chr10;
SELECT SUM(methylated_bases) FROM methylationAggregate_chr10;

Can you verify that this holds true for your data?

thekugelmeister commented 1 year ago

Hi @yxu405,

Have you made any progress on this? Is there anything else we can do to help?

yxu405 commented 1 year ago

Hi Jeremy! Yeah! That cleared things up! Thank you so much!

thekugelmeister commented 1 year ago

Glad to hear it. To follow up, in case this matters: I forgot that there is an option to parse_bam that may provide the exact functionality you expected:

extractAllBases
        One of the following:
        * ``'True'`` - Store all base mod calls, regardles of methylation probability threshold. Bases stored are those that can have a modification call (A, CG, or both depending on ``basemod`` parameter) and are sequenced bases, not all bases in the reference.
        * ``'False'`` - Only modifications above specified threshold are stored

Hopefully this helps if you need it! Otherwise, happy to provide clarification.