nloyfer / wgbs_tools

tools for working with Bisulfite Sequencing data while preserving reads intrinsic dependencies
Other
125 stars 33 forks source link

find_markers for RRBS #44

Closed AndriesDeKoker closed 10 months ago

AndriesDeKoker commented 1 year ago

Hi,

I have been trying to U-markers specific for RRBS using wgbstools find_markers --betas *hg38.beta -b RRBS_regions_hg38 -g Group_file --only_hypo

but I was running into this error: https://stackoverflow.com/questions/75798944/how-to-solve-invalid-input-argument-popmean-shapeaxis-must-equal-1-when-r/76235648#76235648 and it seems to be related to the groups where only one beta file is available (because if I'm ignoring these, the code runs) I made the group file using the 'Group' or 'refined_group' column in suppl table 1 But in the published markerfiles and atlases regions are available for these groups containing one file (Epid-Kerat, Osteo, etc), how should I generate markers for these entities?

Could more info be given about the block-file for the RRBS regions, how these were made? I now performed an in-silico digest with MspI and restricting the fragment lengths from 20-200bp.

Kind regards, Andries

nloyfer commented 1 year ago

Hi there,

Thank you for bringing this bug to my attention. I found that the error message was originating from the scipy.stats.ttest_ind function when one of the input arguments was not 2D. This occurs when the target/background has only one sample (n=1).

I have fixed the issue in find_markers.py, and it should now be working as expected. Thank you again.

nloyfer commented 1 year ago

About the RRBS atlas. We selected regions that showed high coverage across multiple samples in some large RRBS cohort (METABRIC). Perhaps this way of choosing "RRBS regions" is not the best way. If you have your own set of regions where you know you are getting high coverage, you can make your own "RRBS atlas".

Once you have a bed file with such regions, you can intersect it with the "whole genome blocks", then find markers with this block file as reference

bedtools intersect -a blocks.s205.5CpG.bed.gz -b RRBS_regions.bed.gz -wa -u > blocks.5CpG.RRBS.bed
wgbstools find_markers --config_file CONFIG_FILE -b blocks.5CpG.RRBS.bed
AndriesDeKoker commented 1 year ago

Hi there,

Thank you for bringing this bug to my attention. I found that the error message was originating from the scipy.stats.ttest_ind function when one of the input arguments was not 2D. This occurs when the target/background has only one sample (n=1).

I have fixed the issue in find_markers.py, and it should now be working as expected. Thank you again.

Hi,

I indeed uninstalled scipy and it ran without ttest. I pulled this fix and am rerunning with scipy atm

Thanks!

AndriesDeKoker commented 1 year ago

About the RRBS atlas. We selected regions that showed high coverage across multiple samples in some large RRBS cohort (METABRIC). Perhaps this way of choosing "RRBS regions" is not the best way. If you have your own set of regions where you know you are getting high coverage, you can make your own "RRBS atlas".

Once you have a bed file with such regions, you can intersect it with the "whole genome blocks", then find markers with this block file as reference

bedtools intersect -a blocks.s205.5CpG.bed.gz -b RRBS_regions.bed.gz -wa -u > blocks.5CpG.RRBS.bed
wgbstools find_markers --config_file CONFIG_FILE -b blocks.5CpG.RRBS.bed

aha, well, I guess that getting RRBS regions from a lot of datasets would do the job as well. I did it this way:

  1. insilico digest with MspI of hg38 (chr - start - stop in bed format)
  2. wgbstools convert --bed_file RRBS_regions.bed > RRBS_block_file

This is not the way to do it?

EDIT: I think I see the reasoning doing it not this way as you don't know if the RRBS-blocks are comethylated at this point, right?

AndriesDeKoker commented 1 year ago

wgbstools find_markers is acting somewhat strange at the moment, giving lots of NA-markers?

#chr start end startCpG endCpG target region lenCpG bp tg_mean bg_mean delta_means delta_quants delta_maxmin ttest direction NA NA NA NA NA Adipocytes NA nanCpGs nanbp 0.161 0.842 0.681 0.0271 -0.162 1.03e-08 U chr1 1.84e+08 1.84e+08 1.77e+06 1.77e+06 Adipocytes chr1:183870545.0-183870631.0 5.0CpGs 86.0bp 0.142 0.809 0.667 0.0138 -0.0822 4.93e-07 U chr4 1.43e+08 1.43e+08 7.36e+06 7.36e+06 Adipocytes chr4:143282734.0-143282862.0 8.0CpGs 128.0bp 0.225 0.88 0.655 0.2 -0.15 1.25e-11 U chr9 3.34e+07 3.34e+07 1.4e+07 1.4e+07 Adipocytes chr9:33402120.0-33402450.0 7.0CpGs 330.0bp 0.142 0.795 0.653 0.0775 -0.03141.29e-07 U chr2 1.27e+08 1.27e+08 3.58e+06 3.58e+06 Adipocytes chr2:127121016.0-127121256.0 5.0CpGs 240.0bp 0.224 0.87 0.646 0.231 -0.149 1.28e-12 U NA NA NA NA NA Adipocytes NA nanCpGs nanbp 0.24 0.874 0.634 0.0661 -0.148 1.25e-09 U NA NA NA NA NA Adipocytes NA nanCpGs nanbp 0.185 0.813 0.628 0.0202 -0.0675 5.61e-09 U NA NA NA NA NA Adipocytes NA nanCpGs nanbp 0.273 0.898 0.625 0.18 0.0341 1.52e-16 U NA NA NA NA NA Adipocytes NA nanCpGs nanbp 0.087 0.696 0.609 0.0424 -0.0713 2.82e-05 U

nloyfer commented 1 year ago

Right, I accidentally broke something in beta_to_table.py in 7df331a (when dealing with the fragmentation warning), and fixed it a few hours later in f4031b8. Sorry about that. Please pull again.

nloyfer commented 1 year ago

About the RRBS atlas. We selected regions that showed high coverage across multiple samples in some large RRBS cohort (METABRIC). Perhaps this way of choosing "RRBS regions" is not the best way. If you have your own set of regions where you know you are getting high coverage, you can make your own "RRBS atlas". Once you have a bed file with such regions, you can intersect it with the "whole genome blocks", then find markers with this block file as reference

bedtools intersect -a blocks.s205.5CpG.bed.gz -b RRBS_regions.bed.gz -wa -u > blocks.5CpG.RRBS.bed
wgbstools find_markers --config_file CONFIG_FILE -b blocks.5CpG.RRBS.bed

aha, well, I guess that getting RRBS regions from a lot of datasets would do the job as well. I did it this way:

  1. insilico digest with MspI of hg38 (chr - start - stop in bed format)
  2. wgbstools convert --bed_file RRBS_regions.bed > RRBS_block_file

This is not the way to do it?

That works too. Intersecting RRBS_regions.bed with the blocks file might be an improvement, because it's possible some of the RRBS regions are not homogeneous (methylation-wise), and it's better to break them into smaller blocks

AndriesDeKoker commented 1 year ago

Thank you very much already!

And maybe just a small bug (or I am doing something wrong) but calling top:25 or top:250 in the config file does not seem to do anything?

nloyfer commented 10 months ago

Fixed. Thank you