nloyfer / wgbs_tools

tools for working with Bisulfite Sequencing data while preserving reads intrinsic dependencies
Other
125 stars 33 forks source link

wgbstools segment error #17

Closed gibberwocky closed 1 year ago

gibberwocky commented 1 year ago

I'm attempting to segment the (hg19) beta files available from here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE186458

I'm getting the following error:

[ add_loci ] Failed! exception: [wt add_loci] line 0: endCpG < startCpG [wt segment] found 74,349 blocks (dropped 106,621 short blocks) Traceback (most recent call last): File "./wgbs_tools/wgbstools", line 96, in <module> main() File "./wgbs_tools/wgbstools", line 63, in main importlib.import_module(args.command).main() File "./wgbs_tools/src/python/segment.py", line 310, in main SegmentByChunks(args, betas).run() File "./wgbs_tools/src/python/segment.py", line 147, in run self.dump_result(df.reset_index(drop=True)) File "./wgbs_tools/src/python/segment.py", line 183, in dump_result add_bed_to_cpgs(temp_path, self.genome.genome, self.args.out_path) File "./wgbs_tools/src/python/convert.py", line 245, in add_bed_to_cpgs subprocess.check_call(cmd, shell=True) File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/subprocess.py", line 291, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'cat qfp0iepq | ./wgbs_tools/src/cpg2bed/add_loci ./wgbs_tools/references/hg19/CpG.bed.gz ./wgbs_tools/references/hg19/CpG.chrome.size > ./GSE186458/GSE186458.chr20.blocks.small.bed' returned non-zero exit status 1. Segmenting

Any thoughts on what the problem might be and how to fix it?

Thanks

D

gibberwocky commented 1 year ago

If I attempt to use pre-generated blocks download from the same GSE repository then I get the following error:

Traceback (most recent call last): File "./wgbs_tools/wgbstools", line 96, in <module> main() File "./wgbs_tools/wgbstools", line 63, in main importlib.import_module(args.command).main() File "./wgbs_tools/src/python/beta_to_table.py", line 150, in main for chunk in chunks: File "./wgbs_tools/src/python/beta_to_table.py", line 133, in beta2table_generator yield get_table(subset_blocks, gf, min_cov, threads, verbose) File "./wgbs_tools/src/python/beta_to_table.py", line 68, in get_table is_nice, _ = is_block_file_nice(blocks_df) File "./wgbs_tools/src/python/beta_to_blocks.py", line 28, in is_block_file_nice if df[['startCpG', 'endCpG']].isna().values.sum() > 0: File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/site-packages/pandas/core/generic.py", line 3081, in __getattr__ return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'isna'

So I'm not having much luck, unfortunately.

yonniejon commented 1 year ago

Hi, can you post the exact command you used? And some example head command output of your input files? A

gibberwocky commented 1 year ago

Here is the command I've used for segmenting:

${WGBSTOOLS} segment --threads ${NSLOTS} --genome hg19 --betas ${IN}/*.beta --min_cpg 4 --max_bp 5000 -r chr${SGE_TASK_ID} -o ${PREFIX}.chr${SGE_TASK_ID}.blocks.small.bed

And here is an example of some chr20 data from from one of the files:

${WGBSTOOLS} view GSM5652211_Lung-Bronchus-Smooth-Muscle-Z00000421.beta | grep "^chr20" | head chr20 60178 60180 3 33 chr20 60425 60427 28 34 chr20 60431 60433 26 35 chr20 60550 60552 7 24 chr20 60577 60579 3 25 chr20 60721 60723 7 30 chr20 60806 60808 24 31 chr20 60962 60964 28 49 chr20 61069 61071 23 44 chr20 61181 61183 27 35

Edit:

For some reason github isn't liking pasting from linux. The output should be showing as a standard beta file format, but appears to be lacking carriage returns once pasted.

${WGBSTOOLS} view GSM5652211_Lung-Bronchus-Smooth-Muscle-Z00000421.beta | grep "^chr20" | head chr20 60178 60180 3 33 chr20 60425 60427 28 34 chr20 60431 60433 26 35 chr20 60550 60552 7 24 chr20 60577 60579 3 25 chr20 60721 60723 7 30 chr20 60806 60808 24 31 chr20 60962 60964 28 49 chr20 61069 61071 23 44 chr20 61181 61183 27 35

gibberwocky commented 1 year ago

Just to add to that, I've tried this with just one input beta file and the error is the same. If I use awk to output any lines where field 3 is less than field 2, then no lines are reported. This indicates that none of the lines in the beta file has an end position less than the start position.

${WGBSTOOLS} view GSM5652176_Adipocytes-Z000000T7.beta | awk '{if($3<$2) print $0;}' -

yonniejon commented 1 year ago

I'm not reproducing the error on my side with some random chromosomes i tried. Which chromosome? chromosome 20? all of them? I downloaded the data from the link you posted and set up a clean wgbstools and ran and did not reproduce. My best guess is that our reference/annotation data is different but I will try to see if I can find other causes.

gibberwocky commented 1 year ago

I've just tried following the tutorial and run into the same problem at the point of segmenting:

${WGBSTOOLS} segment --betas *beta --min_cpg 3 --max_bp 2000 -r $region -o blocks.small.bed [wt segment] found 9 blocks (dropped 9 short blocks) [ add_loci ] Failed! exception: [wt add_loci] line 0: endCpG < startCpG Traceback (most recent call last): File "./wgbs_tools/wgbstools", line 96, in <module> main() File "./wgbs_tools/wgbstools", line 63, in main importlib.import_module(args.command).main() File "./wgbs_tools/src/python/segment.py", line 310, in main SegmentByChunks(args, betas).run() File "./wgbs_tools/src/python/segment.py", line 147, in run self.dump_result(df.reset_index(drop=True)) File "./wgbs_tools/src/python/segment.py", line 183, in dump_result add_bed_to_cpgs(temp_path, self.genome.genome, self.args.out_path) File "./wgbs_tools/src/python/convert.py", line 245, in add_bed_to_cpgs subprocess.check_call(cmd, shell=True) File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/subprocess.py", line 291, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'cat sp68njbk | ./wgbs_tools/src/cpg2bed/add_loci ./wgbs_tools/references/hg19/CpG.bed.gz ./wgbs_tools/references/hg19/CpG.chrome.size > blocks.small.bed' returned non-zero exit status 1.

So I'm assuming its a problem with my software environment. Python version 3.6.3 is loaded, along with samtools 1.13 and BEDtools2.30.0. There were no warnings when installing wgbstools:

$ python setup.py Compiling stdin2beta... SUCCESS Compiling stdin2pairs... SUCCESS Compiling pat_sampler... SUCCESS Compiling patter... SUCCESS Compiling bp_patter... SUCCESS Compiling snp_patter... SUCCESS Compiling match_maker... SUCCESS Compiling segmentor... SUCCESS Compiling cview... SUCCESS Compiling homog... SUCCESS Compiling add_cpg_counts... SUCCESS Compiling add_loci... SUCCESS

yonniejon commented 1 year ago

Which pandas version are you using?

gibberwocky commented 1 year ago

I'm using the default init_genome command:

$ ./wgbstools init_genome hg19
[wt init] Setting up genome reference files in ./wgbs_tools/references/hg19 [wt init] No reference FASTA provided. Attempting to download from https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 904M 100 904M 0 0 23.8M 0 0:00:38 0:00:38 --:--:-- 25.6M [wt init] successfully downloaded FASTA. Now gunzip and bgzip it... [wt init] Indexing ./wgbs_tools/references/hg19/hg19.fa.gz [wt init] Generated index file: ./wgbs_tools/references/hg19/hg19.fa.gz.fai [wt init] Processing chromosomes... [wt init] chromosome: chr1 [wt init] chromosome: chr2 [wt init] chromosome: chr3 [wt init] chromosome: chr4 [wt init] chromosome: chr5 [wt init] chromosome: chr7 [wt init] chromosome: chr6 [wt init] chromosome: chr8 [wt init] chromosome: chr9 [wt init] chromosome: chr10 [wt init] chromosome: chr11 [wt init] chromosome: chr12 [wt init] chromosome: chr13 [wt init] chromosome: chr14 [wt init] chromosome: chr15 [wt init] chromosome: chr16 [wt init] chromosome: chr17 [wt init] chromosome: chr18 [wt init] chromosome: chr19 [wt init] chromosome: chr20 [wt init] chromosome: chr21 [wt init] chromosome: chr22 [wt init] chromosome: chrX [wt init] chromosome: chrY [wt init] chromosome: chrM [wt init] Building CpG-Index dictionary... [wt init] bgzip and index... [wt init] Finished initialization of genome hg19

According to pip list I'm using pandas (0.20.3).

yonniejon commented 1 year ago

I suspect the pandas version is the issue. I have version '1.3.5'. We will try to check what is the minimum version required, but I suspect this is the issue.

gibberwocky commented 1 year ago

I've just installed pandas 1.5.1 and the tutorial segment worked fine, so it looks like that was the problem.

Thanks!

nloyfer commented 1 year ago

Excellent. Thank you for raising this issue. I updated the README dependencies section to state pandas version >1 is required