nloyfer / wgbs_tools

tools for working with Bisulfite Sequencing data while preserving reads intrinsic dependencies
Other
134 stars 37 forks source link

Fix for issue reading beta_to_450k reference #86

Closed Inforeon closed 3 weeks ago

Inforeon commented 1 month ago

Hi, I was using wgbstools for some analysis and had some problems using the beta_to_450k functionality, which I've debugged on my end. I'm making a pull request in case any one else would find this useful

error running beta_to_450k ```text multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "LOCATION/wgbstools/lib/python3.11/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) ^^^^^^^^^^^^^^^^^^^ File "LOCATION/wgbstools/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "LOCATION/wgbs_tools/src/python/beta_to_450k.py", line 18, in single_beta values = beta2vec(load_beta_data(beta_path)[indices - 1], min_cov=cov_thresh) ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^ IndexError: arrays used as indices must be of integer (or boolean) type """ The above exception was the direct cause of the following exception: Traceback (most recent call last): File "BIN_PATH/bin/wgbstools", line 97, in main() File "BIN_PATH/bin/wgbstools", line 64, in main importlib.import_module(args.command).main() File "/LOCATION/wgbs_tools/src/python/beta_to_450k.py", line 138, in main betas2csv(args) File "LOCATION/wgbs_tools/src/python/beta_to_450k.py", line 98, in betas2csv arr = p.starmap(single_beta, params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "LOCATION/wgbstools/lib/python3.11/multiprocessing/pool.py", line 375, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "LOCATION/wgbstools/lib/python3.11/multiprocessing/pool.py", line 774, in get raise self._value IndexError: arrays used as indices must be of integer (or boolean) type ```

This is fixed by casting the indices to type int

The error then changed to

Second error ```text LOCATION/wgbs_tools/src/python/beta_to_450k.py:95: RuntimeWarning: invalid value encountered in cast indices = np.array(df['cpg']).astype(int) multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "LOCATION/wgbstools/lib/python3.11/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) ^^^^^^^^^^^^^^^^^^^ File "LOCATION/wgbstools/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "LOCATION/wgbs_tools/src/python/beta_to_450k.py", line 18, in single_beta values = beta2vec(load_beta_data(beta_path)[indices - 1], min_cov=cov_thresh) ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^ IndexError: index 9223372036854775807 is out of bounds for axis 0 with size 29401795 """ The above exception was the direct cause of the following exception: Traceback (most recent call last): File "BIN_PATH/bin/wgbstools", line 97, in main() File "BIN_PATH/bin/wgbstools", line 64, in main importlib.import_module(args.command).main() File "LOCATION/wgbs_tools/src/python/beta_to_450k.py", line 138, in main betas2csv(args) File "LOCATION/wgbs_tools/src/python/beta_to_450k.py", line 98, in betas2csv arr = p.starmap(single_beta, params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "LOCATION/wgbstools/lib/python3.11/multiprocessing/pool.py", line 375, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "LOCATION/wgbstools/lib/python3.11/multiprocessing/pool.py", line 774, in get raise self._value IndexError: index 9223372036854775807 is out of bounds for axis 0 with size 29401795 ```

Note, that 9223372036854775807 was the sys.maxsize on my system, so presumably this error is the result of casting null or inf values to int. Without the int typecast, however, the first error reoccurs

With these changes though, beta_to_450k functioned as expected, and the results of downstream analysis also came out as expected

yonniejon commented 4 weeks ago

Hey @Inforeon . Thank you for the PR and looking into this issue! The issue happens because the mapping from 450k cpgs to cpgs in hg38 is not perfect so the file has some missing rows. So your logic is correct, but I think the code should be moved. So if it's ok by you move the code as I wrote in the comments and then I will accept it! Thank you

Inforeon commented 4 weeks ago

Sure no problem. I've moved the relevant code to the load_full_ref function, at the end, under an if-statement checking for hg38, and tested it again. I've also added/changed a couple comments to reflect the code change.

Also, I'm checking genome.genome, instead of args.genome, as args.genome can be 'default' while still using hg38