RuntimeError: The data contains non-finite values.

andradejon commented 1 year ago

Hello,

I'm running BINDetect using the following input:

TOBIAS BINDetect --motifs ${PFM} --signals /_footprints.bw --genome ${GENOME} --peaks tmp.clip.bed --peak_header peak_header.txt --outdir BINDetectoutput${NAME} --cores 8

My TOBIAS version is 0.15.0, and my motifs are JASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txt

It works just fine for every peakset (tmp.clip.bed) except one, and I'm not sure why. Right after the step "Processing scanned TFBS individually" completes, I get this error:

2023-03-04 13:07:24 (38353) [INFO] Progress 99.88%
2023-03-04 13:07:25 (38353) [INFO] Progress 100.0%
2023-03-04 13:07:25 (38353) [INFO] Progress done!
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/jet/home/andradej/.local/lib/python3.6/site-packages/tobias/tools/bindetect_functions.py", line 508, in process_tfbs
    obs_params = diff_dist.fit(observed_log2fcs)
  File "/jet/home/andradej/.local/lib/python3.6/site-packages/scipy/stats/_continuous_distns.py", line 351, in fit
    raise RuntimeError("The data contains non-finite values.")
RuntimeError: The data contains non-finite values.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/jet/home/andradej/.local/bin/TOBIAS", line 11, in <module>
    load_entry_point('tobias==0.15.0', 'console_scripts', 'TOBIAS')()
  File "/jet/home/andradej/.local/lib/python3.6/site-packages/tobias/TOBIAS.py", line 154, in main
    args.func(args)
  File "/jet/home/andradej/.local/lib/python3.6/site-packages/tobias/tools/bindetect.py", line 674, in run_bindetect
    results = [task.get() for task in task_list]
  File "/jet/home/andradej/.local/lib/python3.6/site-packages/tobias/tools/bindetect.py", line 674, in <listcomp>
    results = [task.get() for task in task_list]
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
RuntimeError: The data contains non-finite values.
2023-03-04 13:07:25 (38454) [ERROR] Multiprocessing logger lost connection to queue - probably due to an error raised from a child process.

Any assistance would be appreciated. Thank you.

msbentsen commented 1 year ago

Hi,

Thank you for your issue. Can you give me a little more information on your sample setup? Your example shows only one "footprints.bw", but from the log, it looks like there were several.

BINDetect is intended to be run on the full peak set of all conditions used in the input --signals - is this the case in your example?

andradejon commented 1 year ago

Oh sorry. My text was altered a little upon posting. Here's the actual command I used:

TOBIAS BINDetect --motifs ${PFM} --signals */*footprints.bw --genome ${GENOME} --peaks tmp.clip.bed --peak_header peak_header.txt --outdir BINDetect_output${NAME} --cores 8

I have 6 time points, each with at least 4 replicates. I merged the BAM files of the replicates within each time point and passed these merged BAM files to ATACorrect and ScoreBigwig using the following loop, where $COND represents the different time points (i.e. conditions), and ../diffbind/consensuspeaks.bed is the full set of consensus peaks I used for the differential analysis done previously:

while read COND && [ "${COND}" != "" ]
do
    TOBIAS ATACorrect --bam bams/${COND}.bam --genome ${GENOME} --peaks ../diffbind/consensuspeaks.bed --outdir ${COND} --cores 8
    TOBIAS ScoreBigwig --signal ${COND}/${COND}_corrected.bw --regions ../diffbind/consensuspeaks.bed --output ${COND}/${COND}_footprints.bw --cores 8
done < conditions.txt

For BINDetect, I'm essentially interested in differential TF binding within only a subset of the full consensus peakset. I'm trying to use all the signals generated by ScoreBigwig but restrict the analysis to my subset of interest. I'm not quite sure if this is the right way of doing it, so please let me know if I'm wrong here. In any case, the only thing I'm changing between runs is the subset of peaks I pass to --peaks, and I only get the runtime error with one specific subset for some reason.

Much appreciated!

msbentsen commented 1 year ago

Hi, thank you for the update.

When you are using a subset of the full consensus peaks (tmp.clip.bed), are these 100% contained within the consensuspeaks.bed, or could there be regions outside of these?
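One way to answer the containment question is to compare the two BED files directly. The following is a minimal sketch, not part of TOBIAS: the helper names `read_bed` and `uncontained` and the example coordinates are hypothetical, and a subset region only counts as contained if it lies fully inside a single reference region.

```python
def read_bed(lines):
    """Parse (chrom, start, end) tuples from BED lines, skipping blanks and header lines."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def uncontained(subset, reference):
    """Return the subset regions that are not fully inside any single reference region."""
    by_chrom = {}
    for chrom, start, end in reference:
        by_chrom.setdefault(chrom, []).append((start, end))
    missing = []
    for chrom, start, end in subset:
        candidates = by_chrom.get(chrom, [])
        if not any(r_start <= start and end <= r_end for r_start, r_end in candidates):
            missing.append((chrom, start, end))
    return missing


# Hypothetical example data; real use would pass open file handles instead,
# e.g. read_bed(open("tmp.clip.bed")) and read_bed(open("consensuspeaks.bed")).
consensus = read_bed(["chr1\t100\t500", "chr2\t1000\t2000"])
subset = read_bed(["chr1\t150\t300", "chr1\t450\t600", "chr3\t10\t50"])

outside = uncontained(subset, consensus)
print(outside)
```

An empty result would mean every subset peak sits inside the consensus set. The pairwise scan is quadratic per chromosome, which is acceptable for a quick check but not tuned for very large peak sets.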

A way you might run this setup is to use the option --output-peaks in TOBIAS BINDetect, e.g. TOBIAS BINDetect --peaks ../diffbind/consensuspeaks.bed --output-peaks tmp.clip.bed (...) which will run normalization on all peaks, but only output the differential analysis on the subset-of-interest regions. I imagine that this would solve the issue, but please let me know if not!

andradejon commented 1 year ago

Thanks for the advice! I compared my consensus peakset and subset peakset and saw that the subset was not fully contained in the larger consensus set. It turns out I was using the wrong version of the consensus peaks (uncentered and not resized) to generate footprint signals with ATACorrect and ScoreBigwig. I corrected this and was able to run BINDetect with my problem subset without even specifying --output-peaks. This is the command I used, and it ran just fine:

TOBIAS BINDetect --motifs ${PFM} --signals */*_footprints.bw --genome ${GENOME} --peaks problem_peakset.bed --peak_header peak_header.txt --outdir BINDetect_output_${NAME} --cores 8

I also tried to use the full consensus set as --peaks and specify --output-peaks at your recommendation, but this seems to raise another question that you may be able to help with.

If I run this:

TOBIAS BINDetect --motifs ${PFM} --signals */*_footprints.bw --genome ${GENOME} --peaks consensuspeaks.bed --output-peaks some_subset.bed --peak_header peak_header.txt --outdir BINDetect_output_${NAME} --cores 8

and look at the volcano plot, I get this:

But if I don't specify --output-peaks and just use my subset peaks as --peaks using this command:

TOBIAS BINDetect --motifs ${PFM} --signals */*_footprints.bw --genome ${GENOME} --peaks some_subset.bed --peak_header peak_header.txt --outdir BINDetect_output_${NAME} --cores 8

I get this volcano plot:

They give very different results, and the run using the consensus set as background only shows increases in average binding. Could this be correct? It might be important to know that all of the peaks in some_subset.bed increase in accessibility between the two conditions I'm comparing in the volcano plots, but I don't understand why this effect shows up only when I include my background peaks and not in both runs.

Thanks again. This has been a big help.

msbentsen commented 1 year ago

Hi,

The input --peaks are used to normalize signals between the conditions, so if these only contain peaks going in one direction, BINDetect will try to adjust the scores of the "lower signal" condition to fit the one of the "higher signal" condition. This does not always work so well.

Instead, you can give the full peak set in --peaks and only the subset in --output-peaks, which will still perform the normalization on the full peak set. But if your --output-peaks only contain upregulated regions, all the TFs will also be shown as increased (as you see with the shift of the volcano). So I would say this is to be expected. Hope that helps!
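The effect described above can be illustrated with a toy calculation. This is only a sketch under stated assumptions: it uses a simple median-matching scale factor as a stand-in, which is not BINDetect's actual normalization (the thread does not describe that method), and all scores and function names are hypothetical.

```python
from math import log2
from statistics import median


def scale_factor(cond_a, cond_b):
    """Toy normalization: factor that brings the median of cond_a up to that of cond_b."""
    return median(cond_b) / median(cond_a)


def mean_log2fc(cond_a, cond_b, factor):
    """Mean log2 fold change (B over A) after scaling condition A by `factor`."""
    changes = [log2(b / (a * factor)) for a, b in zip(cond_a, cond_b)]
    return sum(changes) / len(changes)


# Hypothetical scores: background peaks are unchanged, subset peaks double in B.
background_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
background_b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
subset_a = [2.0, 4.0, 6.0]
subset_b = [4.0, 8.0, 12.0]

# Case 1: normalize on the one-directional subset only.
# The factor absorbs the real increase, so the subset appears unchanged.
f_subset = scale_factor(subset_a, subset_b)
subset_only = mean_log2fc(subset_a, subset_b, f_subset)

# Case 2: normalize on the full peak set, then report only the subset.
# The unchanged background anchors the factor, so the increase is preserved.
f_full = scale_factor(background_a + subset_a, background_b + subset_b)
with_background = mean_log2fc(subset_a, subset_b, f_full)

print(subset_only, with_background)
```

In this toy setup the subset-only run reports no net change, while the run normalized on the full set reports a uniform increase, which mirrors the one-sided shift of the volcano plot discussed above.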

andradejon commented 1 year ago

It helps a lot! Thanks!

loosolab / TOBIAS

RuntimeError: The data contains non-finite values. #199