Issues with Markov Chain convergence and generated data in pycbc inference

Tobias-Reike commented 1 month ago

Hello,

i have a few things which are troubling me with pycbc_inference. Ive been running it like the tutorial Simulated BBH example suggests, using pycbc_create_injections first and then running pycbc_inference.

My first issue with this was the reference frequency parameter. I set it to 0 first, since this is the default value in get_td_waveform() and ive been using it like this to generate data for my neural networks. So i wanted the parameter to match between my mcmc and NN. However whenever i set f_ref=0 pycbc_inference will not converge to the right BBH parameters, but to some values way off. This was true for at least 2 runs and interestingly the output values were very similar both times even though the injection parameters were different. When setting f_ref=f_lower=30 the mcmc converged nicely to the right parameters. So i guess i dont really understand the f_ref parameter, but it seems to have a strong effect on the run.

My second issue was with the data created by pycbc_inference. I had put in an own asd file to be used for noise like this: fake-strain-from-file = E1:asd_file.txt E2:asd_file.txt E3:asd_file.txt The asd_file is an asd i used during noise generation myself. I loaded another noise txt file and provided a length, df and lower freq in python and then i saved it as a txt: psd = from_txt('./ET_asd_comb.txt', length, df, f_lower=30, is_asd_file=True) I used this asd to generate data for my NN. Now when i use the asd in the mcmc and look at the data in the output file of pycbc_inference under ['data']['E1']['stilde'] and compare it to an event i generated for my NN, these events do not match. I can see very clearly that some frequencies below the f_lower=30 are prominent in the noise of the pycbc_inference data, that are not in my events (see plots, spec show mcmc data with significant values below 30Hz). This way i can not properly compare my mcmc posteriors with my NN posteriors. Am i doing something wrong with the asds?

data_comparison_small spec_small

cdcapano commented 1 month ago

@Tobias-Reike About the first issue, with f_ref, what waveform approximant are you using? Some of the approximants handle f_ref differently. Better yet, can you provide the full config file?

Second, can you explain better what you think the issue is? I don't quite follow. Were you expecting exactly the same simulated noise between two different programs, or is it that there seems to be more noise than anticipated at some frequency? If the former, I would not expect the same noise, since this is dependent on the pseudo-random number generator (PRNG) used and the seed provided. If the program you used to generate the data for the NN (= neural network?) uses a different PRNG or different seed than pycbc (which is likely), you'll have a different realization of noise. The best way to ensure you get the same realization is to generate a strain file using pycbc_create_strain, then feed that to both pycbc_inference and your NN.

The other thing to keep in mind is that the data saved to the pycbc_inference file is after a highpass filter has been applied (assuming you set that in your config file). That will cause some differences at low frequencies between the original data and what's saved.

Anyway, if you could provide more details about your config file, and what you're trying to compare to, then we can try to figure out what the issue is.

Tobias-Reike commented 1 month ago

Hello, sorry for being a bit unclear. Ive been using IMRPhenomXPHM. Ive put all the config files together below, as well as the shell script im running. I wanted to input the data from the pycbc_inference output into my neural Network. However this data has some components below 30Hz, which is what i wanted to show with the 2 plots. This will confuse my NN and hinder the inference for the NN. However i gave fake-strain-from-file=some_asd_file.txt into my data config file for pycbc_inference. This asd file has no components below 30Hz. So then i would have expected the data which i extracted from the pycbc_inference output, to also have no components below 30Hz. But it does, which i find weird given the asd_file. It is displayed in the blue strain curve and the spectrogram. In contrast the yellow strain data is the same event with colored gaussian noise, also using this asd_file, which i generated with get_td_waveform and colored_noise from pycbc.noise.reproducable. These 2 curves dont have the same frequency components. Of course they shouldnt be exactly the same with different seeds, but i did expect there to be the same frequencies. Is this assumption wrong for some reason? I honestly forgot about trying different high pass frequencies. I just left it at 25, because I figured that it shouldnt matter, if im already using an asd that is cut off at 30Hz and a f_lower = 30Hz in my static parameters. Should i just try a high pass of 30Hz and see what happens then? I would like to avoid redoing my data generation with some new set up like pycbc_create_strain, cause im a bit short on time as it is and i already have generated some rather large data sets before, that i dont want to discard. I have just recently looked into MCMC as a comparison for the neural network again and have not done the data generation for my NN with these issues in mind. Also thanks for answering so quickly.

full_config.txt mcmc.txt

cdcapano commented 1 month ago

@Tobias-Reike Sorry to request more things, but could you attach one of your ASD files?

Tobias-Reike commented 1 month ago

@cdcapano ive got the noise asd file i used in pycbc_inference as asd_file.txt below. I created it when i did my data generation. For this I loaded some publically available ET asd file (see link in code part, also uploaded it this file as ET_asd_comb.txt) with the pycbc.psd.read.from_txt function and wrote a new file from the loaded psd. It has a lot of zeros before 30Hz.

# noise psd: loaded the ET_D noise asd from https://www.et-gw.eu/index.php/etsensitivities, arXiv:1012.0908
# the oringinal file has both HF, LF and their combination. The loaded one only has the combination.
fs, duration = 2048, 5
df = 1 / duration
length = int(fs / 2 / df) + 1
noise_psd = from_txt(
    './ET_asd_comb.txt',
    length,
    df,
    low_freq_cutoff=30,
    is_asd_file=True
)

# example of how i wrote psd to asd_file.txt
f = open("asd_file.txt", "w")
for freq, power in zip(noise_psd.sample_frequencies, noise_psd**.5):
    f.write(f'{freq:.18e} {power:.18e}\n') # wrote it this way to match with the formatting of the ET_D file from the link above
f.close()

Just in case i also provided a python script illustrating how i made the comparison plot, where i opened the inference.hdf that was output by pycbc_inference, extracted the data from it, generated an event using get_td_waveform and colored_noise and plotted them both.

asd_file.txt ET_asd_comb.txt plot_script.txt

gwastro / pycbc

Issues with Markov Chain convergence and generated data in pycbc inference #4817