scottransom / presto

Open source pulsar search and analysis toolkit
http://www.cv.nrao.edu/~sransom/presto/
GNU General Public License v2.0
237 stars 176 forks source link

Prepsubband writes less samples than original file has. #191

Open Newtonlml opened 9 months ago

Newtonlml commented 9 months ago

Hi, I am using PRESTO to process some data and look for single pulses. The files have the following specifications: Number of frequency channels: 2048 time resolution: 0.01 seconds lenght: 298.68 seconds bandwidth: 1200-1800 MHz Number of samples: 29868

I realized that when using prepsubband prepsubband -nobary -lodm 300 -dmstep 10 -numdms 20 -nsub 2048 -runavg -o prep_output 2023-10-07_18_16_51_752602noise.fil the number of samples written is less than the original number of samples of the file

prepsubband_infoARTE prepsubband_ARTE

So when I use single_pulse_search.py to search for transients it doesn't find anything after the sample 24000. I lose ~6000 samples which correspond to one minute of my five minute files. I have processed data with a higher time resolution and I noticed this happens as well but due to the higher resolution the time lost is not as much as in my lower resolution case. For example for data with the following specifications:

Number of frequency channels: 512 time resolution: 54.613 microseconds length: 3599.99913984 seconds bandwidth: 1210-1510 MHz Number of samples: 65917953

I lose approximately 25000 samples which correspond to 1.5 seconds of an observation of one hour.

A "solution" to this would be to make the files larger so I dont lose a significant amount of data but still a full minute is a lot of time lost.

I have seen something similar in the PRESTO_search_tutorial.pdf where the samples written is less than the original ones.

prepsubband_pdf

This is taken from slide 11 and as you can see the total number of points is 531000 and the data written points is 530000

Is this an expected behaviour? I am not sure why this is happening. Is there a way to make it process the full file with its current resolution? Any indication is much aprecciated. :)

scottransom commented 9 months ago

Yes, this is expected behavior. And the quick reason why is that we are missing some data to be able to output the full-duration of the data because you are de-dispersing.

Since you are dedispering to a positive DM, you are assuming that the low-frequency data is arriving later than the high-frequency data. That's fine at the beginning of the observation where we can simply grab later samples at lower freqs to add them to the initial sample at the highest freq. But that doesn't work at the end of the observation where we have no low freq data recorded! So since we don't know what to add in order to properly dedisperse, we simply ignore parts of the original file where we don't have all frequencies we need.

This problem gets worse and worse (meaning you lose a bigger and bigger fraction of your input data) when 1) the observation duration is relatively short and 2) when the total amount dispersion (in time) is long between the top and bottom of the observing band.

In your case, for a DM of 300, the differential delay across the band (1200-1800MHz) is 0.48 seconds, and so you will always lose at least that amount of data.

However, there is another issue in that prepsubband (and prepdata) work on blocks of data of (by default) 2400 samples (see the Spectra/Subint number in the output) for filterbank data. So you will often get an integer number of samples of that duration out.

And finally, both of those commands automatically choose a highly factorable number of data points to output so that FFTs on that data will be efficient.

For your initial case you are seeing all of these effects. You can mitigate some of this (but you can't get back data you don't have because of dispersion) by specifying the number of output data points to use with the -numout option. If you specify -numout to equal the number of input data points, trying looking at the end of the resulting .dat file with the exploredat command and you will see how the noise in the data changes as we have less and less data available to do the full and proper dedispersion.

Hope this helps!

Newtonlml commented 8 months ago

Thank you very much for your response.

I understood your explanation and it makes sense. I tried your suggestion of looking into the .dat file but I don't see anything weird. Is there a way to predict how much data I will lose? Because like you said, with the DM and frequency range I can obtain the least amount of data that I will lose, in the example you mentioned it was ~0.5 seconds but I am losing around a minute of data. I assume this is due to the integer number of blocks that you mentioned. When executing prepsubband it says that for my original number of samples the "good number of samples" to work with is 30240 but the data written only gets to 24000. Is there a way to know how many blocks I will lose?

scottransom commented 8 months ago

Have you tried specifying -numout? And where does it say "30240"?

Newtonlml commented 8 months ago

Yes I have tried specifying -numout to be equal to the original samples. I changed the PRESTO version to a more recent one so now the outputs look like this: -prepsubband -nobary -lodm 300 -dmstep 1 -numdms 1 -downsamp 1 -nsub 2048 -runavg -numout 29868 -o prep_output 2022-09-10_11_15_25.404305.fil

image

And this is without the -numout (where the 30240 comes from):

image

As you can see, the data points written in both cases is 24000 and is padding the rest, here is a screenshot of exploredat of whren the -numout is specified but it looks the same when is not specified.

image
scottransom commented 8 months ago

One other request for you to try: can you try using prepdata to make a single time series rather than prepsubband? The latter is quite a bit more complicated (since it allows you to output mutiple time series at once), so there might be a bit of extra overhead there.

Newtonlml commented 8 months ago

I tried prepdata and it doesn't remove as much samples as prepsubband. Specifying -numout or not, it writes 28800 samples instead of the 24000 from prepsubband.

prepdata_output

This amount of data lost makes more sense giving the amount of samples per data block and the least amount of data I should lose in the process. So it seems that prepsubband may be removing more samples in the process to optimize the output of many time series? In that case prepdata seems to be the best command for my case. Thank you very much for your help :)