kotekan / kotekan

High performance radio data processing pipeline
http://kotekan.rtfd.io

Runtime error in BasebandReadoutManager #457

Open cubranic opened 5 years ago

cubranic commented 5 years ago

@kiyo-masui reported intermittent but fairly frequent runtime errors in basebandReadoutManager::get_next_ready_request().

The whole design of how BasebandReadout threads coordinate reading/writing of baseband data and the overall request queue needs to be reviewed.
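As a starting point for that review, here is a minimal sketch of one pattern that avoids a "No ready request" style failure: the check for a ready request and the claim of that request happen under a single lock, and "nothing ready" is reported as a normal return value rather than an exception. All names here (ReadoutQueue, DumpRequest, DumpState, try_claim_ready) are hypothetical and are not kotekan's actual API. Whether a check-then-claim race is really the cause of this crash has not been confirmed.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical stand-ins for illustration only; NOT kotekan's actual types.
enum class DumpState { Waiting, Ready, Writing };

struct DumpRequest {
    std::uint64_t event_id;
    DumpState state;
};

class ReadoutQueue {
public:
    // Register a new request; it starts out Waiting for its data to be copied.
    void add(std::uint64_t event_id) {
        std::lock_guard<std::mutex> lock(mtx_);
        requests_.push_back({event_id, DumpState::Waiting});
    }

    // Called once the data has been copied out of the ring buffer.
    // Returns false if no Waiting request has this event id.
    bool mark_ready(std::uint64_t event_id) {
        std::lock_guard<std::mutex> lock(mtx_);
        for (auto& r : requests_) {
            if (r.event_id == event_id && r.state == DumpState::Waiting) {
                r.state = DumpState::Ready;
                return true;
            }
        }
        return false;
    }

    // Called by a writer thread. Finding a Ready request and moving it to
    // Writing happen while holding the same lock, so two writers can never
    // claim the same request. If nothing is Ready (for example because
    // another writer got there first) this returns false instead of throwing.
    bool try_claim_ready(std::uint64_t& event_id_out) {
        std::lock_guard<std::mutex> lock(mtx_);
        for (auto& r : requests_) {
            if (r.state == DumpState::Ready) {
                r.state = DumpState::Writing;
                event_id_out = r.event_id;
                return true;
            }
        }
        return false;
    }

private:
    std::mutex mtx_;
    std::deque<DumpRequest> requests_;
};
```

A writer thread would call try_claim_ready() in its loop and simply retry (or wait on a condition variable) when it returns false, so losing a race to another writer is harmless instead of fatal.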

leungcalvin commented 5 years ago

Setup: kotekan running on the Pathfinder VLBI recorder with 32 streams, 8 frequencies in each stream, covering 1/4 of the band. Branch cl/numa-multifreq, commit 35af5c7e84.

Kotekan seems to crash about halfway through writing baseband dump files to NFS when the baseband dumps are 1 second long, but not when the dumps are 100 ms long.

(Note from Kiyo: the following is probably irrelevant to the crash and is a memory bandwidth issue.) During the 1 second dumps, we noticed that Kotekan experiences tens of percent packet loss while the dumps are being copied out of the ring buffer to memory. Packet loss then drops back to ~0%, but Kotekan crashes while it is writing to disk over NFS.

Traceback:
/baseband/baseband_12: Baseband dump for event 1564075399, freq 138 complete.
/baseband/baseband_12: After write_dump() for freq_id:138
/baseband/baseband_12: Before write_dump() for freq_id:266
/baseband/baseband_12: Writing baseband dump to /mnt/frb-baseband/kiyo_pathfinder//pathfinder_test_triggers/baseband_1564075399_266.h5
/baseband/baseband_6: Baseband dump for event 1564075399, freq 299 complete.
/baseband/baseband_6: After write_dump() for freq_id:299
terminate called after throwing an instance of 'std::runtime_error'
  what():  No ready request
Aborted

kiyo-masui commented 5 years ago

Update: setting write_throttle to 100, which slows down the write by a factor of roughly 3, seems to prevent the crash. We do not know why.

kiyo-masui commented 5 years ago

Another update: the above only makes the crash less likely; it does not prevent it.