orcasound / ambient-sound-analysis

This repository aims to hold code for UW MSDS capstone project analyzing ambient sounds in orcasound hydrophone data
MIT License

Missing wav Files #52

Closed zprice12 closed 5 months ago

zprice12 commented 6 months ago

When running the code below with the polling interval set to 600 (10 minutes) in the pipeline's generate_psds function,

Screen Shot 2024-02-05 at 5 10 07 PM

we produce the following wav files:

Screen Shot 2024-02-05 at 5 06 43 PM

No wav file is created for 11:20-11:30, which causes missing data in our parquet files. When running the same code with the polling interval set to 3600 (1 hour) in the pipeline's generate_psds function, we produce:

Screen Shot 2024-02-05 at 5 09 21 PM

This is an hour-long wav file that contains audio for the missing 11:20-11:30 segment. However, that is not obvious, because the file is labeled 11:30. I checked the S3 bucket to see whether the .ts files exist for that ten-minute span, and they do, for example https://s3.console.aws.amazon.com/s3/object/streaming-orcasound-net?region=us-west-2&bucketType=general&prefix=rpi_port_townsend/hls/1679488225/live2112.ts. I will look further into this issue this week to hopefully find its source.

ttan06 commented 6 months ago

Image

3_22_11-3_22_12-printstatements.txt

@zprice12 @vaibhavmehrotraml

Found a potential reason why the .wav files are missing: the .ts files aren't being downloaded, including your example, live2112.ts. The last file downloaded from the 1679488225 folder is live2097.ts, so nothing after that is downloaded. I imagine that once those are downloaded, the wav files should appear.

Image

Looking at folder 1679488225, the last 10 minutes of the folder seem to correspond to the 11:20-11:30 window, and the folder extends to 11:30:10. So I think the next folder (1679509823), which contains the 11:30-11:40 window, starts at ~11:30:23.
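The folder arithmetic above can be checked with a short sketch. It assumes the folder names are Unix epoch seconds marking when each folder's stream started, that the times quoted in this thread are Pacific Daylight Time (fixed UTC-7), and that each .ts segment is roughly 10 seconds long. None of these is stated explicitly in the thread, and `folder_start` and `segment_start` are hypothetical helper names, not part of the pipeline.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the thread's clock times are PDT (UTC-7) on 2023-03-22.
PDT = timezone(timedelta(hours=-7), "PDT")


def folder_start(folder_name: str) -> datetime:
    """Interpret an HLS folder name as Unix epoch seconds (assumption)."""
    return datetime.fromtimestamp(int(folder_name), tz=PDT)


def segment_start(folder_name: str, index: int, segment_seconds: float = 10.0) -> datetime:
    """Estimate when liveNNNN.ts begins, assuming fixed-length segments."""
    return folder_start(folder_name) + timedelta(seconds=index * segment_seconds)


for folder in ("1679488225", "1679509823"):
    print(folder, folder_start(folder).strftime("%Y-%m-%d %H:%M:%S"))

# Last segment that was downloaded vs. the example segment that was not.
print("live2097.ts", segment_start("1679488225", 2097).strftime("%H:%M:%S"))
print("live2112.ts", segment_start("1679488225", 2112).strftime("%H:%M:%S"))
```

Under those assumptions the second folder starts at 11:30:23, live2097.ts begins at about 11:19:55, and live2112.ts at about 11:22:25, which would be consistent with the downloads stopping just before the 11:20-11:30 window.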

zprice12 commented 6 months ago

Yeah, I noticed the same thing with those .ts files never being downloaded. Curious why it's happening here. I'm going to look at other missing wav file examples throughout the same day and see if there is some sort of pattern to this bug.
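One way to scan a whole day for this kind of gap is sketched below. It assumes the start time of each generated wav file has already been parsed and rounded to the polling interval; `missing_windows` is a hypothetical helper, and the sample start times are made up for illustration rather than taken from the pipeline's output.

```python
from datetime import datetime, timedelta


def missing_windows(starts, range_start, range_end, interval_seconds=600):
    """Return the window start times in [range_start, range_end) with no wav file."""
    step = timedelta(seconds=interval_seconds)
    present = set(starts)
    missing = []
    t = range_start
    while t < range_end:
        if t not in present:
            missing.append(t)
        t += step
    return missing


# Hypothetical wav start times for one hour, with 11:20 absent.
day = datetime(2023, 3, 22)
found = [day.replace(hour=11, minute=m) for m in (0, 10, 30, 40, 50)]
gaps = missing_windows(found, day.replace(hour=11), day.replace(hour=12))
for gap in gaps:
    print("missing window starting at", gap.strftime("%H:%M"))
```

Running this over each day and comparing the gap times against the folder boundaries would show whether the missing windows cluster near the end of each folder.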

ttan06 commented 6 months ago

So when I set the datetime in the generate_parquet_file function to start exactly at 11:20 and end at 11:30, the function works and gets the data.

Image

But when I start at 11:20 and end at 11:40, it skips the 11:20 data and just runs 11:30-11:40.

Image

And starting at 11:10 and going to 11:40, it skips over as usual, and fills in the first 5 minutes with 11:19 and the last 5 minutes with 11:30.

Image

scottveirs commented 6 months ago

Valentina's thought during today's call: this is reminiscent of an observation made by Molkree. In some cases the .ts segments are present but not listed within the .m3u8 manifest (often near the end of a daily archive folder). In others, the .ts segments are missing or not of the expected file size.

The former is likely an occasional bug or edge case in ffmpeg within the orcanode code running at each hydrophone location.

The latter may be a bug in the hls-utils package... @ttan06 says it typically occurs around the ~2200th .ts segment, possibly near the edge of the archive folders (6-hour or 24-hour)? @vaibhavmehrotraml suspects a greater-than vs. greater-than-or-equal-to error.

zprice12 commented 5 months ago

The issue is caused by how FFmpeg creates the .m3u8 manifest files in each of the S3 folders. The manifest files are often missing the last minute of .ts files, which causes the code to jump to the next folder too soon. We fixed the issue temporarily by manually adding the missing .ts files to the manifest for one of the folders, which solved the problem. However, this would need to be done for all of the existing manifest files to fully fix the issue.
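To find which folders need this manual fix, the manifest can be compared against the .ts objects actually present in the folder. The following is a minimal sketch, assuming the manifest text and the folder's object keys have already been fetched (the S3 listing itself is not shown). The helper names and the sample manifest are hypothetical, and segment names are assumed to follow the liveNNNN.ts pattern seen in this thread.

```python
import re


def listed_segments(manifest_text: str) -> set:
    """Collect segment URIs from an .m3u8 playlist (non-blank lines not starting with '#')."""
    lines = (line.strip() for line in manifest_text.splitlines())
    return {line for line in lines if line and not line.startswith("#")}


def segment_index(name: str) -> int:
    """Extract NNNN from a name such as live2112.ts (assumed naming pattern)."""
    match = re.search(r"(\d+)\.ts$", name)
    return int(match.group(1)) if match else -1


def unlisted_segments(manifest_text: str, keys_in_folder) -> list:
    """Return .ts files that exist in the folder but are absent from the manifest."""
    listed = listed_segments(manifest_text)
    names = [key.rsplit("/", 1)[-1] for key in keys_in_folder if key.endswith(".ts")]
    return sorted((n for n in names if n not in listed), key=segment_index)


# Hypothetical, shortened manifest and folder listing for illustration only.
manifest = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
live000.ts
#EXTINF:10.0,
live001.ts
#EXT-X-ENDLIST
"""
keys = [
    "rpi_port_townsend/hls/1679488225/live.m3u8",
    "rpi_port_townsend/hls/1679488225/live000.ts",
    "rpi_port_townsend/hls/1679488225/live001.ts",
    "rpi_port_townsend/hls/1679488225/live002.ts",
]
print(unlisted_segments(manifest, keys))
```

A non-empty result for a folder would flag it as one whose manifest stops short of the segments that are actually stored.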