thomas-xin / Audio-Image-Converter

Converts audio to and from a square image representing the time-frequency domain. Achieves close to a 50% compression ratio on most files with almost completely transparent reconstruction, especially in HSV mode.
GNU General Public License v3.0

FileNotFoundError: [WinError 2] The system cannot find the file specified #1

Closed (ghost closed this issue 2 years ago)

ghost commented 2 years ago

Hey, just trying out your code. I'm running into this error after entering a file at the "Please input a filename or URL:" prompt. This is Python 3.9 on Windows 10, using PowerShell.
I tried dragging a wav file in, as well as typing the location.

ghost commented 2 years ago

Nevermind! I didn't have ffmpeg in my PATH.

Second question I have though is, is there a way to edit the code to make the output grayscale instead?
Even if the output image would be extra big?

thomas-xin commented 2 years ago

> Nevermind! I didn't have ffmpeg in my PATH.
>
> Second question I have though is, is there a way to edit the code to make the output grayscale instead? Even if the output image would be extra big?

Alright, to address the second question: there is a slight problem with implementing this. In the original image, both saturation and lightness are used for amplitude, and hue is used for phase. Converting to a greyscale output would mean phase would need to be stored differently, since both saturation and hue would be gone, and lightness alone is not enough to hold information for both channels. I could investigate how easily it could be implemented with a second layer on the image or similar, but for the most part one cannot simply make a larger greyscale image to store the same data; so much information capacity is lost that the algorithm's fundamentals no longer work.
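(For illustration only: a minimal sketch of that hue-for-phase, brightness-for-amplitude idea, not this repo's actual code. The fixed `SCALE` constant is an assumption, made to avoid per-frame normalisation.)

```python
# Hypothetical sketch (not this repo's actual code): one real audio frame
# mapped to HSV pixels, with hue = phase and value = magnitude per FFT bin.
import numpy as np

SCALE = 256.0  # assumed fixed magnitude scale, avoids per-frame normalisation

def frame_to_hsv(samples):
    bins = np.fft.rfft(samples)
    hue = (np.angle(bins) / (2 * np.pi)) % 1.0     # phase -> hue in [0, 1)
    val = np.clip(np.abs(bins) / SCALE, 0.0, 1.0)  # magnitude -> value
    sat = np.ones_like(val)                        # saturation left at 1 here
    return np.stack([hue, sat, val], axis=-1)

def hsv_to_frame(hsv, n):
    hue, val = hsv[..., 0], hsv[..., 2]
    bins = val * SCALE * np.exp(2j * np.pi * hue)  # rebuild complex bins
    return np.fft.irfft(bins, n)
```

Drop either hue or value from this mapping and one of phase or magnitude is lost, which is the greyscale problem described above.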

ghost commented 2 years ago

Fascinating... and sorry, I totally misread the program description, I see. I have been trying out https://github.com/Eiim/JAEG, which is similar but grayscale, but when I take a picture and load it back in, any slight rotation of the image messes up the sound. I did try loading a grayscale version of an output PNG created by your program back in and got some "results" back, but the sound wasn't too audible... and now I know why haha! Cropping seems to work fairly well though, with some interesting results.
Going off track here... I can close this and reopen with this different question if that's preferable.

thomas-xin commented 2 years ago

You could open this as a new issue if you'd like; I don't particularly mind, I make my programs as a hobby lol.

I took a look at the repo you linked, and what appears a little strange to me is... it doesn't actually seem to use an FFT at all, it just directly encodes the PCM data into the image as brightness values. So the JPEG conversion, which has its own compression using MDCT, would presumably compress the image in a similar way to MP3, except maybe slightly less efficiently? I'm not familiar with JS, so if I got anything wrong please let me know, but it appears not to use any time-frequency representation. It also appears to normalise the output, forcing the peak to be at 100% volume even if the original audio was softer, which is something I managed to avoid in my audio conversion.

A circular spectrogram could work in theory, but being able to get a working reconstruction at any rotation would require rotational symmetry (just constructing a circular shape won't work; rotating it would still mess up all the pitches), so the data would need to be duplicated in 4 directions (assuming you only allow 90-degree rotations) as well as in the 3 colours, making it at least 12 times as wasteful as the current system I'm using. However, it would technically be doable. A circular spectrogram also presents the problem of having less space to store data near the centre and more in the corners, making it difficult to design an algorithm that can take advantage of such a configuration, unlike a square or rectangle, which can easily be filled with PCM or FFT data. One configuration with rotational symmetry I can potentially see working would be a sort of swastika shape with repeating square-shaped spiral arms, but that would be quite messy to encode/decode, and I'm not sure how well it'd compress in JPEG or other formats. This is an interesting idea nonetheless 🙃
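For reference, here is roughly what I mean by direct PCM-as-brightness encoding; a minimal sketch of my reading of JAEG's approach, not its actual code:

```python
# Sketch of direct PCM-as-brightness encoding (my reading of the linked
# repo's approach, not its actual code): each sample becomes one grey pixel.
import numpy as np

def pcm_to_image(samples, width):
    grey = np.round((samples + 1.0) * 127.5).astype(np.uint8)  # [-1,1] -> [0,255]
    pad = (-len(grey)) % width
    grey = np.pad(grey, (0, pad))      # zero-pad the final row
    return grey.reshape(-1, width)

def image_to_pcm(img):
    return img.astype(np.float32).ravel() / 127.5 - 1.0
```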

thomas-xin commented 2 years ago

One thing I can perhaps propose to optimise the rotational symmetry problem would be to have an indicator within the image telling the program which direction is up/down, so the program can reorient the image by itself when that occurs. This would mean that a circle or spiral shape would no longer be needed.
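A minimal sketch of that indicator idea, assuming a reserved sentinel colour in one corner (nothing here is implemented in this repo):

```python
# Hypothetical orientation indicator: one sentinel corner pixel lets the
# decoder undo any 90-degree rotation before reading the data.
import numpy as np

MARK = np.array([255, 0, 255], dtype=np.uint8)  # assumed sentinel colour

def stamp(img):
    img = img.copy()
    img[0, 0] = MARK                 # top-left corner marks "up"
    return img

def reorient(img):
    for _ in range(4):
        if np.array_equal(img[0, 0], MARK):
            return img
        img = np.rot90(img)          # try the next 90-degree rotation
    raise ValueError("orientation marker not found")
```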

thomas-xin commented 2 years ago

Oh, so... your aim would be to make a visual representation of audio that can then be captured on a camera and reconstructed, similar to data on, say, a QR code? I can't say I'd be the best person for a job like that, since I'm not familiar with image recognition or with translating a rotated and potentially perspective-stretched copy of an image back to the original, but I could potentially give it a shot sometime.

ghost commented 2 years ago

I'm not sure there has to be image recognition or indicators at all though; I think it might be possible without them? I'm just not sure how haha.

hucario commented 2 years ago

> Oh, so... your aim would be to make a visual representation of audio that can then be captured on a camera and reconstructed, similar to data on, say, a QR code?

I mean, it doesn't have to be just similar to a QR code. It could literally be a QR code. The problem there is storage space.

ghost commented 2 years ago

Yes, I was thinking of something more like a spectrogram as opposed to QR; high-density QR like Piql is a thing, but I'm more curious about an analog way of doing it. I feel like a 360-degree FFT is possible, but I have zero proof!

thomas-xin commented 2 years ago

Right, the problem with QR codes is that they further reduce the amount of information one can store, by having only pure black and white pixels. The full RGB/HSV spectrum I use in my current program allows up to 16777216 different colours per pixel, compared to only 256 for a greyscale image, or just 2 for pure black and white.

The reason QR codes do this, however, is that they require some degree of image recognition to read. If the camera has any roll rotation, the resulting image will be rotated. That is somewhat easily correctable, but the camera may also be tilted on its pitch or yaw axes, which stretches the image in a sort of perspective transform, where one end is wider than the other, in a trapezium shape. With a large image it is virtually impossible to avoid this even slightly when capturing a photo, and such an error would need to be detected and corrected.

On top of that, there is the issue of lighting. Depending on the environment, one side of the image could be brighter than it should be, which would compromise the 256 possible brightness values and make the image very sensitive to disturbance. Hence, again, why QR codes only use 2 colours per pixel.

I'm not confident I'll be able to make a system that can recognise audio from a photo. I could give it a shot, even if just with QR codes, but it would not be stable for very long samples of audio, since the image would eventually become too large and complex to be captured in a reconstructable way.
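To be clear, the correction itself is mechanically simple once the four corners of the data region have been found; finding them reliably is the hard part. A sketch with OpenCV, where `corners` is assumed to come from some separate marker-detection step:

```python
# Sketch: undoing a trapezium-shaped perspective stretch with OpenCV.
# `corners` (TL, TR, BR, BL, as seen in the photo) is assumed to come
# from a marker-detection step that is not shown here.
import cv2
import numpy as np

def unwarp(photo, corners, size):
    target = np.float32([[0, 0], [size, 0], [size, size], [0, size]])
    matrix = cv2.getPerspectiveTransform(np.float32(corners), target)
    return cv2.warpPerspective(photo, matrix, (size, size))
```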

ghost commented 2 years ago

QR codes also certainly cannot be cropped/zoomed in on, or do I have that wrong?
Yes, there is the lighting issue as well. But I think with that optical sound link above (paper sound) the lighting does not matter as much, so I'm not sure how that is done, but I think grayscale works whereas color does not... or well, it can, but from what I've read it is super duper complicated, anyways.
There is also http://stevehanov.ca/wavelet/ (wavelets), but this is pretty beyond me; it seems like you can crop/zoom part of the image and get that section of sound back. No rotation though (not sure). I do think having it start in the middle makes sense, only because when you crop an image you start from the outside and work your way to the middle anyways. Thanks for hearing my crazy ideas! 👍

thomas-xin commented 2 years ago

I believe the reason QR codes no longer work if you crop out the sides is that they need the three black squares in the corners, which tell the reader where the data ends and which way the image should be oriented. The wavelet transform looks very interesting, particularly because I've actually wanted to look into wavelet transforms for spectrograms but never quite understood the maths behind them. Simply cropping an image works fine, since you'd just be taking a portion of the original; but taking an actual photo presents other losses, like the lighting issues and, more importantly, differing pixel densities caused by any perspective transform. Because we'd be working with very dense data, it would be extremely sensitive to any disturbance. I'm actually slightly skeptical that the other algorithm you linked works on a photo, but I could be wrong. Either way, it would be pretty rough to try to store data this way. It would be like taking a photo of a record or CD and attempting to reconstruct the audio from that: audio data is simply too dense, and the latter requires literal laser precision to read.

ghost commented 2 years ago

> I'm actually slightly skeptical the other algorithm you linked works on a photo, but I could be wrong.

Oh, the JPEG one? It barely works, mostly because that definitely wasn't what it was intended for, but I was able to hear some faint audio, extremely distorted. That was reading it off of paper. It sounds like this: https://www.dropbox.com/s/0go17l5xau0f7zd/scaled.wav?dl=0

I haven't had the chance to try the paper optical sound one yet, but I think optical sound should be the way to go; the issue seems to be more how the data is interpreted and not how it's created. One thing I had thought of was to have each frequency band as a variable-density strip, but that still doesn't really solve the rotation issue (it solves the cropping issue though). You could crop inwards and lose the beginning and end of the track, but I don't think that would matter so much, and might even be kinda cool.

I tried to understand wavelets as well, but I can barely understand regular spectrogram stuff. Differing pixel densities in high-density color would be even worse, from how I understand it. I think binary naturally makes sense, but grayscale would certainly be a one-up. Would you say the data should exit the center though? Like time going outwards? I can't seem to picture it any other way. Thanks for looking into this!

Here is also a circular spectrogram code I found: https://editor.p5js.org/rjgilmour/sketches/aY_c4k3j2

thomas-xin commented 2 years ago

Assuming the data is stored as a disc shape, I would not put the audio through an FFT or other transform. The result, amplitude and frequency over discrete time intervals, is simply too much data to express in a section of an image that must take random losses and still be readable. The discrete time intervals are the real problem: a few more pixels being read in a section than intended results in a severe desync that compromises every single segment of audio afterwards, whereas with PCM data it would simply result in a basically unnoticeable slight stutter and delay.

You can actually try this with my program. Without lossy compression on the image, it is able to reconstruct audio at almost 1:1 transparency. But if you make the image a few pixels wider in the middle, and either dump garbage data into the new section or resize another section to fit, you'll compromise not only that section but all of the audio afterwards.

Of course, high-density colour is a lot more likely to lose data than greyscale, but I think the key point is to account for the differences themselves, rather than the way those differences are expressed. If, say, we used RGB channels at intervals of 4, so the legal colours were 000000, 000004, 040404, FCFCFC, FCF8FC etc., that would give us 64^3 = 262144 possible colours per pixel, far higher than the 256 for greyscale, while also tolerating slightly greater differences in contrast; the difference between FEFEFE and FFFFFF in greyscale is not really more noticeable than between full RGB values at intervals of 4 (a quick sketch of this follows at the end of this comment).

As for the circular spectrogram code, that looks great and all, but we have to keep in mind that we're supposed to reconstruct the audio afterwards. Something like that simply would not do, as it completely discards the phase data (which is instead expressed as hue in my program), meaning any reconstruction would have popping noises literally everywhere due to the desync of waveforms; the way the data is arranged also limits the amount of space you have for the audio. There is the additional problem that if you rotate the final image, the program no longer has a sense of where to begin and end; and as with all frequency-time reconstructions, the slightest few extra pixels in the input's width (unavoidable when it's taken as a photo) would desync the entire thing and mess up all the frequencies. Reconstructing from FFT has much lower fault tolerance than reconstructing from PCM, period.
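Going back to the interval-of-4 idea, a quick numpy sketch (the numbers come from the comment above; the round-to-nearest decode rule is my assumption):

```python
# RGB channels restricted to multiples of 4: 64 levels per channel,
# 64**3 == 262144 colours per pixel, and any capture error under half
# the step size rounds back to the correct level on decode.
import numpy as np

def quantise(rgb):                    # encode side: snap to legal levels
    return (np.asarray(rgb, np.uint16) // 4 * 4).astype(np.uint8)

def requantise(rgb):                  # decode side: round to nearest level
    levels = (np.asarray(rgb, np.uint16) + 2) // 4 * 4
    return np.minimum(levels, 252).astype(np.uint8)
```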

ghost commented 2 years ago

Hmm I see, thanks for the explanation. Is there a way the data could be coded semi-randomly, like some form of steganographic noise? Is it possible to have the beginning as a defined "area" as opposed to a single point? Stored in concentric bands, like a bullseye? What I mean is low frequencies in the middle, high at the outside.
I figured the differences in lighting conditions would be much harder to handle at high-density color as opposed to grayscale; at least in optical sound history it seems it's all grayscale.
Maybe you're right that there need to be markers contained in the FFT to tell what angle the image is at and where you are in the image, but that sounds very complicated...

https://www.ceremade.dauphine.fr/~waldspurger/wavelets_phase_retrieval.html
I had found this a while back. There used to be images, and if I'm remembering correctly, one of the scalograms looked like a circle, but I could be misremembering, or maybe that's not even how the sound was organized. :D

thomas-xin commented 2 years ago

Coding the data semi-randomly only works for human visual perception. Our eyes are trained to recognise shapes in images even if they aren't fully formed. Our ears, however, require very precisely constructed and aligned waveforms to hear a sound as a musical note. Slight disturbances will completely throw off the brain; it's already doing a ton of work, essentially putting the raw waveform through a wavelet transform and analysing overtones/harmonics so we can hear pitches and tone colours! The thing is, we don't hear audio in 3 frequency bands like we see light. We hear audio at very, very precise frequency intervals, and that means we can't disturb sound much if we're still to reconstruct it in a way we'll hear as the original. This is why video formats like MP4 are able to compress video so well (around 1/300~1/500 of the original raw data size), whereas even the most sophisticated audio formats can only get the size down to 1/6~1/8 of the raw PCM size before the human ear starts picking up obvious differences. Add on potential losses in every category (rotation, stretching, distorting, brightness differences, blurriness) with an incredibly high data density, and you get yourself an almost impossibly difficult task. I would assume this is why such a feat has never really been successfully attempted.

ghost commented 2 years ago

I definitely came to the right place haha! I love learning about this stuff, thank you. But at least it's almost impossibly difficult and not completely! Amiright or amiright?
Well, I think stretching doesn't matter as much as rotation; I feel like stretched/warbly or sped up/slowed down outputs are almost a feature as opposed to something to get rid of. Distortion of the sound would be nice to keep low, though. Blurriness should be minimal if the sound is of a certain smaller length/of lesser data, and the optics just seem to be getting better. I'm an optimist!
I had assumed brightness would be easier with grayscale, at least for FFT, but that's just what I had been told by other people.

thomas-xin commented 2 years ago

That's perfectly fine! I appreciate someone wanting to talk with me about these things haha. In terms of stretched inputs though: if we were to stretch part of a PCM output, sure, we'd get slightly slowed audio that's barely noticeable. But as I mentioned previously, if you stretch part of an FFT representation that requires discrete time intervals, you desync every audio segment afterwards, sending their frequencies completely out of order. The higher frequencies would be read as lower frequencies, and the audio would be almost, if not completely, unrecognisable. I personally would say full colour would be more difficult to implement than greyscale, but more efficient. But yeah, the biggest problem with FFT representations is that they have very low tolerance for error. This is why records store the waveform directly in the time domain: if something goes wrong, it only goes wrong in that one place; it doesn't corrupt audio across the entire disc.
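The difference is easy to reproduce in a toy experiment (plain numpy, nothing from this repo): three stray values barely dent raw PCM, but scramble every frame after the insertion point in a frame-packed FFT stream.

```python
# Toy desync demo: insert three stray values into raw PCM vs. into a
# stream of frame-packed spectra, then decode with rigid frame boundaries.
import numpy as np

RATE, FRAME = 8000, 256
t = np.arange(RATE) / RATE
pcm = np.sin(2 * np.pi * 440 * t)                 # one second of A4

# PCM: three extra samples = a sub-millisecond hiccup, nothing more.
pcm_bad = np.insert(pcm, 1000, [0.0, 0.0, 0.0])

# FFT: pack spectra end to end, insert the same three values, then decode
# assuming fixed FRAME-sized segments. Every frame after the insertion is
# read misaligned, so all later audio comes out scrambled.
spectra = np.concatenate([np.fft.fft(pcm[i:i + FRAME])
                          for i in range(0, RATE - FRAME + 1, FRAME)])
bad = np.insert(spectra, 1000, [0.0, 0.0, 0.0])
decoded = np.concatenate([np.fft.ifft(bad[i:i + FRAME]).real
                          for i in range(0, len(bad) - FRAME + 1, FRAME)])
```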

ghost commented 2 years ago

Ohh, I understand now. That is quite a problem, I see. Is there no way to clump all the low frequencies in the middle and the high frequencies on the outside? Like clump everything beforehand, then unclump after the image has been taken, steganographic in a way? That probably makes no sense. I had figured it was a lower-res image = lower-res audio, higher-res image = higher-res audio type of thing.

thomas-xin commented 2 years ago

Oh, I mean sure, you could do something like that, but again, the problem with FFT is not noise, it's desync. Separating high and low frequencies might mitigate the issue slightly, but any fault in the time domain would still cause all future frequencies to be completely scrambled. Maybe you could avoid the highest frequencies being shifted into the lowest, but you'd still have all the frequencies out of order, which would not make for very good audio reconstruction. On top of that there is again the problem of phase reconstruction, where any disturbance/desync would add a ton of popping noises on top of everything. FFT is good for storing data, as well as compressing data in a lossy way. What it's not good for is storing data in a way that can be easily retrieved if further random losses occur.

ghost commented 2 years ago

Gotcha. Sorry you had to repeat yourself, it's hard to wrap my head around this one. So I guess the question is, what orientation causes the least desync for FFT, or would every orientation cause roughly the same amount of desync since the alignment is always off? Can FFT audio data account for misalignment, or are you saying that is impossible? I imagine https://www.dropbox.com/s/ekglkfvphriohah/ModernArt.png?dl=0 (being variable-density cubes) would pose issues?
I'm just thinking maybe breaking it into small cubes or clumps of some sort would make sense.

thomas-xin commented 2 years ago

Actually, now that I think about it, what one would need to solve the desync problem with FFT is an indicator for every segment of audio... and that is actually solved if one arranges the packets in square blocks as you mentioned! I'm not sure how placing it in squares would work for rotation though, since rotating the whole image would disrupt the order of the squares. Maybe one could use the centre square for alignment, to indicate which direction is up/down?
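A rough sketch of that per-segment indicator (the block size and header layout are arbitrary assumptions):

```python
# Hypothetical per-block sequence header: each square block of pixels
# starts with its own 16-bit index, so a decoder can resynchronise after
# stray pixels instead of desyncing everything that follows.
import numpy as np

BLOCK = 16  # assumed block edge in pixels

def pack_block(index, payload):
    assert payload.size <= BLOCK * BLOCK - 2
    block = np.zeros(BLOCK * BLOCK, dtype=np.uint8)
    block[0], block[1] = index >> 8, index & 0xFF  # 16-bit block index
    block[2:2 + payload.size] = payload
    return block.reshape(BLOCK, BLOCK)

def read_index(block):
    flat = block.ravel()
    return (int(flat[0]) << 8) | int(flat[1])
```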

ghost commented 2 years ago

Yeah, that's exactly what I was thinking, a center alignment square with the necessary info, could even be binary!... I guess there likely isn't another way. I'm trying to think through this whole "generalized area" idea, but that would use so much extra data; my brain is already on full steam lol.

thomas-xin commented 2 years ago

I mean I could give it a shot sometime if that's something you'd wanna see being tried out, no promises though :P

ghost commented 2 years ago

Amazing. Prepare the clumper!
I'm gonna think if there are other ways that don't involve an alignment square in the middle, but I doubt I'll find an answer inside myself (so poetic). Thanks for the chat so far!

ghost commented 2 years ago

What about https://github.com/xiaoyu258/GeoProj? From the paper: "Our system is trained to correct barrel distortion (B), pincushion (Pi), rotation (R), shear (S), perspective (P) and wave distortion (W)."