ivrit-ai / ivrit.ai

ivrit.ai codebase
MIT License

Knesset Plenum protocols datasource processing #38

Closed yoadsn closed 1 month ago

yoadsn commented 1 month ago

This PR contributes 2 parts and modifies 3 others:

yoadsn commented 1 month ago

Thanks for the feedback!

transcribe changes

Let me give some background on this, which may save you some questions. The diff looks large because I moved some code to the top of the file (I follow the style of defining functions before they are used by later functions).

Since transcription is an overnight run for me, I implemented a "state run" using Pandas that periodically dumps to a parquet file. At the end, I repackage the result as a JSON file compatible with the existing structure. This makes the process more robust to crashes. Personally, I would always use something like parquet or HDF5 files to capture all artifacts, but of course I would not promote such a change unless you really want to go down that path.

The code uses the OpenAI-compatible API while keeping the ability to use the "homegrown API", in case we need that backward compatibility initially. I also encapsulated the transformation of the API response into the format the current implementation dumps, so the file contents stay compatible with older transcription artifacts.

Finally, I introduced an optional "sleep" between transcription batches so the machine does not overheat and run its fans all night. Again, this is optional and purely a convenience.
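For readers skimming the thread, the checkpointed run described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the names `run_with_checkpoints`, `save_parquet`, and the `source`/`text` columns are hypothetical, and `transcribe` stands in for whichever API client is used.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable


def save_parquet(rows: list[dict], path: str) -> None:
    """Dump all rows collected so far to a parquet file.

    Assumes pandas plus a parquet engine (pyarrow or fastparquet) is installed.
    """
    import pandas as pd  # imported lazily so the loop itself has no hard dependency

    pd.DataFrame(rows).to_parquet(path, index=False)


def run_with_checkpoints(
    items: Iterable[str],
    transcribe: Callable[[str], str],  # hypothetical: audio path -> transcript text
    done: dict[str, str],              # state recovered from a previous run, if any
    checkpoint: Callable[[list[dict]], None],
    batch_size: int = 10,
    sleep_seconds: float = 0.0,
) -> dict[str, str]:
    """Transcribe pending items in batches, checkpointing after every batch."""
    # Skip anything already present in the recovered state.
    pending = [item for item in items if item not in done]
    for start in range(0, len(pending), batch_size):
        for item in pending[start:start + batch_size]:
            done[item] = transcribe(item)
        # Persist the full state so a crash loses at most one batch of work.
        checkpoint([{"source": k, "text": v} for k, v in done.items()])
        # Optional pause between batches (not after the last one).
        if sleep_seconds > 0 and start + batch_size < len(pending):
            time.sleep(sleep_seconds)
    return done
```

A real run would pass something like `lambda rows: save_parquet(rows, "state.parquet")` as `checkpoint` (the file name is illustrative) and convert the final state to the existing JSON structure once the loop finishes.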

process_streaming

Sounds good. I will take process_streaming as the base, since it contains other things I think are worth keeping. I will remove the sliding-window approach and instead add a preprocessing step that generates the file using ffmpeg, so that no mono conversion or down-sampling is required afterwards. Once I have that, we can re-review and see whether more code can be removed. The end game is to replace "process" with the final iteration of "process_streaming". Let me know if this works. I will start on it as soon as I get more keyboard time.
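As a rough idea of what such a preprocessing step could look like, here is a minimal sketch that shells out to ffmpeg to produce a mono, down-sampled file up front. The function names and the 16 kHz target rate are assumptions for illustration only; the thread does not state the actual parameters.

```python
from __future__ import annotations

import subprocess


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that writes a mono, resampled copy of src to dst.

    The 16 kHz default is an assumption, not a value taken from the PR.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,
    ]


def preprocess(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
```

With the file prepared this way, the later processing stage can read the audio as-is instead of converting channels or resampling on the fly.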

Formatting Using Black

Done (For files I have touched/created)

yoadsn commented 1 month ago

@yairl Please note "process_streaming" changes done.

yoadsn commented 1 month ago

@yairl I believe everything we covered is now done. Let's give it another shot, and once we are good we can merge this in.