ve1ld / vyasa

a home for the timeless fables, hymns and the traditions that bind us as a people
https://vyasa.tv

[Operation]: Chalisa Intermediate Representation (Timestamp to Source + Foundation Indexer) #21

Closed ks0m1c closed 9 months ago

ks0m1c commented 10 months ago

Present Context

Helping words... Where we collectively investigate and interrogate the problem space and iteratively scope our approach. Break the work down into landmarks that communicate the shared context we are working towards, using a two-tiered task list whose elements are CRUD-ed as development unfolds. Strike at the scope of code that reveals the most about the problem/solution FIRST, not necessarily the easiest or hardest parts.

Groundwork

Helping words... Introduce us to the problem space. Write out what you already know about the terrain: you are the recce commander enriching us with details beyond the fog of war. Where have you tried applying this and encountered difficulties? How have others attempted to scale or explore these challenges? (Embed internal and external links to related or possible paths of exploration: Stack Overflow, documentation, GitHub, etc.) Who should be notified? Emphasis on previous or current practice to discover what is ugly, missing, or unnecessary.

Downloading the auto-generated YouTube transcriptions on which we build.
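For reference, a minimal sketch of one way to do this download via yt-dlp's Python API; the video URL, language code, and output path are placeholders, and the exact option set is an assumption rather than what was actually used.

```python
# Sketch: fetch only the auto-generated caption track for a video (assumes yt-dlp is installed).
# The URL, language code, and output template below are illustrative placeholders.
import yt_dlp

def download_auto_captions(video_url, lang="hi", out_dir="captions"):
    opts = {
        "skip_download": True,         # we only want captions, not the media itself
        "writeautomaticsub": True,     # grab the auto-generated track
        "subtitleslangs": [lang],
        "subtitlesformat": "json3",    # JSON events that keep the timing data
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([video_url])

if __name__ == "__main__":
    download_auto_captions("https://www.youtube.com/watch?v=<video-id>")
```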

Reflection

Helping words... Where the eternal wheel returns to practice and what we finally implemented is outlined. You are the historian or archivist bringing clarity to future-yous and to us about your foray. Emphasis on approaching timeless solutions for a well-defined problem space through distillation: decanting that which is unneeded and abstracting that which is essential. Add any reflections and internal links to future potential and blind spots.

rtshkmr commented 10 months ago

notes:

  1. IR:

    • can just be JSON for now; the caption data just needs to be pre-processed and stored as an IR (see the sketch below)
  2. Indexer is more of an intro to Ecto
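Purely as an illustration of "can just be JSON for now", one possible shape for an IR entry is sketched below; the field names are assumptions, not a settled schema.

```python
# Illustrative only: one possible record shape for the JSON IR.
# Each entry pairs a caption span's timing with its text, plus a slot for the
# verse it eventually maps to (filled in by the later mapping step).
ir_entry = {
    "start_ms": 13_240,   # caption event start, in milliseconds
    "end_ms": 17_980,     # caption event end
    "text": "जय हनुमान ज्ञान गुन सागर",  # caption text for this span
    "verse_id": None,     # assigned once the verse-to-timestamp mapping is done
}
```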

rtshkmr commented 9 months ago

✅ [DONE] Creating .srt from the caption data

Did up a script for .srt creation. Input file for the script: chalisa.json. Output file for the script: (see commit).
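The actual script is in the commit referenced above; the sketch below only illustrates the kind of conversion involved, assuming the caption JSON follows YouTube's json3 layout (an "events" list with tStartMs, dDurationMs, and segs), which may not match the real input exactly.

```python
# Sketch: convert json3-style caption events into an .srt file.
# Assumes keys tStartMs, dDurationMs and segs/utf8; the real script may differ.
import json

def ms_to_srt_time(ms):
    """Format milliseconds as an SRT timestamp, e.g. 00:01:02,345."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def json3_to_srt(in_path="chalisa.json", out_path="chalisa.srt"):
    with open(in_path, encoding="utf-8") as f:
        events = json.load(f)["events"]

    blocks, index = [], 1
    for event in events:
        segs = event.get("segs")
        if not segs:                      # some events carry no text segments
            continue
        text = "".join(seg.get("utf8", "") for seg in segs).strip()
        if not text:
            continue
        start = event["tStartMs"]
        end = start + event.get("dDurationMs", 0)
        blocks.append(f"{index}\n{ms_to_srt_time(start)} --> {ms_to_srt_time(end)}\n{text}\n")
        index += 1

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(blocks))
```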

Observations about segments:

  1. A negligible minority of the events won't have segments (only the first one, it seems).
  2. [the text has instrumentals] There are text-events that just say "[संगीत]" ("[music]"), which marks instrumental music. I'm guessing similar markers will appear in other languages as well. If there's a need to filter these out, we should be able to do so by just filtering away whatever is within [] (see the sketch after this list).
  3. Some events will overlap when displaying captions, which is why they are not chronologically unique segments.
  4. The error part seems a little non-trivial; I don't have an algo in mind for the mapping yet...
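A minimal sketch of the bracket-filtering idea from observation 2; it assumes markers like "[संगीत]" are always wholly enclosed in square brackets, and the helper name is illustrative.

```python
# Sketch: drop bracketed annotation markers such as "[संगीत]" from caption text.
import re

BRACKETED = re.compile(r"\[[^\]]*\]")

def strip_bracketed(text):
    """Remove [ ... ] spans and collapse any leftover double spaces."""
    return re.sub(r"\s{2,}", " ", BRACKETED.sub("", text)).strip()

assert strip_bracketed("[संगीत]") == ""
assert strip_bracketed("जय हनुमान [संगीत] ज्ञान गुन सागर") == "जय हनुमान ज्ञान गुन सागर"
```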

Possible Dangers:

  1. It seems like bhajans will repeat certain stanzas; not sure how to account for that yet [for the mapping part]

ks0m1c commented 9 months ago

Further thoughts on specific observations:

1) Instrumentals are great to know about: they tell us when to space out, buffer, and span out the timestamps.
2) Lack of chronological uniqueness can be compensated for with our error correction: look for the first candidate event for a verse and span the verse's timestamps until the next deviation in time (see the sketch after the Dangers list below).
3) Error-correcting is probably the name of the game for this artform.

Dangers:

1) When there are repeating verses detected in the audio snippets, resolve them to the same verse number, so that verse-to-timestamp is a 1:many relationship (sketched below).
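A rough sketch of the span-and-repeat idea, assuming each verse has already been matched to one or more candidate event indices (the matching itself is discussed further down); the names and data shapes here are illustrative, not settled.

```python
# Sketch: turn matched event indices into timestamp spans per verse.
# A verse matched at several places (repeated stanzas) keeps every span,
# so the verse -> timestamps mapping is deliberately 1:many.
from collections import defaultdict

def spans_for_verses(matches, events):
    """
    matches: list of (verse_number, event_index) pairs, in caption order.
    events:  list of {"start_ms": int, "end_ms": int, ...} caption events.
    Returns {verse_number: [(start_ms, end_ms), ...]}.
    """
    spans = defaultdict(list)
    for pos, (verse, event_idx) in enumerate(matches):
        start = events[event_idx]["start_ms"]
        if pos + 1 < len(matches):
            # span this verse until the next matched event begins
            end = events[matches[pos + 1][1]]["start_ms"]
        else:
            end = events[event_idx]["end_ms"]
        spans[verse].append((start, end))
    return dict(spans)
```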

rtshkmr commented 9 months ago

Mapping Scraped Data to Autogenerated Data

The goal here is to map the fixed-structured, perfect data to the correct part of the .srt file. Since we are working with .srt files, which naturally already have events that are indexed, a first pass is to just map each verse to the correct event index.

In order to do this, we need to do some sentence similarity matching. The underlying idea would be to use some sort of edit distance (e.g. Levenshtein) or other similarity metrics like Jaccard or cosine distance.
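A hand-rolled sketch of that first pass, using plain Levenshtein distance to pick the best event index for each verse; it assumes both sides are already cleaned strings, and the function names are illustrative rather than what the eventual script uses.

```python
# Sketch: map a scraped verse to the caption event whose text is closest by edit distance.
# Hand-rolled Levenshtein so no external library is assumed.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_event_for_verse(verse_text, event_texts):
    """Return (event_index, distance) for the caption event most similar to the verse."""
    distance, idx = min((levenshtein(verse_text, text), i) for i, text in enumerate(event_texts))
    return idx, distance
```

The test listed below (scraped verse 5 against event 46) is essentially this argmin over the event texts.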

After a quick check, it's better to use a recent library that supports indic texts for this first pass.

Here are some candidates:

  1. libindic/soundex
  2. inltk -- it's more recently updated and the library's stats are a little better.

unrelated but interesting stuff found:

  1. there's a whole AI for bharat thing going on here -- https://ai4bharat.iitm.ac.in/areas/transliteration/

Tests:

  1. map scraped verse number 5 to event 46 in the captions [screenshot]
rtshkmr commented 9 months ago

conclusion [1]: inltk is likely a dead end, see adventures below

macOS fails, and Linux seems to have an outdated package name 😢

![image](https://github.com/ve1ld/vyasa/assets/38996397/533d3ac4-833f-478a-8521-7a57eb06e589)

Even after using scikit-learn, the underlying dep needs a Python version below 3.10 because of a core lib deprecation.

![image](https://github.com/ve1ld/vyasa/assets/38996397/fa316ff8-576c-4b83-969d-ee9b7708f9a9)

Changed my Python to 3.9.* using pyenv and it seems to somewhat work w.r.t. the deps, buttttt the pytorch code has issues and so we get a runtime error again...

![image](https://github.com/ve1ld/vyasa/assets/38996397/f8344184-1a62-4297-8708-10f468193eeb)

At this point, I should just write it out myself and just use edit distance as a measure... in the case of inltk, it seems like they're actually just using fastai to train their model...

![image](https://github.com/ve1ld/vyasa/assets/38996397/a6a57387-87f4-4ffc-aa61-2b0d4a81c5b7)

conclusion [2]: libindic works at least, but not too well

ref this thread: https://github.com/ve1ld/vyasa/pull/22#discussion_r1456698940

ks0m1c commented 9 months ago

In terms of supporting keyboard shortcuts, Livebook serves as a great example.

[screenshot]

demonstrated through a declarative shortcuts schema and component:

https://github.com/livebook-dev/livebook/blob/72229cc2f134b7b0e9f06368e3f16a6f9fef0835/lib/livebook_web/live/session_live/shortcuts_component.ex#L207