karaoke-dev / LrcParser

A package to encode and decode the .lrc format.
MIT License
3 stars 3 forks source link

Parse fails when files have multiple timestamps for single line of text #40

Open 1hitsong opened 1 year ago

1hitsong commented 1 year ago

When attempting to parse a lyric file that has the following line, the parser failed.

[00:51.00][01:29.99][01:48.29][02:31.00][02:41.99]You gotta fight !

The expected outcome was for the parser to use the text "You gotta fight !" at each of the five time stamps.

andy840119 commented 1 year ago

Thanks for report the issue! This issue will be fixed in the #41 and make a new release soon ๐ŸŽ‰ .

1hitsong commented 1 year ago

Awesome! Thanks for hopping on this so quickly

Looking at the PR for the fix, would this result in the output being option 1 or option 2?

Option 1 [00:51.00]You gotta fight !

or Option 2 [00:51.00]You gotta fight ! [01:29.99]You gotta fight ! [01:48.29]You gotta fight ! [02:31.00]You gotta fight ! [02:41.99]You gotta fight !

andy840119 commented 1 year ago

The result will be the option 1.

andy840119 commented 1 year ago

And note that new NuGet package has been released!

see: https://www.nuget.org/packages/LrcParser

1hitsong commented 1 year ago

The result will be the option 1.

Cool. Just making sure I'm following.

I've seen a lot of sites that use the multiple timestamps in a way that means option 2, but If that's not to the LRC standard, then option 1 makes the most sense.

andy840119 commented 1 year ago

Guess this issue is fixed ๐Ÿš€

Shadowghost commented 7 months ago

@andy840119 in my opinion this is not the correct fix for the issue. Implementing Option 2 mentioned above would make the lib adhere to the basic LRC format. Multiple timestamps on the same line should result in that line appearing on those timestamp. Dropping all except the first timestamp removes quite any chorus lyrics from songs after the first occurrence.

andy840119 commented 7 months ago

https://en.wikipedia.org/wiki/LRC_(file_format) Seems there's two usage for .lrc format

One is for sceeen text:

[00:21.10][00:45.10]Repeating lyrics (e.g. chorus)

Another one is for karaoke format:

[mm:ss.xx] <mm:ss.xx> line 1 word 1 <mm:ss.xx> line 1 word 2 <mm:ss.xx> ... line 1 last word <mm:ss.xx>

But (as i remember) some software might export the .lrc with those format as well:

[mm:ss.xx] [mm:ss.xx] line 1 word 1 [mm:ss.xx] line 1 word 2 [mm:ss.xx] ... line 1 last word [mm:ss.xx]

Also, did you have idea for handling those cases? And will you willing to make the PR for that because currently I'm focusing on another project๐Ÿ˜ข Might have no enough time to solve this issue.

https://suwa.pupu.jp/RhythmicaLyrics.html This is the software I made the .lrc file before.

Shadowghost commented 7 months ago

I already took a look at the codebase and it likely requires some API changes. I'll try to come up with a PR over the weekend.

andy840119 commented 7 months ago

I already took a look at the codebase and it likely requires some API changes. I'll try to come up with a PR over the weekend.

I think in this case it should not repeat:

[mm:ss.xx][mm:ss.xx] line 1 word 1 [mm:ss.xx] line 1 word 2 [mm:ss.xx] ... line 1 last word [mm:ss.xx]

Did you have any idea for handle this case? Or can you describe how will you do for this case in order to reduce the review time?

Api change is OK for me, but would be better to explain why ๐Ÿ‘Œ๐Ÿผ

Shadowghost commented 6 months ago

This is taking abit longer than expected, but what I did up until now:

Looking at your post above I've got a question: According to wikipedia karaoke format is like this (duration infront of word)

[mm:ss.xx] <mm:ss.xx> line 1 word 1 <mm:ss.xx> line 1 word 2 <mm:ss.xx> ... line 1 last word

But according to your example it can also be this:

[mm:ss.xx] <mm:ss.xx> line 1 word 1 <mm:ss.xx> line 1 word 2 <mm:ss.xx> ... line 1 last word <mm:ss.xx>

What are the timestamps in this case for?

Chaphasilor commented 6 months ago

Just wanted to chime it that I had a user whose files contained lines in the format of [ms:ss.xxx]text, so with three decimals for milliseconds and no space between timestamp and text. They were not parsed correctly with Jellyfin 10.9, and getting rid of the third decimal place fixed the issue.
I'm guessing this is also an issue with this library, and could easily be fixed as part of these improvements :)

andy840119 commented 6 months ago

https://github.com/karaoke-dev/sample-beatmap/blob/master/v1/%E6%9E%AF%E3%82%8C%E3%81%AA%E3%81%84%E8%8A%B1/lyric.lrc @Shadowghost this is the sample .lrc created by RhythmicaLyrics, a tool to create karaoke format .lrc file.

I' not sure if official .lrc format allows [mm:ss.xx] at the end of the text, but you can see the japanese karoake song and last time at the end of the lyric might means it's the time blue area covers the whole lyric.

Shadowghost commented 6 months ago

@Chaphasilor already took care of that by extending the regex.

@andy840119 seems like we need to differentiate between these layouts:

  1. Core LRC: Single timestamp at the beginning

    [mm:ss.xx] text
  2. Core LRC: Multiple timestamps at the beginning (resulting in multiple appearances of the line)

    [mm:ss.xx][mm:ss.xx][mm:ss.xx] text
  3. Enhanced LRC (A2): Single timestamp at the beginning, followed by others denoted by <...> - within them is the DURATION, therefore NO need for a timestamp at the end

    [mm:ss.xx] <mm:ss.xx> line 1 word 1 <mm:ss.xx> line 1 word 2 <mm:ss.xx> ... line 1 last word
  4. Enhanced LRC (RythmicaLyrics): Single timestamp at the beginning, followed by others denoted by [...] - within them are ABSOLUTE timestamps, therfore we need an end timestamp to know where the displaying of the last character ends

    [mm:ss.xx]line 1 word 1 [mm:ss.xx] line 1 word 2 [mm:ss.xx] ... line 1 last word [mm:ss.xx]

My plan is to convert all of these into the following object:

TextIndex:
    public int Index { get; }
    public int? Time { get; }
    public IndexState State { get; }

IndexState:
    Start
    End

Always using the full timestamp for time and IndexState Start, for cases 1 and 2 we have no entry with IndexState End, for cases 3 and 4 we have

Shadowghost commented 6 months ago

Logic for knowing which of the above we are dealing with (based on regexing one line a time):

andy840119 commented 6 months ago

@Shadowghost

image Finally I got some times to check what's happening in the RythmicaLyrics This tool have two way to generate the file:

  1. save as(A)
  2. export(O)

image If press the save button, it can support save as different type of format, but the contents are the same:

[mm:ss.xx]line 1 word 1 [mm:ss.xx] line 1 word 2 [mm:ss.xx] ... line 1 last word [mm:ss.xx]

image And the save format match to the export as text2ass with .kar extension.

The conclusion:

  1. RythmicaLyrics does not actually support export the .lrc format๏ผŒthey only support another file format (.kar)
  2. I think should be better to rename the origin LrcParser into KarParser, and make new parser for the Core LRC and Enhanced LRC (A2)
andy840119 commented 6 months ago

Maybe it's a good start:

  1. Create a first PR to create the KarParser(just copy the conten/tests in the LrcParser)
  2. Create another PR to adjust the LrcParser, let it fully matched the Core LRC and Enhanced LRC (A2) spec.
Shadowghost commented 6 months ago

Fine for me, will do so. One last question: is the ruby tag stuff also KOR specific?

andy840119 commented 6 months ago

There's not much documentation about .kar file format and maybe it's not as popular as .lrc file

I think is's OK to treat it as official spec And ruby tag might not supported in the .lrc file

Reapor-Yurnero commented 4 months ago

Maybe it's a good start:

  1. Create a first PR to create the KarParser(just copy the conten/tests in the LrcParser)
  2. Create another PR to adjust the LrcParser, let it fully matched the Core LRC and Enhanced LRC (A2) spec.

Since quite a few other projects are relying on this project to parse LRC (as its name indicates), it makes sense to maintain the functionality of LRC parsing in this lib and move the kar part to a seperate one like .

BTW I was not able to find a PR on this created by @Shadowghost and wondering what's the current status of it. A service I'm using is relying on the above discussed items to be incorporated and would love to help!

andy840119 commented 4 months ago

hmmmm.... Guess i need to reserve some time on this library.

@Shadowghost are you still interested on fixing this issue? I'll take over this issue if you have no time to fix it ;_;

Shadowghost commented 4 months ago

Sorry, I've been busy over the last weeks, I can push what I have if you want to take over.

andy840119 commented 4 months ago

Sorry, I've been busy over the last weeks, I can push what I have if you want to take over.

Nice, you can give me the branch and i can cherry-pick if possible to use ๐Ÿ™๐Ÿผ

Reapor-Yurnero commented 4 months ago

Great! Being able to handle A2 extensions (word by word timestamps) and support millesecond timestamps [mm:ss.xxx] or \<mm:ss.xxx> in A2 (as mentioned also by @Chaphasilor) is the most desired feature from the downstream projects. Appreciate your help.

Shadowghost commented 4 months ago

@andy840119 there you go: https://github.com/Shadowghost/LrcParser/tree/extend Sadly it is very WIP, lmk if you have questions.

andy840119 commented 4 months ago

@1hitsong @Shadowghost @Reapor-Yurnero @Chaphasilor

Here's the new release follow the .lrc format in the wiki ๐ŸŽ‰ https://www.nuget.org/packages/LrcParser

Only few changes and should be easy to adjust https://github.com/karaoke-dev/LrcParser/blob/main/CHANGELOG.md

It's time to give it a try, report feedback is welcome โค๏ธ

Shadowghost commented 4 months ago

@andy840119 thanks for the update, working fine to far on my local tests. I do have one question though: What was the reason for you to not re-use the TimeTags for multiple occurrences?

andy840119 commented 4 months ago

What was the reason for you to not re-use the TimeTags for multiple occurrences?

What does that means? Using TimeTag class in the LrcLyric Or Allow have multiple time-tags between lyric text?

Shadowghost commented 4 months ago

I think I misunderstood what TimeTags is for. Why I'm mentioning it is this: Before it was possible to get LRC metadata tags like these:

[ti:Fortnight]
[ar:Taylor Swift/Post Malone]
[al:THE TORTURED POETS DEPARTMENT (Explicit)]
[offset:0]

by decoding the LRC (result would have all lines in the Lyrics object), ckecking the lines for TimeTags.Count == 0 and then parsing their text content. Now these lines are completely omitted and TimeTags.Count is always 0 if the LRC has timestamps. Only way to get them would be to parse the full LRC text.

We don't actually need them but it would be nice for completeness to have them available somewhere

andy840119 commented 4 months ago
[ti:Fortnight]
[ar:Taylor Swift/Post Malone]
[al:THE TORTURED POETS DEPARTMENT (Explicit)]
[offset:0]

That's called metadata, and i guess should make another parser for it (e.g. LrcMatadataParser in the LrcParser/Parser/Lrc/Lines/)

ckecking the lines for TimeTags.Count == 0 and then parsing their text content.

I'm not sure what it's means (sorry for my bad english understanding), but TimeTags in the Lyric.cs still needed if lyric contains the A2 extension formant

e.g.

[00:13.34] <00:14.32> Don't <00:14.73> you <00:15.14> want <00:15.57> somebody <00:16.09> to <00:16.46> love

Lyric's start time will be [00:13.34] And Lyric's first time-tag will be [00:13.34] + <00:14.32> in 2024.728.1

[update] Wait, i found a bug in this version. <mm:ss.xx> seems should be absolute time, not relative time Orz.

[update] https://www.nuget.org/packages/LrcParser New version released (2024.728.2)

Shadowghost commented 4 months ago

I'm not sure what it's means (sorry for my bad english understanding), but TimeTags in the Lyric.cs still needed if lyric contains the A2 extension formant

Yes, that's what I mean with I misunderstood their usage ๐Ÿ˜…

That's called metadata, and i guess should make another parser for it (e.g. LrcMatadataParser in the LrcParser/Parser/Lrc/Lines/)

Alternatively you could parse it in the existing parser and add a Metadata object to the parser result in addition to Lyrics - would preent us from reading the file two times if we want both.

Reapor-Yurnero commented 4 months ago

Personally I don't think it's very necessary to parse metadata in the lyrics. We can show them in the lyrics text but we don't need to parse it. Metadata should in the end be embedded in the music file itself rather than being double embedded in the lyrics.

andy840119 commented 4 months ago

Personally I don't think it's very necessary to parse metadata in the lyrics. We can show them in the lyrics text but we don't need to parse it.

I think the next step might let LrcParser able to prese metadata, for prevent treating metadata as lyric text. As your mention, song info is able to get from the music file itself, so there's no need to show in the lyric again.

Reapor-Yurnero commented 4 months ago

In fact, i think it does have its merit to have some metadata shown in the lyrics such as composer, lyrics writer etc. which are typically not part of the music metadata itself.

We may just skip these none timestamp brackets and do not parse them at all.

Shadowghost commented 4 months ago

Personally I don't think it's very necessary to parse metadata in the lyrics. We can show them in the lyrics text but we don't need to parse it. Metadata should in the end be embedded in the music file itself rather than being double embedded in the lyrics.

As your mention, song info is able to get from the music file itself, so there's no need to show in the lyric again.

Usually lyrics are not shared with the associated media file due to copyrights. Adding the metadata to the music file is therefore not really a valid argument. I'd even argue that lyrics without such metadata are kind of useless if different edits of a song exist.

We may just skip these none timestamp brackets and do not parse them at all.

That is already the case for synced lyrics.