`/get` endpoint does not return the best match to the query

snejus commented 5 days ago

Hi @tranxuanthang, following your suggestion under https://github.com/beetbox/beets/pull/5406 I now attempt to find matching lyrics using the /get endpoint, and only perform /search if they could not be found.

Thanks to the synced lyrics available in this database, the other day I added a lyrics display to my music widget. It depends on accurate timestamps, and I noticed that the lyrics are out of sync for some tracks.

One of them was this track:

Artist: Armin van Buuren Feat. Laura V
Title: Drowning (Avicii Remix)
Album: A State Of Trance Classics 14
Duration: 473.0

I checked and found its lyrics were fetched using the /get endpoint:

$ curl https://lrclib.net/api/get \
  --url-query artist_name="Armin van Buuren Feat. Laura V" \
  --url-query track_name="Drowning (Avicii Remix)" \
  --url-query album_name="A State Of Trance Classics 14" | 
  jq '{albumName, artistName, trackName, duration}'

{
  "albumName": "Mirage (The Remixes) [Bonus Tracks Edition]",
  "artistName": "Armin van Buuren feat. Laura V",
  "trackName": "Drowning - Avicii Remix",
  "duration": 472.0
}

Note that I receive the same data when I provide the duration field:

$ curl https://lrclib.net/api/get \
  --url-query artist_name="Armin van Buuren Feat. Laura V" \
  --url-query track_name="Drowning (Avicii Remix)" \
  --url-query album_name="A State Of Trance Classics 14" \
  --url-query duration=473 | 
  jq '{albumName, artistName, trackName, duration}'

{
  "albumName": "Mirage (The Remixes) [Bonus Tracks Edition]",
  "artistName": "Armin van Buuren feat. Laura V",
  "trackName": "Drowning - Avicii Remix",
  "duration": 472.0
}

When I perform the search for the artist and title

$ curl https://lrclib.net/api/search \
  --url-query artist_name="Armin van Buuren Feat. Laura V" \
  --url-query track_name="Drowning (Avicii Remix)" | 
  jq 'map({id, albumName, artistName, trackName, duration})' | 
  table

I see the following data

The lyrics I'm after are under id 12429604, and it seems like it should be the closest match to my query. I can provide more examples if required.

The results ranking algorithm I added in https://github.com/beetbox/beets/pull/5406 picks up the correct lyrics.

snejus commented 5 days ago

https://github.com/user-attachments/assets/598277b4-be15-4606-a51b-31646ae51c9e

That's the widget I mentioned, you can see how it depends on correct timestamps.

tranxuanthang commented 5 days ago

Your song album name is A State Of Trance Classics 14.

The track ID 12429604 album name is A State of Trance: Classics, Volume 14.

After normalization, these become "a state of trance classics 14" and "a state of trance classics volume 14". Because of the extra word "volume", LRCLIB doesn't consider ID 12429604 a match. It then retries without the album name, and you end up with ID 1029622.

The best way to resolve this in my opinion is resubmitting the correct lyrics for your song's metadata, for example with LRCGET:

1. Find your song Drowning (Avicii Remix) in the LRCGET song list, then use the search lyrics feature for this song
2. Apply the matching lyrics (ID 12429604)
3. Resubmit the lyrics by going to Lyrics Editor > Publish

snejus commented 5 days ago

How come it matches the album Mirage (The Remixes) [Bonus Tracks Edition] instead?

In addition to this, neither the track name nor the duration returned by the /get endpoint match the query. Meanwhile, there is a record in the database that matches them exactly.

I was wondering how the matching/comparison logic works internally: which fields are prioritised for the comparison?

tranxuanthang commented 5 days ago

How come it matches the album Mirage (The Remixes) [Bonus Tracks Edition] instead?

It just retries one more time, ignoring the album name parameter. The ID 1029622 is probably the first record that matches the criteria. The duration 472 vs 473 seconds is considered good enough (±2 seconds).

https://github.com/tranxuanthang/lrclib/blob/0f567bc66797b26c8d894486babb9798a251cd01/server/src/routes/get_lyrics_by_metadata.rs#L51-L54
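
To make the fallback described above easier to follow, here is a minimal Python sketch of a two-pass lookup. It is not the LRCLIB implementation (which is the Rust code linked above); the `Record` type, the function names, the record order, the normalized strings, and the 473.0 duration for ID 12429604 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

DURATION_TOLERANCE = 2.0  # seconds, per the "±2 seconds" mentioned above


@dataclass
class Record:
    id: int
    artist: str  # assumed to be stored already normalized
    title: str   # assumed to be stored already normalized
    album: str   # assumed to be stored already normalized
    duration: float


def find_first(records: List[Record], artist: str, title: str,
               album: Optional[str], duration: Optional[float]) -> Optional[Record]:
    """Return the first record that matches; album=None skips the album check."""
    for rec in records:
        if rec.artist != artist or rec.title != title:
            continue
        if album is not None and rec.album != album:
            continue
        if duration is not None and abs(rec.duration - duration) > DURATION_TOLERANCE:
            continue
        return rec
    return None


def get_lyrics(records: List[Record], artist: str, title: str,
               album: str, duration: Optional[float] = None) -> Optional[Record]:
    # First pass: the album has to match as well.
    match = find_first(records, artist, title, album, duration)
    if match is None:
        # Second pass: retry while ignoring the album name.
        match = find_first(records, artist, title, None, duration)
    return match


# Illustrative data only: the order and the second duration are assumed.
records = [
    Record(1029622, "armin van buuren feat laura v", "drowning avicii remix",
           "mirage the remixes bonus tracks edition", 472.0),
    Record(12429604, "armin van buuren feat laura v", "drowning avicii remix",
           "a state of trance classics volume 14", 473.0),
]
result = get_lyrics(records, "armin van buuren feat laura v",
                    "drowning avicii remix", "a state of trance classics 14", 473.0)
print(result.id if result else None)
```

With this data the first pass finds nothing because neither album string equals the query, and the second pass returns whichever record comes first among those within the duration tolerance.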

snejus commented 5 days ago

I was aware of the duration comparison, but it's surprising to me that the difference in the trackName is ignored, since my query is Drowning (Avicii Remix) but it returns Drowning - Avicii Remix.

Do you reckon we could prioritize exact matches here?

snejus commented 5 days ago

I would be more than happy to contribute!

tranxuanthang commented 5 days ago

Meanwhile, there is a record in the database that matches them exactly.

Unfortunately it is not really exact, because of the extra word "volume".

LRCLIB doesn't deduplicate the metadata. It is a very difficult problem that also requires contributions from the community, and someone else already does this better (MusicBrainz). Even if it could, there might still be minor syncing issues because of differences between CD rips and music downloaded from digital/streaming platforms.

I know it sucks, I hate the fact that there are usually multiple duplicated lyrics records for the same song in LRCLIB. But this issue is almost impossible to resolve.

I was aware of the duration comparison, but it's surprising to me that the difference in the trackName is ignored, since my query is Drowning (Avicii Remix) but it returns Drowning - Avicii Remix.

All of the strings are normalized (converted to lowercase, with special characters removed and accents stripped from accented characters). In your case:

Drowning (Avicii Remix) will become drowning avicii remix
Drowning - Avicii Remix will become drowning avicii remix

So they are considered an exact match.

The part of the code that does the normalization is here:

https://github.com/tranxuanthang/lrclib/blob/0f567bc66797b26c8d894486babb9798a251cd01/server/src/utils.rs#L7-L20
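
For readers who don't want to open the Rust source, the normalization described above can be approximated in a few lines of Python. This is a rough sketch, not a port of the linked code: the function name `normalize` is hypothetical, and replacing special characters with a space (rather than deleting them) and collapsing whitespace are assumptions.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase, then replace anything that is not a letter, digit or
    # whitespace with a space (the underscore counts as special here too).
    cleaned = re.sub(r"[^\w\s]|_", " ", without_accents.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())


print(normalize("Drowning (Avicii Remix)"))
print(normalize("Drowning - Avicii Remix"))
print(normalize("A State of Trance: Classics, Volume 14"))
```

Under these assumptions both track titles come out identical, while the two album names from this thread still differ by the word "volume".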

tranxuanthang commented 5 days ago

I would be more than happy to contribute!

I'd love to have your contribution! But we need to come to an agreement on the best way to address this first.

snejus commented 5 days ago

This makes a lot of sense.

My last hope, then, is the duration: given that the normalised artist and track names are the same, could we prioritize results that match the duration exactly?
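
As a rough illustration of that idea (not existing LRCLIB behaviour), the sketch below keeps the ±2 second tolerance but prefers the candidate whose duration is closest to the query. The function name `pick_closest_duration` and the dictionary shape are hypothetical, and the 473.0 duration for ID 12429604 is assumed from the discussion above.

```python
from typing import Dict, List, Optional


def pick_closest_duration(candidates: List[Dict], duration: float,
                          tolerance: float = 2.0) -> Optional[Dict]:
    # Keep only the candidates inside the accepted tolerance window.
    within = [c for c in candidates if abs(c["duration"] - duration) <= tolerance]
    # Prefer the smallest difference; min() returns the first such item
    # on ties, so the original order still acts as a tie-breaker.
    return min(within, key=lambda c: abs(c["duration"] - duration), default=None)


# Candidates that already match on normalized artist and track name.
candidates = [
    {"id": 1029622, "duration": 472.0},
    {"id": 12429604, "duration": 473.0},  # assumed duration
]
best = pick_closest_duration(candidates, 473.0)
print(best["id"] if best else None)
```

An exact duration match always has a difference of zero, so it wins whenever one exists; otherwise the behaviour degrades to "closest within tolerance" rather than "first within tolerance".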

tranxuanthang / lrclib

`/get` endpoint does not return the best match to the query #26