unslothai / unsloth-zoo

Utils for Unsloth
GNU Affero General Public License v3.0
4 stars 3 forks source link

Fix longest common substring implementation #4

Closed rlrs closed 2 days ago

rlrs commented 5 days ago

This should fix https://github.com/unslothai/unsloth/issues/1015

The issue was that the existing LCS implementation would find a leading comma, leading it to call int(''). Since I was performing this fix anyways, I took the liberty of implementing it a bit cleaner. Instead of finding the LCS of a string repr of a list of tokens, we just find the least common sublist of tokens directly. The string/int conversions are not necessary anymore.

I also added a very simple sanity test at the bottom of the file.

danielhanchen commented 3 days ago

Thanks a lot - will take a look!!

danielhanchen commented 2 days ago

@rlrs Thanks a lot! This definitely is much better than my previous impl :) I noticed that adding substring = [int(x) for x in substring if x.isdigit()] would have solved the issue, but your impl definitely doesn't need to perform in the "string" space (ie via the longest common substring).

I left the old impl just in case I might have to refer back to it!